Parquet add memory limit #10

Open
maoueh opened this issue Nov 12, 2024 · 0 comments


maoueh commented Nov 12, 2024

Right now all the Parquet bundling is done in memory, so we need to keep the last N blocks of data resident. This works for small to medium Substreams, but for heavy ones, like massive token crawling, memory usage will grow to whatever the machine is able to handle.

There should be a defined memory budget, and when the approximate size of the rows held in memory exceeds it, we should write a temporary buffer out to a scratch space.

The library we use already has this feature for writing big data sets. While reviewing it, however, I found it is an all-or-nothing approach, whereas I would prefer a fixed memory budget: segments that fit in memory would avoid I/O completely, while big jobs or smaller machines could tweak the allocated space so that disk is only used when capacity is exceeded.

Ref #9
