Parquet add memory limit #10

Open
maoueh opened this issue Nov 12, 2024 · 0 comments


maoueh commented Nov 12, 2024

Right now all the Parquet bundling is done in memory, so we need to keep the last N blocks of data resident. This works for small to medium Substreams, but for heavy ones, like massive token crawling, memory usage will grow to whatever the machine is able to handle.

There should be a defined memory budget, and when the approximate size of the rows held in memory exceeds it, we should write a temporary buffer out to a scratch space.

The library we use already has this feature for writing big data sets. While reviewing it, however, I found it is an all-or-nothing approach, whereas I would prefer a fixed memory budget: segments that fit in memory would avoid I/O completely, while big jobs or smaller machines could tweak the allocated space so that disk is only used when capacity is exceeded.

Ref #9
