Right now all the Parquet bundling is done in memory, so we need to keep the last N blocks of data resident. This works for small to medium Substreams, but for heavy ones, like massive token crawling, memory usage keeps growing until it exceeds what the machine can handle.
We should define a memory budget: when the approximate size of the rows held in memory exceeds that budget, we write a temporary buffer to a scratch space on disk.
The library we use already has a feature for writing large data sets this way. Reviewing it, however, it is an all-or-nothing approach, whereas I would prefer a fixed memory budget: segments that fit in memory would avoid I/O completely, while big jobs or smaller machines could tweak the allocated space and spill to disk only when exceeding capacity.
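Below is a minimal sketch of the fixed-memory idea in Go, independent of any particular Parquet library. The names `Row`, `estimateSize`, `writeParquet`, and `SpillingBuffer` are hypothetical placeholders for illustration, not the sink's actual types or API: rows accumulate in memory, and once the estimated size crosses the configured budget the current batch is flushed as a scratch segment that would be merged into the final bundle later.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Row is a hypothetical stand-in for one extracted table row.
type Row struct {
	Data []byte
}

// estimateSize approximates the in-memory footprint of a row.
func estimateSize(r Row) uint64 { return uint64(len(r.Data)) }

// writeParquet is a placeholder: the real sink would encode `rows`
// with its Parquet library here. This sketch only writes a marker file.
func writeParquet(path string, rows []Row) error {
	return os.WriteFile(path, []byte(fmt.Sprintf("%d rows", len(rows))), 0o644)
}

// SpillingBuffer keeps rows in memory up to maxBytes, then flushes the
// current batch as a temporary Parquet "segment" under scratchDir.
// Small segments never touch disk; big jobs spill as needed.
type SpillingBuffer struct {
	maxBytes   uint64
	usedBytes  uint64
	scratchDir string
	rows       []Row
	segments   []string // spilled segment paths, merged at bundle time
}

func NewSpillingBuffer(maxBytes uint64, scratchDir string) *SpillingBuffer {
	return &SpillingBuffer{maxBytes: maxBytes, scratchDir: scratchDir}
}

// Append adds a row and spills to scratch space if the budget is exceeded.
func (b *SpillingBuffer) Append(r Row) error {
	b.rows = append(b.rows, r)
	b.usedBytes += estimateSize(r)
	if b.usedBytes > b.maxBytes {
		return b.spill()
	}
	return nil
}

// spill writes the buffered rows to a scratch segment and resets the buffer.
func (b *SpillingBuffer) spill() error {
	path := filepath.Join(b.scratchDir, fmt.Sprintf("segment-%04d.parquet", len(b.segments)))
	if err := writeParquet(path, b.rows); err != nil {
		return fmt.Errorf("spilling %d rows to %q: %w", len(b.rows), path, err)
	}
	b.segments = append(b.segments, path)
	b.rows = b.rows[:0]
	b.usedBytes = 0
	return nil
}

func main() {
	// 64 MiB cap as an example; the budget would be tunable per machine.
	buf := NewSpillingBuffer(64<<20, os.TempDir())
	_ = buf.Append(Row{Data: []byte("example")})
}
```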
Ref #9