Efficiency trade-offs in state network archive storage approaches #288

Open
pipermerriam opened this issue Apr 18, 2024 · 2 comments

@pipermerriam (Member)

I have a design idea that I want to explore...

There are two main approaches to archive nodes that live on different ends of an efficiency spectrum.

The naive approach: store every trie node as-is, which results in an efficiency loss due to duplication. In the MPT we get something like 15/16ths waste/duplicate data in the worst case, since a single leaf change rewrites each branch node along its path while leaving 15 of that node's 16 child hashes unchanged.
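To make the 15/16ths figure concrete, here's a toy back-of-the-envelope in Python (the fan-out is the MPT's real branching factor; everything else is illustrative, not measured):

```python
# Rough illustration of duplication in a naive MPT archive store.
# A branch node holds up to 16 child hashes; when one leaf changes,
# every branch node on its path is re-stored with only one child
# hash differing from the previous version.

CHILDREN_PER_BRANCH = 16

def duplicated_fraction(changed_children: int = 1) -> float:
    """Fraction of a rewritten branch node that is copied unchanged."""
    return (CHILDREN_PER_BRANCH - changed_children) / CHILDREN_PER_BRANCH

print(duplicated_fraction())  # 0.9375 -> the "15/16ths" worst case
```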

The "flat" approach which I don't know in intimate detail but IIUC it involves storing the state as a series of diffs. In this model I believe you gain efficiency in storage in exchange for additional compute cost as well as some amount of database index overhead for efficient retrieval. The other complexity in this approach in the portal network context is that it's unclear whether this storage model of a "series of diffs" can be easily cryptographically anchored.

My question/curiosity is whether it's possible for portal to take advantage of this storage model to gain some efficiency in our archive storage. This would be beneficial for MPT-based archive data and might be critical for Verkle. The loose structure would be some kind of "epoch" storage, where a node stores all of the state changes for a section of the trie across a specific epoch of time. This is something I'd like to spend time on a whiteboard with someone on.
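As a strawman for what that "epoch" storage unit might look like as portal content, here's a hypothetical content key (field names, serialization, and layout are all made up for discussion, not a spec proposal):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class EpochStateDiffKey:
    """Hypothetical content key: all state changes within one subtree
    of the trie across one epoch of blocks."""
    epoch: int          # epoch number; epoch length is the open question
    trie_prefix: bytes  # nibble path selecting the trie section

    def content_id(self) -> bytes:
        # Strawman: hash of a trivial serialization, for DHT placement.
        return hashlib.sha256(
            self.epoch.to_bytes(8, "big") + self.trie_prefix
        ).digest()
```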

@pipermerriam (Member, Author)

So in theory there's an epoch length at which this trade-off makes sense:

  • each epoch starts with a snapshot, which is inefficient in that it duplicates all trie data that hasn't been touched.
  • each epoch then stores only state diffs, which are efficient in that they avoid re-storing up to 15/16ths of duplicate data for each trie node that changes.

The open research question is at what point these two competing forces cross, i.e. where the efficiency of one outpaces the inefficiency of the other.

This also wouldn't be constant: over time the total amount of untouched cold state should grow, which means epoch lengths would need to get longer and longer to maintain storage efficiency.
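To frame the crossover, a toy cost model (every parameter here is an assumption to be replaced with measured data):

```python
def epoch_storage_cost(snapshot_bytes: float,
                       diff_bytes_per_block: float,
                       epoch_length: int) -> float:
    """Total bytes for one epoch: a full snapshot plus per-block diffs."""
    return snapshot_bytes + diff_bytes_per_block * epoch_length

def break_even_epoch_length(snapshot_bytes: float,
                            diff_bytes_per_block: float,
                            naive_bytes_per_block: float) -> float:
    """Epoch length beyond which snapshot+diffs beats naive storage.

    snapshot + diffs * n < naive * n   =>   n > snapshot / (naive - diffs)
    """
    saving_per_block = naive_bytes_per_block - diff_bytes_per_block
    if saving_per_block <= 0:
        raise ValueError("diffs must be cheaper per block than naive storage")
    return snapshot_bytes / saving_per_block
```

Note that as cold state accumulates, `snapshot_bytes` grows while `diff_bytes_per_block` stays roughly flat, so the break-even epoch length drifts upward over time, matching the point above.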

And I suspect we'll need to introduce an ERA-style format for this data and pre-compute it, since we're talking about long epochs of state changes and there's probably no cheap/incremental way to prove whatever the new efficient storage format for this data turns out to be. This would be akin to how we created the pre-merge accumulators for the header data: we'd probably need a similar data structure that allows easy provability for this old/cold state data.
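As a sketch of what pre-computed provability could mean, here's a plain binary Merkle root over serialized epoch-diff chunks (the hashing scheme and chunking are assumptions; an SSZ-based structure like the header accumulator would be the more likely real choice):

```python
import hashlib

def merkle_root(chunks: list[bytes]) -> bytes:
    """Binary Merkle root over serialized epoch-diff chunks.

    Computed once per epoch (ERA-style), this gives a single anchor
    against which any individual chunk can be proven with a
    log-sized branch, analogous in spirit to the pre-merge header
    accumulator.
    """
    if not chunks:
        return hashlib.sha256(b"").digest()
    layer = [hashlib.sha256(c).digest() for c in chunks]
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # duplicate last node on odd layers
        layer = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
                 for i in range(0, len(layer), 2)]
    return layer[0]
```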

@pipermerriam (Member, Author)

I think Erigon may have already done a lot of the work for us:

https://github.com/ledgerwatch/erigon/blob/main/docs/programmers_guide/db_walkthrough.MD

That document gives a good picture of what a diff-based approach would look like. There's still a lot of work to figure out how to map it onto our protocol.
