Efficiency trade-offs in state network archive storage approaches #288

Open
pipermerriam opened this issue Apr 18, 2024 · 2 comments

@pipermerriam (Member)

I have a design idea that I want to explore...

There are two main approaches to archive nodes that live on different ends of an efficiency spectrum.

The naive approach: store every trie node as-is, which results in an efficiency loss due to duplication. In the MPT we get something like 15/16ths waste/duplicate data in the worst case, since a single leaf change rewrites each branch node along its path while leaving 15 of that node's 16 child hashes unchanged.
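To make the 15/16ths figure concrete, here's a toy back-of-the-envelope in Python (the fan-out is the MPT's real branching factor; everything else is illustrative, not measured):

```python
# Rough illustration of duplication in a naive MPT archive store.
# A branch node holds up to 16 child hashes; when one leaf changes,
# every branch node on its path is re-stored with only one child
# hash differing from the previous version.

CHILDREN_PER_BRANCH = 16

def duplicated_fraction(changed_children: int = 1) -> float:
    """Fraction of a rewritten branch node that is copied unchanged."""
    return (CHILDREN_PER_BRANCH - changed_children) / CHILDREN_PER_BRANCH

print(duplicated_fraction())  # 0.9375 -> the "15/16ths" worst case
```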

The "flat" approach which I don't know in intimate detail but IIUC it involves storing the state as a series of diffs. In this model I believe you gain efficiency in storage in exchange for additional compute cost as well as some amount of database index overhead for efficient retrieval. The other complexity in this approach in the portal network context is that it's unclear whether this storage model of a "series of diffs" can be easily cryptographically anchored.

My question/curiosity is whether it's possible for portal to take advantage of this storage model to gain some efficiency in our archive storage. This would be beneficial for MPT-based archive data and might be critical for Verkle. The loose structure would be some kind of "epoch" storage, where a node stores all of the state changes for a section of the trie across a specific epoch of time. This is something I'd like to spend time on a whiteboard with someone on.
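As a strawman for what that "epoch" storage unit might look like as portal content, here's a hypothetical content key (field names, serialization, and layout are all made up for discussion, not a spec proposal):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class EpochStateDiffKey:
    """Hypothetical content key: all state changes within one subtree
    of the trie across one epoch of blocks."""
    epoch: int          # epoch number; epoch length is the open question
    trie_prefix: bytes  # nibble path selecting the trie section

    def content_id(self) -> bytes:
        # Strawman: hash of a trivial serialization, for DHT placement.
        return hashlib.sha256(
            self.epoch.to_bytes(8, "big") + self.trie_prefix
        ).digest()
```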

@pipermerriam (Member, Author)

So in theory there's an epoch length at which this trade-off makes sense:

  • each epoch starts with a snapshot, which is inefficient in that it duplicates all trie data that hasn't been touched.
  • each epoch then stores only state diffs, which are efficient in that they avoid re-storing up to 15/16ths of duplicate data for each trie node that changes.

The open research question is at what point these two competing forces cross, i.e. where the efficiency of one outpaces the inefficiency of the other.

This also wouldn't be constant: over time the total amount of untouched cold state should grow, which means epoch lengths would need to get longer and longer to maintain storage efficiency.
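To frame the crossover, a toy cost model (every parameter here is an assumption to be replaced with measured data):

```python
def epoch_storage_cost(snapshot_bytes: float,
                       diff_bytes_per_block: float,
                       epoch_length: int) -> float:
    """Total bytes for one epoch: a full snapshot plus per-block diffs."""
    return snapshot_bytes + diff_bytes_per_block * epoch_length

def break_even_epoch_length(snapshot_bytes: float,
                            diff_bytes_per_block: float,
                            naive_bytes_per_block: float) -> float:
    """Epoch length beyond which snapshot+diffs beats naive storage.

    snapshot + diffs * n < naive * n   =>   n > snapshot / (naive - diffs)
    """
    saving_per_block = naive_bytes_per_block - diff_bytes_per_block
    if saving_per_block <= 0:
        raise ValueError("diffs must be cheaper per block than naive storage")
    return snapshot_bytes / saving_per_block
```

Note that as cold state accumulates, `snapshot_bytes` grows while `diff_bytes_per_block` stays roughly flat, so the break-even epoch length drifts upward over time, matching the point above.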

And I suspect we'll need to introduce an ERA-style format for this data and pre-compute it, since we're talking about long epochs of state changes and there's probably no cheap/incremental way to prove whatever the new efficient storage format for this data turns out to be. This would be akin to how we created the pre-merge accumulators for the header data: we'd probably need a similar data structure that allows easy provability for this old/cold state data.
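As a sketch of what pre-computed provability could mean, here's a plain binary Merkle root over serialized epoch-diff chunks (the hashing scheme and chunking are assumptions; an SSZ-based structure like the header accumulator would be the more likely real choice):

```python
import hashlib

def merkle_root(chunks: list[bytes]) -> bytes:
    """Binary Merkle root over serialized epoch-diff chunks.

    Computed once per epoch (ERA-style), this gives a single anchor
    against which any individual chunk can be proven with a
    log-sized branch, analogous in spirit to the pre-merge header
    accumulator.
    """
    if not chunks:
        return hashlib.sha256(b"").digest()
    layer = [hashlib.sha256(c).digest() for c in chunks]
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # duplicate last node on odd layers
        layer = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
                 for i in range(0, len(layer), 2)]
    return layer[0]
```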

@pipermerriam (Member, Author)

I think Erigon may have already done a lot of the work for us:

https://github.com/ledgerwatch/erigon/blob/main/docs/programmers_guide/db_walkthrough.MD

That document gives a good picture of what a diff-based approach would look like. There's still a lot of work to figure out how to map it onto our protocol.
