Doc update (#717)
rkuris authored Sep 6, 2024
1 parent 56decb8 commit 734cfe2
Showing 1 changed file with 50 additions and 126 deletions.
176 changes: 50 additions & 126 deletions firewood/src/lib.rs
@@ -15,20 +15,13 @@
//! a very fast storage layer for the EVM but could be used on any blockchain that
//! requires authenticated state.
//!
//! Firewood only attempts to store the latest state on disk and will actively clean up
//! unused state when state diffs are committed. To avoid reference counting trie nodes,
//! Firewood does not copy-on-write (COW) the state trie; instead, it keeps
//! one latest version of the trie index on disk and applies in-place updates to it.
//! Firewood keeps a configurable number of previous states in memory to power
//! state sync (which may occur at a few roots behind the current state).
//!
//! Firewood provides OS-level crash recovery via a write-ahead log (WAL). The WAL
//! guarantees atomicity and durability for the database, but also offers
//! “reversibility”: a portion of the old WAL can optionally be kept around to
//! allow a fast in-memory rollback that recovers past versions of the entire
//! store. While the store is running, new changes also contribute to the
//! configured window of changes (at batch granularity), so any past version
//! within that window can be accessed at no additional cost.
//! Firewood only attempts to store recent revisions on disk and will actively clean up
//! unused older revisions when state diffs are committed. The number of revisions retained is
//! configured when the database is opened.
//!
//! Firewood provides OS-level crash recovery, but not machine-level crash recovery. That is,
//! if the Firewood process crashes, the OS will flush the cache and leave the system in a valid state.
//! No protection is (currently) offered to handle machine failures.
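//!
//! As a concrete illustration, opening a database with a bounded revision history might look
//! roughly like the following. The configuration type, field name, and `open` call here are
//! illustrative assumptions, not the exact Firewood API:
//!
//! ```rust,ignore
//! // Hypothetical configuration; the real type and field names may differ.
//! let cfg = DbConfig {
//!     max_revisions: 128, // how many recent revisions remain accessible
//!     ..DbConfig::default()
//! };
//! let db = Db::open("firewood.db", cfg)?;
//! ```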
//!
//! # Design Philosophy & Overview
//!
@@ -62,124 +55,55 @@
//! benefit from such a design.
//!
//! In Firewood, we take a closer look at the second regime and have come up with a simple but
//! robust architecture that fulfills the need for such blockchain storage.
//! robust architecture that fulfills the need for such blockchain storage. However, Firewood
//! can also efficiently handle the first regime.
//!
//! ## Storage Model
//!
//! Firewood is built from three layers of abstraction that completely decouple the
//! layout/representation of the data on disk from the actual logical data structure it retains:
//!
//! - Linear, memory-like store: the `shale` crate offers a `CachedStore` abstraction for a
//! (64-bit) byte-addressable store that hides the machinery which actually persists
//! the in-memory data to the secondary storage medium (e.g., a hard drive). The implementor of `CachedStore`
//! provides the functions that give its user the illusion of operating on a
//! byte-addressable memory store: a "magical" array of bytes that can be viewed and changed
//! and is mirrored to disk. In reality, the linear store is chunked into files under a
//! directory, but the user does not even need to know about this.
//!
//! - Persistent item storage stash: `CompactStore` in `shale` defines a pool of typed objects that are
//! persisted on disk but also made accessible in memory transparently. It is built on top of `CachedStore`
//! and defines how "items" of a given type are laid out, allocated and recycled throughout their lifecycles.
//!
//! - Data structure: in Firewood, one trie is maintained by invoking `CompactStore` (see `src/merkle.rs`).
//! The data structure code is totally unaware of how its objects (i.e., nodes) are organized or
//! persisted on disk. It is as if they're just in memory, which makes it much easier to write
//! and maintain the code.
//!
//! Given this abstraction, it is easy to see that the actual data that determine the
//! state of the data structure (the trie) are exactly what the linear store (`CachedStore`) keeps track of:
//! a flat but conceptually large byte vector. In other words, given a valid byte vector as the
//! content of the linear store, the higher-level data structure is *uniquely* determined; there
//! is nothing more to it than a way of interpreting those bytes (aside from some auxiliary data
//! kept for performance reasons, such as caches). This property lets us completely
//! separate the logical data from its physical representation, greatly simplifies storage
//! management, and allows code reuse. It is also a very versatile abstraction, as in theory
//! any persistent data could be stored this way -- sometimes you need to swap in a different
//! `CachedStore` implementation, but without having to touch the code for the persisted data structure.
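//!
//! The decoupling described above can be pictured with a tiny trait: the data structure only
//! ever sees a flat, byte-addressable store and never the files backing it. This is a
//! simplified sketch, not the actual `CachedStore` trait:
//!
//! ```rust
//! /// A simplified stand-in for the linear store abstraction (not the real trait).
//! trait LinearStore {
//!     /// Read `len` bytes starting at a 64-bit offset.
//!     fn read(&self, offset: u64, len: u64) -> Vec<u8>;
//!     /// Overwrite bytes starting at `offset`.
//!     fn write(&mut self, offset: u64, data: &[u8]);
//! }
//!
//! /// The trie code is written against the store abstraction only; whether the
//! /// bytes live in memory, in one file, or chunked across files is an
//! /// implementation detail of the store.
//! fn read_root_hash<S: LinearStore>(store: &S, root_offset: u64) -> Vec<u8> {
//!     store.read(root_offset, 32)
//! }
//! ```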
//!
//! ## Page-based Shadowing and Revisions
//!
//! Following the idea that the tries are just a view of a linear byte store, all writes made to the
//! tries inside Firewood will eventually be consolidated into some interval writes to the linear
//! store. The writes may overlap and some frequent writes are even done to the same spot in the
//! store. To reduce the overhead and be friendly to the disk, we partition the entire 64-bit
//! virtual store into pages (yeah it appears to be more and more like an OS) and keep track of the
//! dirty pages in some `CachedStore` instantiation (see `storage::StoreRevMut`). When a
//! `db::Proposal` commits, both the recorded interval writes and the aggregated in-memory
//! dirty pages induced by this write batch are taken from the linear store. Although they are
//! mathematically equivalent, interval writes are more compact than pages (which are 4K in size
//! and become dirty even if a single byte is touched). So interval writes are fed into the WAL
//! subsystem (supported by growthring). After the WAL record is written (one record per write batch),
//! the dirty pages are then pushed to the on-disk linear store to mirror the change via
//! asynchronous, out-of-order file writes. See the `BufferCmd::WriteBatch` part of `DiskBuffer::process`
//! for the detailed logic.
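//!
//! For example, mapping a single interval write onto the 4K pages it dirties looks roughly
//! like this (a self-contained sketch, not Firewood code):
//!
//! ```rust
//! const PAGE_SIZE: u64 = 4096;
//!
//! /// Page indexes dirtied by a write of `len` bytes starting at `offset`.
//! fn dirty_pages(offset: u64, len: u64) -> std::ops::RangeInclusive<u64> {
//!     let first = offset / PAGE_SIZE;
//!     let last = (offset + len.max(1) - 1) / PAGE_SIZE;
//!     first..=last
//! }
//!
//! // A 10-byte write straddling a page boundary dirties two whole pages, which is
//! // why the compact (offset, bytes) interval is what goes into the WAL record.
//! assert_eq!(dirty_pages(4090, 10), 0..=1);
//! ```
//!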
//! Firewood is built from layers of abstraction that totally decouple the layout/representation
//! of the data on disk from the actual logical data structure it retains:
//!
//! - The storage module provides a [storage::NodeStore], which is generic over the
//! state of the nodestore and over the storage type.
//!
//! There are three states for a nodestore:
//! - [storage::Committed] for revisions that are on disk
//! - [storage::ImmutableProposal] for revisions that are proposals against committed versions
//! - [storage::MutableProposal] for revisions where nodes are still being added.
//!
//! For more information on these node states, see their associated documentation.
//!
//! The storage type is either a file or memory. Memory storage is used for creating temporary
//! merkle tries for proofs as well as testing. Nodes are identified by their offset within the
//! storage medium (a memory array or a disk file).
//!
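//! A condensed sketch of this typestate arrangement follows; the real types carry far more
//! state, and the definitions below are illustrative only:
//!
//! ```rust,ignore
//! // Illustrative stand-ins for the real storage types.
//! struct Committed;          // revision is on disk
//! struct ImmutableProposal;  // proposal made against a committed revision
//! struct MutableProposal;    // nodes are still being added
//!
//! // Storage backends: a file on disk or an in-memory array, both addressed by offset.
//! trait ReadableStorage { /* read bytes at a given offset */ }
//!
//! // Generic over both the revision state and the backend, so that, for example,
//! // only a `NodeStore<MutableProposal, _>` exposes mutation methods.
//! struct NodeStore<State, S: ReadableStorage> {
//!     state: State,
//!     storage: S,
//! }
//! ```
//!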
//! ## Node caching
//!
//! Once committed, nodes never change until they expire for re-use. This means that a node cache
//! can reduce the amount of serialization and deserialization of nodes. The size of the cache, in
//! nodes, is specified when the database is opened.
//!
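//! The idea can be sketched as a map from disk address to deserialized node, since a committed
//! node at a given address never changes until that address is recycled. This is a minimal
//! illustration rather than the actual cache implementation; `Node` stands in for the
//! deserialized node type, and eviction is omitted:
//!
//! ```rust,ignore
//! use std::collections::HashMap;
//! use std::sync::Arc;
//!
//! struct NodeCache {
//!     max_nodes: usize,               // cache size in nodes, set when the database is opened
//!     nodes: HashMap<u64, Arc<Node>>, // disk address -> deserialized node (illustrative)
//! }
//!
//! impl NodeCache {
//!     fn get_or_read(&mut self, addr: u64, read: impl Fn(u64) -> Arc<Node>) -> Arc<Node> {
//!         if let Some(node) = self.nodes.get(&addr) {
//!             return node.clone(); // hit: no disk read or deserialization needed
//!         }
//!         let node = read(addr);   // miss: read from storage and deserialize
//!         if self.nodes.len() < self.max_nodes {
//!             self.nodes.insert(addr, node.clone());
//!         }
//!         node
//!     }
//! }
//! ```
//!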
//! In short, a Read-Modify-Write (RMW) style normal operation flow is as follows in Firewood:
//!
//! - Traverse the trie, which accesses some nodes. If the nodes are not already in
//! memory, then:
//!
//! - Bring the necessary pages that contain the accessed nodes into memory and cache them
//! (`storage::CachedStore`).
//!
//! - Make changes to the trie, which writes to some nodes. Each node is either
//! already cached in memory (its pages are cached, or its handle `ObjRef<Node>` is still in
//! `shale::ObjCache`) or needs to be brought into memory (in that case, go back to the
//! previous step for it).
//!
//! - Writes to nodes are converted into interval writes to the staging `StoreRevMut` store that
//! overlays the `CachedStore`, so all dirty pages during the current write batch are
//! exactly captured in `StoreRevMut` (see `StoreRevMut::delta`).
//!
//! - Finally:
//!
//! - Abort: when the write batch is dropped without invoking `db::Proposal::commit`, all in-memory
//! changes are discarded, the dirty pages from `StoreRevMut` are dropped, and the merkle trie
//! "reverts" to its original state without actually having to roll back anything.
//!
//! - Commit: otherwise, the write batch is committed; the interval writes (`storage::Ash`) are bundled
//! into a single WAL record (`storage::AshRecord`) and sent to the WAL subsystem before the dirty pages
//! are scheduled to be written to the store files. The dirty pages are also applied to the
//! underlying `CachedStore`, and `StoreRevMut` becomes empty again for further write batches.
//!
//! Parts of the following diagram show this normal flow. The "staging" store (implemented by
//! `StoreRevMut`) is a concept somewhat similar to the staging area in Git: it enables handling
//! of (and resuming from) write errors and the clean abandonment of an on-going write batch so that
//! the entire store state remains intact, and it also reduces unnecessary premature disk writes.
//! Essentially, we copy-on-write the pages in the store that are touched, without directly mutating
//! the underlying "master" store. The staging store is just a collection of these "shadowing" pages
//! and a reference to its base (master) store, so reads may partially hit those dirty pages
//! and/or fall through to the base, whereas all writes are captured. Finally, when things go well,
//! we "push down" these changes to the base and clear out the staging store.
//!
//! <p align="center">
//! <img src="https://ava-labs.github.io/firewood/assets/architecture.svg" width="100%">
//! </p>
//!
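//! The read/write behavior of the staging store can be sketched as follows, simplified to whole
//! pages; the small `BasePages` trait is an illustrative stand-in for the underlying master store:
//!
//! ```rust,ignore
//! use std::collections::HashMap;
//!
//! const PAGE_SIZE: usize = 4096;
//!
//! // Illustrative stand-in for the base (master) store.
//! trait BasePages {
//!     fn read_page(&self, page: u64) -> [u8; PAGE_SIZE];
//! }
//!
//! struct StagingStore<'a, B: BasePages> {
//!     base: &'a B,                          // the master store, never mutated here
//!     dirty: HashMap<u64, [u8; PAGE_SIZE]>, // shadow pages, keyed by page index
//! }
//!
//! impl<B: BasePages> StagingStore<'_, B> {
//!     fn read_page(&self, page: u64) -> [u8; PAGE_SIZE] {
//!         // Reads hit a shadow page if one exists, otherwise fall through to the base.
//!         match self.dirty.get(&page) {
//!             Some(p) => *p,
//!             None => self.base.read_page(page),
//!         }
//!     }
//!     fn write_page(&mut self, page: u64, data: [u8; PAGE_SIZE]) {
//!         // All writes are captured by the staging layer only.
//!         self.dirty.insert(page, data);
//!     }
//! }
//! ```
//!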
//! Thanks to the shadow pages, we can both revive historical versions of the store and
//! maintain a rolling window of past revisions on the fly. The right-hand side of the diagram
//! shows that previously logged write batch records can be kept even though they are no longer needed
//! for crash recovery. The interval writes from a record can be aggregated into
//! pages (see `storage::StoreDelta::new`) and used to reconstruct a "ghost" image of a past
//! revision of the linear store (much like the staging store, except that the ghost store is
//! essentially read-only once constructed). The shadow pages there function as
//! "rewinding" changes that patch the necessary locations in the linear store, while the rest of the
//! linear store is very likely untouched by that historical write batch.
//!
//! Then, with the three-layer abstraction described earlier, a historical trie can be
//! derived. In fact, because no mandatory traversal or scanning is involved, the
//! only cost to revive a historical state from the log is to play back the records and create
//! those shadow pages. There is very little additional cost because the ghost store is
//! materialized on demand as one accesses the historical trie.
//!
//! In the other direction, as new write batches are committed and the system moves forward, we can
//! maintain a rolling window of past revisions in memory with *zero* cost. The
//! mid-bottom of the diagram shows that when a write batch is committed, the persisted (master) store goes one
//! step forward, the staging store is cleared, and an extra ghost store (colored in purple) can be
//! created to hold the version of the store before the commit. A backward delta, itself a set of
//! shadow pages, is applied to counteract the change that has been made to the persisted store.
//! No change is required for the other historical ghost store instances. Finally, we can phase out
//! the very oldest ghost stores to keep the size of the rolling window invariant.
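//!
//! The rolling window itself amounts to a bounded queue of these reconstructed revisions; a
//! sketch of the bookkeeping, with illustrative types only:
//!
//! ```rust,ignore
//! use std::collections::VecDeque;
//!
//! // Illustrative bookkeeping; not the actual revision manager.
//! struct RevisionWindow<R> {
//!     max: usize,             // configured number of past revisions to keep
//!     revisions: VecDeque<R>, // oldest at the front, newest at the back
//! }
//!
//! impl<R> RevisionWindow<R> {
//!     fn on_commit(&mut self, ghost_of_previous: R) {
//!         self.revisions.push_back(ghost_of_previous);
//!         if self.revisions.len() > self.max {
//!             // Phase out the oldest ghost store so the window stays bounded.
//!             self.revisions.pop_front();
//!         }
//!     }
//! }
//! ```
//!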
//! - Create a [storage::MutableProposal] [storage::NodeStore] from the most recent [storage::Committed] one.
//! - Traverse the trie, starting at the root. Make a new root node by duplicating the existing
//! root from the committed revision and keep it in memory. As you continue traversing, make a copy
//! of each node you access if it is not already in memory.
//!
//! - Make changes to the trie, in memory. Each node you've accessed is currently in memory and is
//! owned by the [storage::MutableProposal]. Adding a node simply means adding a reference to it.
//!
//! - If you delete a node, mark it as deleted in the proposal and remove the child reference to it.
//!
//! - After making all mutations, convert the [storage::MutableProposal] to an [storage::ImmutableProposal]. This
//! involves walking the in-memory trie and looking for nodes without disk addresses, then assigning
//! them addresses from the parent's freelist. This gives each node an address, but it is still in
//! memory.
//!
//! - Since the root is guaranteed to be new, the new root will reference everything in the new revision.
//!
//! A commit involves simply writing the nodes and the freelist to disk (see the sketch below). If the proposal is
//! abandoned, nothing has actually been written to disk.
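//!
//! From the caller's point of view, this lifecycle looks roughly like the following; the
//! method names are illustrative assumptions, not the exact public API:
//!
//! ```rust,ignore
//! // `db` is an open database handle; the method names below are illustrative.
//! // Start a mutable proposal rooted at the latest committed revision.
//! let mut proposal = db.new_proposal_from_latest();
//!
//! // Mutations copy nodes into memory as they are touched; the committed
//! // revision itself is never modified.
//! proposal.insert(b"key", b"value");
//! proposal.delete(b"old-key");
//!
//! // Freezing assigns disk addresses (from the parent's freelist) to every new
//! // in-memory node, producing an immutable proposal.
//! let immutable = proposal.freeze();
//!
//! // Committing writes the nodes and the freelist to disk; dropping `immutable`
//! // instead would leave the disk untouched.
//! immutable.commit()?;
//! ```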
//!
pub mod db;
pub mod manager;