Doc update (#717)
rkuris authored Sep 6, 2024
1 parent 56decb8 commit 734cfe2
Showing 1 changed file with 50 additions and 126 deletions.
176 changes: 50 additions & 126 deletions firewood/src/lib.rs
@@ -15,20 +15,13 @@
//! a very fast storage layer for the EVM but could be used on any blockchain that
//! requires authenticated state.
//!
//! Firewood only attempts to store the latest state on disk and will actively clean up
//! unused state when state diffs are committed. To avoid reference counting trie nodes,
//! Firewood does not copy-on-write (COW) the state trie; instead, it keeps
//! one latest version of the trie index on disk and applies in-place updates to it.
//! Firewood keeps a configurable number of previous states in memory to power
//! state sync (which may occur at a few roots behind the current state).
//!
//! Firewood provides OS-level crash recovery via a write-ahead log (WAL). The WAL
//! guarantees atomicity and durability for the database, but also offers
//! “reversibility”: a portion of the old WAL can optionally be kept around to
//! allow a fast in-memory rollback that recovers past versions of the entire
//! store. While the store is running, new changes also contribute to the
//! configured window of changes (at batch granularity), so any past version
//! within that window can be accessed at no additional cost.
//! Firewood only attempts to store recent revisions on disk and will actively clean up
//! unused older revisions when state diffs are committed. The number of revisions retained is
//! configured when the database is opened.
//!
//! Firewood provides OS-level crash recovery, but not machine-level crash recovery. That is,
//! if the Firewood process crashes, the OS will flush the cache and leave the system in a valid state.
//! No protection is (currently) offered to handle machine failures.
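//!
//! As a concrete illustration, opening a database with a bounded revision history might look
//! roughly like the following. The configuration type, field name, and `open` call here are
//! illustrative assumptions, not the exact Firewood API:
//!
//! ```rust,ignore
//! // Hypothetical configuration; the real type and field names may differ.
//! let cfg = DbConfig {
//!     max_revisions: 128, // how many recent revisions remain accessible
//!     ..DbConfig::default()
//! };
//! let db = Db::open("firewood.db", cfg)?;
//! ```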
//!
//! # Design Philosophy & Overview
//!
@@ -62,124 +55,55 @@
//! benefit from such a design.
//!
//! In Firewood, we take a closer look at the second regime and have come up with a simple but
//! robust architecture that fulfills the need for such blockchain storage.
//! robust architecture that fulfills the need for such blockchain storage. However, Firewood
//! can also efficiently handle the first regime.
//!
//! ## Storage Model
//!
//! Firewood is built from three layers of abstraction that completely decouple the
//! layout/representation of the data on disk from the actual logical data structure it retains:
//!
//! - Linear, memory-like store: the `shale` crate offers a `CachedStore` abstraction for a
//! (64-bit) byte-addressable store that hides the machinery which actually persists
//! the in-memory data to the secondary storage medium (e.g., a hard drive). The implementor of `CachedStore`
//! provides the functions that give its user the illusion of operating on a
//! byte-addressable memory store: a "magical" array of bytes that can be viewed and changed
//! and is mirrored to disk. In reality, the linear store is chunked into files under a
//! directory, but the user does not even need to know about this.
//!
//! - Persistent item storage stash: `CompactStore` in `shale` defines a pool of typed objects that are
//! persisted on disk but also made accessible in memory transparently. It is built on top of `CachedStore`
//! and defines how "items" of a given type are laid out, allocated and recycled throughout their lifecycles.
//!
//! - Data structure: in Firewood, one trie is maintained by invoking `CompactStore` (see `src/merkle.rs`).
//! The data structure code is totally unaware of how its objects (i.e., nodes) are organized or
//! persisted on disk. It is as if they're just in memory, which makes it much easier to write
//! and maintain the code.
//!
//! Given this abstraction, it is easy to see that the actual data that determine the
//! state of the data structure (the trie) are exactly what the linear store (`CachedStore`) keeps track of:
//! a flat but conceptually large byte vector. In other words, given a valid byte vector as the
//! content of the linear store, the higher-level data structure is *uniquely* determined; there
//! is nothing more to it than a way of interpreting those bytes (aside from some auxiliary data
//! kept for performance reasons, such as caches). This property lets us completely
//! separate the logical data from its physical representation, greatly simplifies storage
//! management, and allows code reuse. It is also a very versatile abstraction, as in theory
//! any persistent data could be stored this way -- sometimes you need to swap in a different
//! `CachedStore` implementation, but without having to touch the code for the persisted data structure.
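//!
//! The decoupling described above can be pictured with a tiny trait: the data structure only
//! ever sees a flat, byte-addressable store and never the files backing it. This is a
//! simplified sketch, not the actual `CachedStore` trait:
//!
//! ```rust
//! /// A simplified stand-in for the linear store abstraction (not the real trait).
//! trait LinearStore {
//!     /// Read `len` bytes starting at a 64-bit offset.
//!     fn read(&self, offset: u64, len: u64) -> Vec<u8>;
//!     /// Overwrite bytes starting at `offset`.
//!     fn write(&mut self, offset: u64, data: &[u8]);
//! }
//!
//! /// The trie code is written against the store abstraction only; whether the
//! /// bytes live in memory, in one file, or chunked across files is an
//! /// implementation detail of the store.
//! fn read_root_hash<S: LinearStore>(store: &S, root_offset: u64) -> Vec<u8> {
//!     store.read(root_offset, 32)
//! }
//! ```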
//!
//! ## Page-based Shadowing and Revisions
//!
//! Following the idea that the tries are just a view of a linear byte store, all writes made to the
//! tries inside Firewood will eventually be consolidated into some interval writes to the linear
//! store. The writes may overlap and some frequent writes are even done to the same spot in the
//! store. To reduce the overhead and be friendly to the disk, we partition the entire 64-bit
//! virtual store into pages (yeah it appears to be more and more like an OS) and keep track of the
//! dirty pages in some `CachedStore` instantiation (see `storage::StoreRevMut`). When a
//! `db::Proposal` commits, both the recorded interval writes and the aggregated in-memory
//! dirty pages induced by this write batch are taken from the linear store. Although they are
//! mathematically equivalent, interval writes are more compact than pages (which are 4K in size
//! and become dirty even if a single byte is touched). So interval writes are fed into the WAL
//! subsystem (supported by growthring). After the WAL record is written (one record per write batch),
//! the dirty pages are then pushed to the on-disk linear store to mirror the change via
//! asynchronous, out-of-order file writes. See the `BufferCmd::WriteBatch` part of `DiskBuffer::process`
//! for the detailed logic.
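//!
//! For example, mapping a single interval write onto the 4K pages it dirties looks roughly
//! like this (a self-contained sketch, not Firewood code):
//!
//! ```rust
//! const PAGE_SIZE: u64 = 4096;
//!
//! /// Page indexes dirtied by a write of `len` bytes starting at `offset`.
//! fn dirty_pages(offset: u64, len: u64) -> std::ops::RangeInclusive<u64> {
//!     let first = offset / PAGE_SIZE;
//!     let last = (offset + len.max(1) - 1) / PAGE_SIZE;
//!     first..=last
//! }
//!
//! // A 10-byte write straddling a page boundary dirties two whole pages, which is
//! // why the compact (offset, bytes) interval is what goes into the WAL record.
//! assert_eq!(dirty_pages(4090, 10), 0..=1);
//! ```
//!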
//! Firewood is built from layers of abstraction that totally decouple the layout/representation
//! of the data on disk from the actual logical data structure it retains:
//!
//! - The storage module provides a [storage::NodeStore], which is generic over the
//! state of the nodestore and over the storage type.
//!
//! There are three states for a nodestore:
//! - [storage::Committed] for revisions that are on disk
//! - [storage::ImmutableProposal] for revisions that are proposals against committed versions
//! - [storage::MutableProposal] for revisions where nodes are still being added.
//!
//! For more information on these node states, see their associated documentation.
//!
//! The storage type is either a file or memory. Memory storage is used for creating temporary
//! merkle tries for proofs as well as testing. Nodes are identified by their offset within the
//! storage medium (a memory array or a disk file).
//!
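//! A condensed sketch of this typestate arrangement follows; the real types carry far more
//! state, and the definitions below are illustrative only:
//!
//! ```rust,ignore
//! // Illustrative stand-ins for the real storage types.
//! struct Committed;          // revision is on disk
//! struct ImmutableProposal;  // proposal made against a committed revision
//! struct MutableProposal;    // nodes are still being added
//!
//! // Storage backends: a file on disk or an in-memory array, both addressed by offset.
//! trait ReadableStorage { /* read bytes at a given offset */ }
//!
//! // Generic over both the revision state and the backend, so that, for example,
//! // only a `NodeStore<MutableProposal, _>` exposes mutation methods.
//! struct NodeStore<State, S: ReadableStorage> {
//!     state: State,
//!     storage: S,
//! }
//! ```
//!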
//! ## Node caching
//!
//! Once committed, nodes never change until they expire for re-use. This means that a node cache
//! can reduce the amount of serialization and deserialization of nodes. The size of the cache, in
//! nodes, is specified when the database is opened.
//!
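//! The idea can be sketched as a map from disk address to deserialized node, since a committed
//! node at a given address never changes until that address is recycled. This is a minimal
//! illustration rather than the actual cache implementation; `Node` stands in for the
//! deserialized node type, and eviction is omitted:
//!
//! ```rust,ignore
//! use std::collections::HashMap;
//! use std::sync::Arc;
//!
//! struct NodeCache {
//!     max_nodes: usize,               // cache size in nodes, set when the database is opened
//!     nodes: HashMap<u64, Arc<Node>>, // disk address -> deserialized node (illustrative)
//! }
//!
//! impl NodeCache {
//!     fn get_or_read(&mut self, addr: u64, read: impl Fn(u64) -> Arc<Node>) -> Arc<Node> {
//!         if let Some(node) = self.nodes.get(&addr) {
//!             return node.clone(); // hit: no disk read or deserialization needed
//!         }
//!         let node = read(addr);   // miss: read from storage and deserialize
//!         if self.nodes.len() < self.max_nodes {
//!             self.nodes.insert(addr, node.clone());
//!         }
//!         node
//!     }
//! }
//! ```
//!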
//! In short, a Read-Modify-Write (RMW) style normal operation flow is as follows in Firewood:
//!
//! - Traverse the trie, which accesses some nodes. If the nodes are not already in
//! memory, then:
//!
//! - Bring the necessary pages that contain the accessed nodes into memory and cache them
//! (`storage::CachedStore`).
//!
//! - Make changes to the trie, which writes to some nodes. Each node is either
//! already cached in memory (its pages are cached, or its handle `ObjRef<Node>` is still in
//! `shale::ObjCache`) or needs to be brought into memory (in that case, go back to the
//! previous step for it).
//!
//! - Writes to nodes are converted into interval writes to the staging `StoreRevMut` store that
//! overlays the `CachedStore`, so all dirty pages during the current write batch are
//! exactly captured in `StoreRevMut` (see `StoreRevMut::delta`).
//!
//! - Finally:
//!
//! - Abort: when the write batch is dropped without invoking `db::Proposal::commit`, all in-memory
//! changes are discarded, the dirty pages from `StoreRevMut` are dropped, and the merkle trie
//! "reverts" to its original state without actually having to roll back anything.
//!
//! - Commit: otherwise, the write batch is committed; the interval writes (`storage::Ash`) are bundled
//! into a single WAL record (`storage::AshRecord`) and sent to the WAL subsystem before the dirty pages
//! are scheduled to be written to the store files. The dirty pages are also applied to the
//! underlying `CachedStore`, and `StoreRevMut` becomes empty again for further write batches.
//!
//! Parts of the following diagram show this normal flow. The "staging" store (implemented by
//! `StoreRevMut`) is a concept somewhat similar to the staging area in Git: it enables handling
//! of (and resuming from) write errors and the clean abandonment of an on-going write batch so that
//! the entire store state remains intact, and it also reduces unnecessary premature disk writes.
//! Essentially, we copy-on-write the pages in the store that are touched, without directly mutating
//! the underlying "master" store. The staging store is just a collection of these "shadowing" pages
//! and a reference to its base (master) store, so reads may partially hit those dirty pages
//! and/or fall through to the base, whereas all writes are captured. Finally, when things go well,
//! we "push down" these changes to the base and clear out the staging store.
//!
//! <p align="center">
//! <img src="https://ava-labs.github.io/firewood/assets/architecture.svg" width="100%">
//! </p>
//!
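//! The read/write behavior of the staging store can be sketched as follows, simplified to whole
//! pages; the small `BasePages` trait is an illustrative stand-in for the underlying master store:
//!
//! ```rust,ignore
//! use std::collections::HashMap;
//!
//! const PAGE_SIZE: usize = 4096;
//!
//! // Illustrative stand-in for the base (master) store.
//! trait BasePages {
//!     fn read_page(&self, page: u64) -> [u8; PAGE_SIZE];
//! }
//!
//! struct StagingStore<'a, B: BasePages> {
//!     base: &'a B,                          // the master store, never mutated here
//!     dirty: HashMap<u64, [u8; PAGE_SIZE]>, // shadow pages, keyed by page index
//! }
//!
//! impl<B: BasePages> StagingStore<'_, B> {
//!     fn read_page(&self, page: u64) -> [u8; PAGE_SIZE] {
//!         // Reads hit a shadow page if one exists, otherwise fall through to the base.
//!         match self.dirty.get(&page) {
//!             Some(p) => *p,
//!             None => self.base.read_page(page),
//!         }
//!     }
//!     fn write_page(&mut self, page: u64, data: [u8; PAGE_SIZE]) {
//!         // All writes are captured by the staging layer only.
//!         self.dirty.insert(page, data);
//!     }
//! }
//! ```
//!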
//! Thanks to the shadow pages, we can both revive historical versions of the store and
//! maintain a rolling window of past revisions on the fly. The right-hand side of the diagram
//! shows that previously logged write batch records can be kept even though they are no longer needed
//! for crash recovery. The interval writes from a record can be aggregated into
//! pages (see `storage::StoreDelta::new`) and used to reconstruct a "ghost" image of a past
//! revision of the linear store (much like the staging store, except that the ghost store is
//! essentially read-only once constructed). The shadow pages there function as
//! "rewinding" changes that patch the necessary locations in the linear store, while the rest of the
//! linear store is very likely untouched by that historical write batch.
//!
//! Then, with the three-layer abstraction described earlier, a historical trie can be
//! derived. In fact, because no mandatory traversal or scanning is involved, the
//! only cost to revive a historical state from the log is to play back the records and create
//! those shadow pages. There is very little additional cost because the ghost store is
//! materialized on demand as one accesses the historical trie.
//!
//! In the other direction, as new write batches are committed and the system moves forward, we can
//! maintain a rolling window of past revisions in memory with *zero* cost. The
//! mid-bottom of the diagram shows that when a write batch is committed, the persisted (master) store goes one
//! step forward, the staging store is cleared, and an extra ghost store (colored in purple) can be
//! created to hold the version of the store before the commit. A backward delta, itself a set of
//! shadow pages, is applied to counteract the change that has been made to the persisted store.
//! No change is required for the other historical ghost store instances. Finally, we can phase out
//! the very oldest ghost stores to keep the size of the rolling window invariant.
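//!
//! The rolling window itself amounts to a bounded queue of these reconstructed revisions; a
//! sketch of the bookkeeping, with illustrative types only:
//!
//! ```rust,ignore
//! use std::collections::VecDeque;
//!
//! // Illustrative bookkeeping; not the actual revision manager.
//! struct RevisionWindow<R> {
//!     max: usize,             // configured number of past revisions to keep
//!     revisions: VecDeque<R>, // oldest at the front, newest at the back
//! }
//!
//! impl<R> RevisionWindow<R> {
//!     fn on_commit(&mut self, ghost_of_previous: R) {
//!         self.revisions.push_back(ghost_of_previous);
//!         if self.revisions.len() > self.max {
//!             // Phase out the oldest ghost store so the window stays bounded.
//!             self.revisions.pop_front();
//!         }
//!     }
//! }
//! ```
//!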
//! - Create a [storage::MutableProposal] [storage::NodeStore] from the most recent [storage::Committed] one.
//! - Traverse the trie, starting at the root. Make a new root node by duplicating the existing
//! root from the committed revision and keep it in memory. As you continue traversing, make a copy
//! of each node you access if it is not already in memory.
//!
//! - Make changes to the trie, in memory. Each node you've accessed is currently in memory and is
//! owned by the [storage::MutableProposal]. Adding a node simply means adding a reference to it.
//!
//! - If you delete a node, mark it as deleted in the proposal and remove the child reference to it.
//!
//! - After making all mutations, convert the [storage::MutableProposal] to an [storage::ImmutableProposal]. This
//! involves walking the in-memory trie and looking for nodes without disk addresses, then assigning
//! them addresses from the parent's freelist. This gives each node an address, but it is still in
//! memory.
//!
//! - Since the root is guaranteed to be new, the new root will reference everything in the new revision.
//!
//! A commit involves simply writing the nodes and the freelist to disk (see the sketch below). If the proposal is
//! abandoned, nothing has actually been written to disk.
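//!
//! From the caller's point of view, this lifecycle looks roughly like the following; the
//! method names are illustrative assumptions, not the exact public API:
//!
//! ```rust,ignore
//! // `db` is an open database handle; the method names below are illustrative.
//! // Start a mutable proposal rooted at the latest committed revision.
//! let mut proposal = db.new_proposal_from_latest();
//!
//! // Mutations copy nodes into memory as they are touched; the committed
//! // revision itself is never modified.
//! proposal.insert(b"key", b"value");
//! proposal.delete(b"old-key");
//!
//! // Freezing assigns disk addresses (from the parent's freelist) to every new
//! // in-memory node, producing an immutable proposal.
//! let immutable = proposal.freeze();
//!
//! // Committing writes the nodes and the freelist to disk; dropping `immutable`
//! // instead would leave the disk untouched.
//! immutable.commit()?;
//! ```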
//!
pub mod db;
pub mod manager;