diff --git a/firewood/src/lib.rs b/firewood/src/lib.rs
index 845ead1f8..a452d1379 100644
--- a/firewood/src/lib.rs
+++ b/firewood/src/lib.rs
@@ -15,20 +15,13 @@
 //! a very fast storage layer for the EVM but could be used on any blockchain that
 //! requires authenticated state.
 //!
-//! Firewood only attempts to store the latest state on-disk and will actively clean up
-//! unused state when state diffs are committed. To avoid reference counting trie nodes,
-//! Firewood does not copy-on-write (COW) the state trie and instead keeps
-//! one latest version of the trie index on disk and applies in-place updates to it.
-//! Firewood keeps some configurable number of previous states in memory to power
-//! state sync (which may occur at a few roots behind the current state).
-//!
-//! Firewood provides OS-level crash recovery via a write-ahead log (WAL). The WAL
-//! guarantees atomicity and durability in the database, but also offers
-//! “reversibility”: some portion of the old WAL can be optionally kept around to
-//! allow a fast in-memory rollback to recover some past versions of the entire
-//! store back in memory. While running the store, new changes will also contribute
-//! to the configured window of changes (at batch granularity) to access any past
-//! versions with no additional cost at all.
+//! Firewood only attempts to store recent revisions on-disk and will actively clean up
+//! unused older revisions when state diffs are committed. The number of revisions is
+//! configured when the database is opened.
+//!
+//! Firewood provides OS-level crash recovery, but not machine-level crash recovery. That is,
+//! if the Firewood process crashes, the OS will flush the cache and leave the system in a valid state.
+//! No protection is (currently) offered to handle machine failures.
 //!
 //! # Design Philosophy & Overview
 //!
@@ -62,124 +55,55 @@
 //! benefit from such a design.
 //!
 //! In Firewood, we take a closer look at the second regime and have come up with a simple but
-//! robust architecture that fulfills the need for such blockchain storage.
+//! robust architecture that fulfills the need for such blockchain storage. However, Firewood
+//! can also efficiently handle the first regime.
 //!
 //! ## Storage Model
 //!
-//! Firewood is built by three layers of abstractions that totally decouple the
-//! layout/representation of the data on disk from the actual logical data structure it retains:
-//!
-//! - Linear, memory-like store: the `shale` crate offers a `CachedStore` abstraction for a
-//! (64-bit) byte-addressable store that abstracts away the intricate method that actually persists
-//! the in-memory data on the secondary storage medium (e.g., hard drive). The implementor of `CachedStore`
-//! provides the functions to give the user of `CachedStore` an illusion that the user is operating upon a
-//! byte-addressable memory store. It is just a "magical" array of bytes one can view and change
-//! that is mirrored to the disk. In reality, the linear store will be chunked into files under a
-//! directory, but the user does not have to even know about this.
-//!
-//! - Persistent item storage stash: `CompactStore` in `shale` defines a pool of typed objects that are
-//! persisted on disk but also made accessible in memory transparently. It is built on top of `CachedStore`
-//! and defines how "items" of a given type are laid out, allocated and recycled throughout their lifecycles.
-//!
-//!
-//! - Data structure: in Firewood, one trie is maintained by invoking `CompactStore` (see `src/merkle.rs`).
-//! The data structure code is totally unaware of how its objects (i.e., nodes) are organized or
-//! persisted on disk. It is as if they're just in memory, which makes it much easier to write
-//! and maintain the code.
-//!
-//! Given the abstraction, one can easily realize the fact that the actual data that affect the
-//! state of the data structure (trie) is what the linear store (`CachedStore`) keeps track of. That is,
-//! a flat but conceptually large byte vector. In other words, given a valid byte vector as the
-//! content of the linear store, the higher level data structure can be *uniquely* determined, there
-//! is nothing more (except for some auxiliary data that are kept for performance reasons, such as caching)
-//! or less than that, like a way to interpret the bytes. This nice property allows us to completely
-//! separate the logical data from its physical representation, greatly simplifies the storage
-//! management, and allows reusing the code. It is still a very versatile abstraction, as in theory
-//! any persistent data could be stored this way -- sometimes you need to swap in a different
-//! `CachedStore` implementation, but without having to touch the code for the persisted data structure.
-//!
-//! ## Page-based Shadowing and Revisions
-//!
-//! Following the idea that the tries are just a view of a linear byte store, all writes made to the
-//! tries inside Firewood will eventually be consolidated into some interval writes to the linear
-//! store. The writes may overlap and some frequent writes are even done to the same spot in the
-//! store. To reduce the overhead and be friendly to the disk, we partition the entire 64-bit
-//! virtual store into pages (yeah it appears to be more and more like an OS) and keep track of the
-//! dirty pages in some `CachedStore` instantiation (see `storage::StoreRevMut`). When a
-//! `db::Proposal` commits, both the recorded interval writes and the aggregated in-memory
-//! dirty pages induced by this write batch are taken out from the linear store. Although they are
-//! mathematically equivalent, interval writes are more compact than pages (which are 4K in size,
-//! become dirty even if a single byte is touched upon) . So interval writes are fed into the WAL
-//! subsystem (supported by growthring). After the WAL record is written (one record per write batch),
-//! the dirty pages are then pushed to the on-disk linear store to mirror the change by some
-//! asynchronous, out-of-order file writes. See the `BufferCmd::WriteBatch` part of `DiskBuffer::process`
-//! for the detailed logic.
+//! Firewood is built by layers of abstractions that totally decouple the layout/representation
+//! of the data on disk from the actual logical data structure it retains:
+//!
+//! - The storage module has a [storage::NodeStore], which is generic over both the state of
+//! the nodestore and the storage type.
+//!
+//! There are three states for a nodestore:
+//! - [storage::Committed] for revisions that are on disk
+//! - [storage::ImmutableProposal] for revisions that are proposals against committed versions
+//! - [storage::MutableProposal] for revisions where nodes are still being added.
+//!
+//! For more information on these node states, see their associated documentation.
+//!
+//! The storage type is either a file or memory. Memory storage is used for creating temporary
+//! merkle tries for proofs as well as testing. Nodes are identified by their offset within the
+//! storage medium (a memory array or a disk file).
+//!
+//! ## Node caching
+//!
+//! Once committed, nodes never change until they expire for re-use. This means that a node cache
+//! can reduce the amount of serialization and deserialization of nodes. The size of the cache, in
+//! nodes, is specified when the database is opened.
 //!
 //! In short, a Read-Modify-Write (RMW) style normal operation flow is as follows in Firewood:
 //!
-//! - Traverse the trie, and that induces the access to some nodes. Suppose the nodes are not already in
-//! memory, then:
-//!
-//! - Bring the necessary pages that contain the accessed nodes into the memory and cache them
-//! (`storage::CachedStore`).
-//!
-//! - Make changes to the trie, and that induces the writes to some nodes. The nodes are either
-//! already cached in memory (its pages are cached, or its handle `ObjRef` is still in
-//! `shale::ObjCache`) or need to be brought into the memory (if that's the case, go back to the
-//! second step for it).
-//!
-//! - Writes to nodes are converted into interval writes to the stagging `StoreRevMut` store that
-//! overlays atop `CachedStore`, so all dirty pages during the current write batch will be
-//! exactly captured in `StoreRevMut` (see `StoreRevMut::delta`).
-//!
-//! - Finally:
-//!
-//! - Abort: when the write batch is dropped without invoking `db::Proposal::commit`, all in-memory
-//! changes will be discarded, the dirty pages from `StoreRevMut` will be dropped and the merkle
-//! will "revert" back to its original state without actually having to rollback anything.
-//!
-//! - Commit: otherwise, the write batch is committed, the interval writes (`storage::Ash`) will be bundled
-//! into a single WAL record (`storage::AshRecord`) and sent to WAL subsystem, before dirty pages
-//! are scheduled to be written to the store files. Also the dirty pages are applied to the
-//! underlying `CachedStore`. `StoreRevMut` becomes empty again for further write batches.
-//!
-//! Parts of the following diagram show this normal flow, the "staging" store (implemented by
-//! `StoreRevMut`) concept is a bit similar to the staging area in Git, which enables the handling
-//! of (resuming from) write errors, clean abortion of an on-going write batch so the entire store
-//! state remains intact, and also reduces unnecessary premature disk writes. Essentially, we
-//! copy-on-write pages in the store that are touched upon, without directly mutating the
-//! underlying "master" store. The staging store is just a collection of these "shadowing" pages
-//! and a reference to the its base (master) so any reads could partially hit those dirty pages
-//! and/or fall through to the base, whereas all writes are captured. Finally, when things go well,
-//! we "push down" these changes to the base and clear up the staging store.
-//!
-//!

-//!
-//! (architecture diagram omitted)
-//!
-//!
-//! Thanks to the shadow pages, we can both revive some historical versions of the store and
-//! maintain a rolling window of past revisions on-the-fly. The right hand side of the diagram
-//! shows previously logged write batch records could be kept even though they are no longer needed
-//! for the purpose of crash recovery. The interval writes from a record can be aggregated into
-//! pages (see `storage::StoreDelta::new`) and used to reconstruct a "ghost" image of past
-//! revision of the linear store (just like how staging store works, except that the ghost store is
-//! essentially read-only once constructed). The shadow pages there will function as some
-//! "rewinding" changes to patch the necessary locations in the linear store, while the rest of the
-//! linear store is very likely untouched by that historical write batch.
-//!
-//! Then, with the three-layer abstraction we previously talked about, a historical trie could be
-//! derived. In fact, because there is no mandatory traversal or scanning in the process, the
-//! only cost to revive a historical state from the log is to just playback the records and create
-//! those shadow pages. There is very little additional cost because the ghost store is summoned on an
-//! on-demand manner while one accesses the historical trie.
-//!
-//! In the other direction, when new write batches are committed, the system moves forward, we can
-//! therefore maintain a rolling window of past revisions in memory with *zero* cost. The
-//! mid-bottom of the diagram shows when a write batch is committed, the persisted (master) store goes one
-//! step forward, the staging store is cleared, and an extra ghost store (colored in purple) can be
-//! created to hold the version of the store before the commit. The backward delta is applied to
-//! counteract the change that has been made to the persisted store, which is also a set of shadow pages.
-//! No change is required for other historical ghost store instances. Finally, we can phase out
-//! some very old ghost store to keep the size of the rolling window invariant.
+//! - Create a [storage::MutableProposal] [storage::NodeStore] from the most recent [storage::Committed] one.
+//! - Traverse the trie, starting at the root. Make a new root node by duplicating the existing
+//! root from the committed one and save it in memory. As you continue traversing, make copies
+//! of each node accessed if they are not already in memory.
+//!
+//! - Make changes to the trie, in memory. Each node you've accessed is currently in memory and is
+//! owned by the [storage::MutableProposal]. Adding a node simply means adding a reference to it.
+//!
+//! - If you delete a node, mark it as deleted in the proposal and remove the child reference to it.
+//!
+//! - After making all mutations, convert the [storage::MutableProposal] to an [storage::ImmutableProposal]. This
+//! involves walking the in-memory trie and looking for nodes without disk addresses, then assigning
+//! them from the freelist of the parent. This gives the node an address, but it is still in
+//! memory.
+//!
+//! - Since the root is guaranteed to be new, the new root will reference all of the nodes in the new revision.
+//!
+//! A commit involves simply writing the nodes and the freelist to disk. If the proposal is
+//! abandoned, nothing has actually been written to disk.
 //!
 pub mod db;
 pub mod manager;
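The proposal lifecycle described in the added documentation (create a `MutableProposal` from the latest `Committed` revision, freeze it into an `ImmutableProposal`, then commit it) is essentially a type-state pattern: the revision state is a type parameter of the node store, and each transition consumes one state and returns the next. The sketch below illustrates only that pattern. Every type and method name in it (`NodeStore`, `Committed`, `MutableProposal`, `ImmutableProposal`, `Storage`, `MemStorage`, `Addr`, `propose`, `freeze`, `commit`) is a simplified stand-in invented for this example and does not reproduce Firewood's actual `storage` API.

```rust
// Hypothetical sketch of the type-state flow; not Firewood's actual `storage` API.
use std::collections::HashMap;

type Addr = u64; // a node address is an offset within the storage medium

struct Node {
    data: Vec<u8>, // stand-in for a serialized trie node
}

/// Storage backend: a disk file in production, memory for tests and proof tries.
trait Storage {
    fn write(&mut self, addr: Addr, bytes: &[u8]);
}

#[derive(Default)]
struct MemStorage {
    bytes: HashMap<Addr, Vec<u8>>,
}

impl Storage for MemStorage {
    fn write(&mut self, addr: Addr, bytes: &[u8]) {
        self.bytes.insert(addr, bytes.to_vec());
    }
}

// The three revision states the documentation describes.
struct Committed; // revision is fully on disk
struct MutableProposal { new_nodes: Vec<Node> } // nodes still being added, no addresses yet
struct ImmutableProposal { nodes: HashMap<Addr, Node> } // addresses assigned, still in memory

/// A node store generic over its revision state `T` and its storage backend `S`.
struct NodeStore<T, S> {
    state: T,
    storage: S,
    next_free: Addr, // crude stand-in for a free list
}

impl<S: Storage> NodeStore<Committed, S> {
    /// Start a mutable proposal on top of the committed revision.
    fn propose(self) -> NodeStore<MutableProposal, S> {
        NodeStore { state: MutableProposal { new_nodes: Vec::new() }, storage: self.storage, next_free: self.next_free }
    }
}

impl<S: Storage> NodeStore<MutableProposal, S> {
    /// Copied or newly created nodes live only in memory while the proposal is mutable.
    fn add_node(&mut self, node: Node) {
        self.state.new_nodes.push(node);
    }

    /// Freeze the proposal: assign an address to every in-memory node.
    fn freeze(mut self) -> NodeStore<ImmutableProposal, S> {
        let mut nodes = HashMap::new();
        for node in self.state.new_nodes {
            let addr = self.next_free;
            self.next_free += node.data.len() as Addr;
            nodes.insert(addr, node);
        }
        NodeStore { state: ImmutableProposal { nodes }, storage: self.storage, next_free: self.next_free }
    }
}

impl<S: Storage> NodeStore<ImmutableProposal, S> {
    /// Committing is just writing the already-addressed nodes to the storage backend.
    fn commit(mut self) -> NodeStore<Committed, S> {
        for (addr, node) in &self.state.nodes {
            self.storage.write(*addr, &node.data);
        }
        NodeStore { state: Committed, storage: self.storage, next_free: self.next_free }
    }
}

fn main() {
    let committed = NodeStore { state: Committed, storage: MemStorage::default(), next_free: 0 };
    let mut proposal = committed.propose();
    proposal.add_node(Node { data: b"new root".to_vec() });
    // Dropping `proposal` here instead would write nothing to storage.
    let _committed = proposal.freeze().commit();
}
```

Dropping the proposal value before `commit` discards only in-memory nodes, which mirrors the note above that an abandoned proposal never touches the disk.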