From c7cc9456396e539bdc20f0522a48b104e37eea81 Mon Sep 17 00:00:00 2001
From: RJ Rybarczyk
Date: Mon, 4 Nov 2024 19:44:55 -0500
Subject: [PATCH] Update README

---
 utils/buffer/README.md | 333 +++++++++++++++++++++++++----------------
 1 file changed, 201 insertions(+), 132 deletions(-)

diff --git a/utils/buffer/README.md b/utils/buffer/README.md
index 39ac180e2..27a626966 100644
--- a/utils/buffer/README.md
+++ b/utils/buffer/README.md
@@ -1,105 +1,154 @@
-# BufferPool
+# `buffer_sv2`
 
-This crate provides a `Write` trait used to replace `std::io::Write` in a non_std environment a `Buffer`
-trait and two implementations of `Buffer`: `BufferFromSystemMemory` and `BufferPool`.
+[![crates.io](https://img.shields.io/crates/v/buffer_sv2.svg)](https://crates.io/crates/buffer_sv2)
+[![docs.rs](https://docs.rs/buffer_sv2/badge.svg)](https://docs.rs/buffer_sv2)
+[![rustc+](https://img.shields.io/badge/rustc-1.75.0%2B-lightgrey.svg)](https://blog.rust-lang.org/2023/12/28/Rust-1.75.0.html)
+[![license](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](https://github.com/stratum-mining/stratum/blob/main/LICENSE.md)
+[![codecov](https://codecov.io/gh/stratum-mining/stratum/branch/main/graph/badge.svg?flag=buffer_sv2-coverage)](https://codecov.io/gh/stratum-mining/stratum)
 
-## Intro
-`BufferPool` is useful whenever we need to work with buffers sequentially (fill a buffer, get
-the filled buffer, fill a new buffer, get the filled buffer, and so on).
+`buffer_sv2` handles memory management for Stratum v2 (Sv2) roles. It provides a memory-efficient
+buffer pool that minimizes allocations and deallocations during high-throughput message frame
+processing. Allocation overhead is reduced by reusing large buffers, which improves performance and
+lowers latency. The buffer pool tracks the usage of memory slices, using shared state tracking to
+safely manage memory across multiple threads.
 
-To fill a buffer `BufferPool` returns an `&mut [u8]` with the requested len (the filling part).
-When the buffer is filled and the owner needs to be changed, `BufferPool` returns a `Slice` that
-implements `Send` and `AsMut` (the get part).
+## Main Components
 
-`BufferPool` pre-allocates a user-defined capacity in the heap and use it to allocate the buffers,
-when a `Slice` is dropped `BufferPool` acknowledge it and reuse the freed space in the pre-allocated
-memory.
+- **Buffer Trait**: An interface for working with memory buffers. It has two implementations,
+  `BufferPool` and `BufferFromSystemMemory`, and the crate also provides a `Write` trait that
+  replaces `std::io::Write` in `no_std` environments.
+- **BufferPool**: A thread-safe pool of reusable memory buffers for high-throughput applications.
+- **BufferFromSystemMemory**: Manages a dynamically growing buffer in system memory for applications
+  where performance is not a concern.
+- **Slice**: A contiguous block of memory, either preallocated or dynamically allocated.
 
-## Implementation
-The crate is `[no_std]` and lock-free, so, to synchronize the state of the pre-allocated memory
-(taken or free) between different contexts an `AtomicU8` is used.
-Each bit of the `u8` represent a memory slot, if the bit is 0 the memory slot is free if is
-1 is taken. Whenever `BufferPool` creates a `Slice` a bit is set to 1 and whenever a `Slice` is
-dropped a bit is set to 0.
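+
+The "shared state tracking" mentioned above can be pictured as one bit per memory slot in an
+`AtomicU8` (`0` = free, `1` = taken): a bit is set when a `Slice` is handed out and cleared when it
+is dropped, so slot usage can be tracked across threads without locks. The sketch below is purely
+conceptual and is not the crate's internal code:
+
+```rust
+use std::sync::atomic::{AtomicU8, Ordering};
+
+// Conceptual model only: each bit of the `AtomicU8` marks one of the 8 slots
+// as free (0) or taken (1).
+fn main() {
+    let shared_state = AtomicU8::new(0);
+
+    // Handing out a `Slice` backed by slot 3: set bit 3 atomically.
+    shared_state.fetch_or(1 << 3, Ordering::SeqCst);
+    assert_eq!(shared_state.load(Ordering::SeqCst), 0b0000_1000);
+
+    // Dropping that `Slice`: clear bit 3 so the slot can be reused.
+    shared_state.fetch_and(!(1 << 3), Ordering::SeqCst);
+    assert_eq!(shared_state.load(Ordering::SeqCst), 0);
+}
+```
+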
+## Usage
 
-## Use case
-`BufferPool` has been developed to be used in proxies with thousand of connection, each connection
-must parse a particular data format via a `Decoder` each decoder use 1 or 2 `Buffer` for each received
-message. With `BufferPool` each connection can be instantiated with its own `BufferPool` and reuse
-the space freed by old messages for new ones.
+To include this crate in your project, run:
 
-## Unsafe
-There are 5 unsafes:
-buffer_pool/mod.rs 550
-slice.rs 8
-slice.rs 27
+```bash
+cargo add buffer_sv2
+```
+
+This crate can be built with the following feature flags:
+
+- `debug`: Provides additional tracking for debugging memory management issues.
+- `fuzz`: Enables support for fuzz testing.
+- `with_serde`: Builds the [`binary_sv2`](https://crates.io/crates/binary_sv2) and
+  [`buffer_sv2`](https://crates.io/crates/buffer_sv2) crates with `serde`-based encoding and
+  decoding. Note that this feature flag is only used for the Message Generator and is deprecated
+  for any other use. It will likely be fully deprecated in the future.
+
+### Unsafe Code
+There are four `unsafe` code blocks:
+
+- `buffer_pool/mod.rs`: `fn get_writable_(&mut self, len: usize, shared_state: u8, without_check: bool) -> &mut [u8] { .. }` in the `impl BufferPool`
+- `slice.rs`:
+  - `unsafe impl Send for Slice {}`
+  - `fn as_mut(&mut self) -> &mut [u8] { .. }` in the `impl AsMut<[u8]> for Slice`
+  - `fn as_ref(&self) -> &[u8] { .. }` in the `impl AsRef<[u8]> for Slice`
+
+### Examples
+
+This crate provides four examples demonstrating how the memory is managed:
+
+1. **[Basic Usage Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/basic_buffer_pool.rs)**:
+   Creates a buffer pool, writes to it, and retrieves the data from it.
 
-## Write
-Waiting for `Write` in `core` a compatible trait is used so that it can be replaced.
+2. **[Buffer Pool Exhaustion Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/buffer_pool_exhaustion.rs)**:
+   Demonstrates how data is added to a buffer pool and dynamically allocates directly to the heap
+   once the buffer pool's capacity has been exhausted.
 
-## Buffer
-The `Buffer` trait has been written to work with `codec_sv2::Decoder`.
+3. **[Variable Sized Messages Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/variable_sized_messages.rs)**:
+   Writes messages of variable sizes to the buffer pool.
 
-`codec_sv2::Decoder` works by:
-1. fill a buffer of the size of the header of the protocol that is decoding
-2. parse the filled bytes and compute the message length
-3. fill a buffer of the size of the message
-4. use the header and the message to construct a `frame_sv2::Frame`
+4. **[Multi Threaded Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/multi_threaded_buffer_pool.rs)**:
+   Writes to the buffer pool in a multi-threaded context.
 
-To fill the buffer `Decoder` must pass a reference of the buffer to a filler. In order
-to construct a `Frame` the `Decoder` must pass the ownership of the buffer to `Frame`.
+## `Buffer` Trait
+
+The `Buffer` trait is designed to work with the
+[`codec_sv2`](https://docs.rs/codec_sv2/1.3.0/codec_sv2/index.html) decoders, which operate by the
+following steps (sketched in code right after the list):
+
+1. Filling a buffer of the size of the protocol header being decoded.
+2. Parsing the filled bytes to compute the message length.
+3. Filling a buffer of the size of the message.
+4. Using the header and message to construct a
+   [`framing_sv2::framing::Frame`](https://docs.rs/framing_sv2/2.0.0/framing_sv2/framing/enum.Frame.html).
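+
+A minimal sketch of those four steps, using the two `Buffer` methods documented in this section
+(`get_writable` and `get_data_owned`). The 2-byte length prefix is an invented toy format for
+illustration only (not the actual Sv2 framing), the `BufferPool::new(capacity)` constructor and
+crate-root imports are assumptions based on the crate docs, and error handling is omitted:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Assumed constructor: preallocates `capacity` bytes on the heap.
+    let mut pool = BufferPool::new(1024);
+
+    // Toy wire format: 2-byte big-endian length prefix, then the payload.
+    let incoming = [0u8, 3, 0xaa, 0xbb, 0xcc];
+
+    // 1. Fill a buffer of the size of the header.
+    pool.get_writable(2).copy_from_slice(&incoming[..2]);
+
+    // 2. Parse the filled bytes and compute the message length.
+    let msg_len = u16::from_be_bytes([incoming[0], incoming[1]]) as usize;
+
+    // 3. Fill a buffer of the size of the message.
+    pool.get_writable(msg_len).copy_from_slice(&incoming[2..2 + msg_len]);
+
+    // 4. Take ownership of header + payload (e.g. to build a frame); this resets the buffer.
+    let mut frame_bytes = pool.get_data_owned();
+    assert_eq!(frame_bytes.as_mut(), &incoming[..]);
+}
+```
+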
+To fill the buffer, the `codec_sv2` decoder must pass a reference of the buffer to a filler. To
+construct a `Frame`, the decoder must pass ownership of the buffer to the `Frame`.
 
 ```rust
-get_writable(&mut self, len: usize) -> &mut [u8]
+fn get_writable(&mut self, len: usize) -> &mut [u8];
 ```
 
-Return a mutable reference to the buffer, starting at buffer length and ending at buffer length + `len`.
-and set buffer len at previous len + len.
+The `get_writable` method returns a mutable reference to the buffer, starting at the current length
+and ending at the current length plus `len`, and sets the buffer length to the previous length plus
+`len`.
 
 ```rust
-get_data_owned(&mut self) -> Slice {
+fn get_data_owned(&mut self) -> Slice;
 ```
 
-It returns `Slice`: something that implements `AsMut[u8]` and `Send`, and sets the buffer len to 0.
+The `get_data_owned` method returns a `Slice` that implements `AsMut<[u8]>` and `Send`, and resets
+the buffer length to `0`.
+
+The `Buffer` trait is implemented for `BufferFromSystemMemory` and `BufferPool`. The crate also
+provides a `Write` trait to replace `std::io::Write` in `no_std` environments.
 
-## BufferFromSystemMemory
-Is the simplest implementation of a `Buffer`: each time that a new buffer is needed it create a new
-`Vec`.
+## `BufferFromSystemMemory`
 
-`get_writable(..)` returns mutable references to the inner vector.
+`BufferFromSystemMemory` is a simple implementation of the `Buffer` trait. Each time a new buffer is
+needed, it creates a new `Vec`.
 
-`get_data_owned(..)` returns the inner vector.
+- `get_writable(..)` returns mutable references to the inner vector.
+- `get_data_owned(..)` returns the inner vector.
 
-## BufferPool
-Usually `BufferFromSystemMemory` should be enough, but sometimes it is better to use something faster.
+## `BufferPool`
 
-For each Sv2 connection, there is a `Decoder` and for each decoder, there are 1 or 2 buffers.
-
-Proxies and pools with thousands of connections should use `Decoder` rather than
-`Decoder`
+While `BufferFromSystemMemory` is sufficient for many cases, `BufferPool` offers a more efficient
+solution for high-performance applications, such as proxies and pools with thousands of connections.
 
-`BufferPool` when created preallocate a user-defined capacity of bytes in the heap using a
-`Vec`, then when `get_data_owned(..)` is called it create a `Slice` that contains a view into
-the preallocated memory. `BufferPool` guarantees that slices never overlap and the uniqueness of
-the `Slice` ownership.
+When created, `BufferPool` preallocates a user-defined capacity of bytes in the heap using a
+`Vec`. When `get_data_owned(..)` is called, it creates a `Slice` that contains a view into the
+preallocated memory. `BufferPool` guarantees that slices never overlap and maintains unique
+ownership of each `Slice`.
 
-`Slice` implements `Drop` so that the view into the preallocated memory can be reused.
+`Slice` implements the `Drop` trait, allowing the view into the preallocated memory to be reused
+once the `Slice` is dropped.
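+
+A short sketch of that lifecycle, under the same assumed constructor and imports as the earlier
+example: data is written through `get_writable`, handed out as a `Slice` via `get_data_owned`, and
+the underlying preallocated memory becomes reusable as soon as the `Slice` is dropped:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Preallocates 128 bytes on the heap (constructor name assumed).
+    let mut pool = BufferPool::new(128);
+
+    // Fill the buffer, then take ownership of the written bytes as a `Slice`.
+    pool.get_writable(4).copy_from_slice(&[1, 2, 3, 4]);
+    let mut first = pool.get_data_owned();
+    assert_eq!(first.as_mut(), &[1u8, 2, 3, 4][..]);
+
+    // Dropping the `Slice` marks its slot as free again...
+    drop(first);
+
+    // ...so the next message can reuse the preallocated memory.
+    pool.get_writable(4).copy_from_slice(&[5, 6, 7, 8]);
+    let mut second = pool.get_data_owned();
+    assert_eq!(second.as_mut(), &[5u8, 6, 7, 8][..]);
+}
+```
+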
+### Buffer Management and Allocation
+
+`BufferPool` is useful for working with sequentially processed buffers, such as filling a buffer,
+retrieving it, and then reusing it as needed. `BufferPool` optimizes for memory reuse by providing
+preallocated memory that can be used in one of three modes:
 
-### Fragmentation overflow and optimization
-`BufferPool` can allocate a maximum of 8 `Slices` (cause it uses an `AtomicU8` to keep track of the
-used and freed slots) and at maximum `capacity` bytes. Whenever all the 8 slots are tacked or there
-is no more space on the preallocated memory `BufferPool` failover to a `BufferFromSystemMemory`.
+1. **Back Mode**: Default mode where allocations start from the back of the buffer.
+2. **Front Mode**: Used when slots at the back are full but memory can still be reused by moving to
+   the front.
+3. **Alloc Mode**: Falls back to system memory allocation (`BufferFromSystemMemory`) when both back
+   and front sections are full, providing additional capacity but with reduced performance.
 
-Usually, a message is decoded then a response is sent then a new message is decoded, etc.
-So `BufferPool` is optimized for use all the slots then check if the first slot has been dropped
-If so use it, then check if the second slot has been dropped, and so on.
-`BufferPool` is also optimized to drop all the slices.
-`BufferPool` is also optimized to drop the last slice.
+`BufferPool` can only become fragmented between the front and back sections, and between the back
+section and the end of the preallocated memory.
 
-Below a graphical representation of the most optimized cases:
-```
-A number [0f] means that the slot is taken, the minus symbol (-) means the slot is free
-There are 8 slot
+#### Fragmentation, Overflow, and Optimization
+
+`BufferPool` can allocate a maximum of `8` `Slice`s (as it uses an `AtomicU8` to track used and
+freed slots) and up to the defined capacity in bytes. If all `8` slots are taken or there is no more
+space in the preallocated memory, `BufferPool` falls back to `BufferFromSystemMemory`.
+
+Typically, `BufferPool` is used to process messages sequentially (decode a message, send a response,
+decode the next message, and so on). It is therefore optimized to check for freed slots starting
+from the beginning and to reuse them before allocating further. It also efficiently handles the
+common cases where all slices are dropped at once or where only the last slice is released, keeping
+memory fragmentation low.
+
+The following cases illustrate typical memory usage patterns within `BufferPool`:
+
+1. Slots fill from back to front, switching as each area reaches capacity.
+2. Pool resets upon full usage, then reuses back slots.
+3. After filling the back, front slots are used when they become available.
+
+Below is a graphical representation of the most optimized cases. A hexadecimal digit (`0`-`f`) means
+that the slot is taken; the minus symbol (`-`) means the slot is free. There are `8` slots.
+```
 CASE 1
 -------- BACK MODE
 1------- BACK MODE
@@ -147,30 +196,60 @@ CASE 3
 9a3456-- SWITCH TO BACK MODE
 9a3456b- BACK MODE
 9a3456bc BACK MODE
+```
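+
+The overflow behavior described above can be exercised directly: if every `Slice` is kept alive so
+no slot can be recycled, writes beyond the pool's capacity (or beyond `8` live slices) are served
+from system memory instead of failing. A sketch, under the same assumed constructor and imports as
+the earlier examples:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Deliberately tiny pool: 16 preallocated bytes and at most 8 tracked slots.
+    let mut pool = BufferPool::new(16);
+    let mut held = Vec::new();
+
+    // Keep every `Slice` alive so neither back nor front slots can be reused.
+    for i in 0u8..10 {
+        pool.get_writable(4).copy_from_slice(&[i; 4]);
+        held.push(pool.get_data_owned());
+    }
+
+    // All writes succeeded: once the preallocated memory and the 8 slots were
+    // exhausted, the pool fell back to system memory (alloc mode) as described above.
+    for (i, slice) in held.iter_mut().enumerate() {
+        assert_eq!(slice.as_mut(), &[i as u8; 4][..]);
+    }
+}
+```
+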
+## Benchmarks and Performance
+
+To run the benchmarks, execute:
+```
+cargo bench --features criterion
+```
 
-`BufferPool` can operate in three modalities:
-1. Back: it allocates in the back of the inner vector
-2. Front: it allocates in the front of the inner vector
-3. Alloc: failover to `BufferFromSystemMemory`
 
-`BufferPool` can be fragmented only between front and back and between back and end.
+## Benchmark Comparisons
 
-### Performance
+`BufferPool` is benchmarked against `BufferFromSystemMemory` and two additional structures for
+reference: `PPool` (a hashmap-based pool) and `MaxEfficeincy` (a highly optimized but unrealistic
+control implementation written so that the benchmarks do not panic and the compiler does not
+complain). `BufferPool` generally provides better performance and lower latency than `PPool` and
+`BufferFromSystemMemory`.
 
-To run the benchmarks `cargo bench --features criterion`.
+**Note**: Both `PPool` and `MaxEfficeincy` are completely broken and are only useful as references
+for the benchmarks.
 
-To have an idea of the performance gains, `BufferPool` is benchmarked against
-`BufferFromSystemMemory` and two control structures `PPool` and `MaxEfficeincy`.
+### `BENCHES.md` Benchmarks
+
+`BufferPool` always outperforms `PPool` (the hashmap-based pool) and the solution without a pool.
 
-`PPool` is a buffer pool implemented with a hashmap and `MaxEfficeincy` is a `Buffer` implemented in the
-fastest possible way so that the benches do not panic and the compiler does not complain. Btw they are
-both completely broken, useful only as references for the benchmarks.
+Executed for 2,000 samples:
 
-The benchmarks are:
+```
+* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
+* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
+* single thread with `PPoll`: --------------------------------------- 32.593 ms
+* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
+* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
+* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
+* multi-thread with `PPoll`: --------------------------------------- 49.790 ms
+* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
+* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
+* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
+* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
+* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
+```
 
-#### Single thread
+### Single Thread Benchmarks
+
+If the buffer is not sent to another context, `BufferPool` is 1.4 times faster than no pool, 4.3
+times faster than `PPool`, and 5.7 times slower than max efficiency.
+
+Average times for 1,000 operations:
+
+- `BufferPool`: 7.5 ms
+- `BufferFromSystemMemory`: 10.27 ms
+- `PPool`: 32.59 ms
+- `MaxEfficiency`: 1.26 ms
 
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -178,9 +257,18 @@ for 0..1000:
     add random bytes to the buffer
     get the buffer
    drop the 2 buffer
-
-```
+```
+
+### Multi-Threaded Benchmarks (most similar to the actual use case)
+
+If the buffer is sent to other contexts, `BufferPool` is 4 times faster than no pool, 0.6 times
+faster than `PPool`, and 1.8 times slower than max efficiency.
+
+- `BufferPool`: 34.66 ms
+- `BufferFromSystemMemory`: 142.23 ms
+- `PPool`: 49.79 ms
+- `MaxEfficiency`: 18.20 ms
 
-#### Multi threads (this is the most similar to the actual use case IMHO)
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -189,9 +277,9 @@ for 0..1000:
     add random bytes to the buffer
     get the buffer
     send the buffer to another thread -> wait 1 ms and then drop it
-
-```
+```
 
-#### Multi threads 2
+### Multi-Threaded Benchmarks 2
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -201,54 +289,35 @@ for 0..1000:
     get the buffer
     send the buffer to another thread -> wait 1 ms and then drop it
     wait for the 2 buffer to be dropped
-
 ```
 
-#### Test
-Some failing cases from fuzz.
 
-#### From the benchmark in BENCHES.md executed for 2000 samples:
-```
-* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
-* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
-* single thread with `PPoll`: --------------------------------------- 32.593 ms
-* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
-* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
-* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
-* multi-thread with `PPoll`: --------------------------------------- 49.790 ms
-* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
-* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
-* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
-* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
-* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
-```
-From the above numbers, it results that `BufferPool` always outperform the hashmap buffer pool and
-the solution without a pool:
 
-#### Single thread
-If the buffer is not sent to another context `BufferPool` is 1.4 times faster than no pool and 4.3 time
-faster than the hashmap pool and 5.7 times slower than max efficiency.
 
-#### Multi threads
-If the buffer is sent to other contexts `BufferPool` is 4 times faster than no pool, 0.6 times faster
-than the hashmap pool and 1.8 times slower than max efficiency.
+## Fuzz Testing
+
+Install `cargo-fuzz` with:
 
-### Fuzzy tests
-Install cargo fuzz with `cargo install cargo-fuzz`
+```bash
+cargo install cargo-fuzz
+```
 
-Then do `cd ./fuzz`
+Run the fuzz tests:
 
+```bash
+cd ./fuzz
+cargo fuzz run slower -- -rss_limit_mb=5000000000
+cargo fuzz run faster -- -rss_limit_mb=5000000000
+```
+
+The tests must be run with `-rss_limit_mb=5000000000` because they exercise `BufferPool` with
+capacities from `0` to `2^32`.
 
-Run them with `cargo fuzz run slower -- -rss_limit_mb=5000000000` and
-`cargo fuzz run faster -- -rss_limit_mb=5000000000`
+`BufferPool` is fuzz-tested to ensure memory reliability across different scenarios, including
+delayed memory release and cross-thread access. The tests check whether slices created by
+`BufferPool` still contain the same bytes they contained at creation time, after a random amount of
+time and after they have been sent to other threads.
 
-`BufferPool` is fuzzy tested with `cargo fuzzy`. The test checks if slices created by `BufferPool`
-still contain the same bytes contained at creation time after a random amount of time and after been
-sent to other threads. There are 2 fuzzy test, the first (faster) it map a smaller input space to
-test the most likely inputs, the second (slower) it have a bigger input space to pick "all" the
-corner case. The slower also forces the buffer to be sent to different cores. I run both for several
-hours without crashes.
+Two main fuzz tests are provided:
+
+1. **Faster**: Maps a smaller input space to test the most likely inputs.
+2. **Slower**: Has a bigger input space to explore "all" the edge cases. It also forces the buffer
+   to be sent to different cores.
+
+Both tests have been run for several hours without crashes.
 
-The test must be run with `-rss_limit_mb=5000000000` cause they check `BufferPool` with capacities
-from 0 to 2^32.
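+
+The property those fuzz targets exercise can also be illustrated with a tiny self-contained check,
+separate from the actual harnesses in `./fuzz` and under the same assumed constructor and imports
+as the earlier sketches: bytes written into a `Slice` stay intact after the `Slice` is sent to
+another thread and released later.
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+use std::{thread, time::Duration};
+
+fn main() {
+    let mut pool = BufferPool::new(64);
+    pool.get_writable(4).copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);
+
+    // `Slice` implements `Send`, so it can be moved to another thread and dropped there.
+    let mut slice = pool.get_data_owned();
+    let handle = thread::spawn(move || {
+        // Simulate delayed release, then check that the bytes are unchanged.
+        thread::sleep(Duration::from_millis(1));
+        assert_eq!(slice.as_mut(), &[0xde_u8, 0xad, 0xbe, 0xef][..]);
+    });
+    handle.join().unwrap();
+
+    // Back on the main thread, the pool keeps serving new buffers as usual.
+    pool.get_writable(2).copy_from_slice(&[1, 2]);
+    let mut next = pool.get_data_owned();
+    assert_eq!(next.as_mut(), &[1u8, 2][..]);
+}
+```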