From c7cc9456396e539bdc20f0522a48b104e37eea81 Mon Sep 17 00:00:00 2001
From: RJ Rybarczyk
Date: Mon, 4 Nov 2024 19:44:55 -0500
Subject: [PATCH] Update README

---
 utils/buffer/README.md | 333 +++++++++++++++++++++++++----------------
 1 file changed, 201 insertions(+), 132 deletions(-)

diff --git a/utils/buffer/README.md b/utils/buffer/README.md
index 39ac180e2..27a626966 100644
--- a/utils/buffer/README.md
+++ b/utils/buffer/README.md
@@ -1,105 +1,154 @@
-# BufferPool
+# `buffer_sv2`
 
-This crate provides a `Write` trait used to replace `std::io::Write` in a non_std environment a `Buffer`
-trait and two implementations of `Buffer`: `BufferFromSystemMemory` and `BufferPool`.
+[![crates.io](https://img.shields.io/crates/v/buffer_sv2.svg)](https://crates.io/crates/buffer_sv2)
+[![docs.rs](https://docs.rs/buffer_sv2/badge.svg)](https://docs.rs/buffer_sv2)
+[![rustc+](https://img.shields.io/badge/rustc-1.75.0%2B-lightgrey.svg)](https://blog.rust-lang.org/2023/12/28/Rust-1.75.0.html)
+[![license](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](https://github.com/stratum-mining/stratum/blob/main/LICENSE.md)
+[![codecov](https://codecov.io/gh/stratum-mining/stratum/branch/main/graph/badge.svg?flag=buffer_sv2-coverage)](https://codecov.io/gh/stratum-mining/stratum)
 
-## Intro
-`BufferPool` is useful whenever we need to work with buffers sequentially (fill a buffer, get
-the filled buffer, fill a new buffer, get the filled buffer, and so on).
+`buffer_sv2` handles memory management for Stratum v2 (Sv2) roles. It provides a memory-efficient
+buffer pool that minimizes allocations and deallocations during high-throughput message frame
+processing. Allocation overhead is reduced by reusing large buffers, which improves performance and
+lowers latency. The buffer pool tracks the usage of memory slices, using shared state tracking to
+safely manage memory across multiple threads.
 
-To fill a buffer `BufferPool` returns an `&mut [u8]` with the requested len (the filling part).
-When the buffer is filled and the owner needs to be changed, `BufferPool` returns a `Slice` that
-implements `Send` and `AsMut` (the get part).
+## Main Components
 
-`BufferPool` pre-allocates a user-defined capacity in the heap and use it to allocate the buffers,
-when a `Slice` is dropped `BufferPool` acknowledge it and reuse the freed space in the pre-allocated
-memory.
+- **Buffer Trait**: An interface for working with memory buffers. It has two implementations,
+  `BufferPool` and `BufferFromSystemMemory`, and the crate also provides a `Write` trait that
+  replaces `std::io::Write` in `no_std` environments.
+- **BufferPool**: A thread-safe pool of reusable memory buffers for high-throughput applications.
+- **BufferFromSystemMemory**: Manages a dynamically growing buffer in system memory for applications
+  where performance is not a concern.
+- **Slice**: A contiguous block of memory, either preallocated or dynamically allocated.
 
-## Implementation
-The crate is `[no_std]` and lock-free, so, to synchronize the state of the pre-allocated memory
-(taken or free) between different contexts an `AtomicU8` is used.
-Each bit of the `u8` represent a memory slot, if the bit is 0 the memory slot is free if is
-1 is taken. Whenever `BufferPool` creates a `Slice` a bit is set to 1 and whenever a `Slice` is
-dropped a bit is set to 0.
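+
+The "shared state tracking" mentioned above can be pictured as one bit per memory slot in an
+`AtomicU8` (`0` = free, `1` = taken): a bit is set when a `Slice` is handed out and cleared when it
+is dropped, so slot usage can be tracked across threads without locks. The sketch below is purely
+conceptual and is not the crate's internal code:
+
+```rust
+use std::sync::atomic::{AtomicU8, Ordering};
+
+// Conceptual model only: each bit of the `AtomicU8` marks one of the 8 slots
+// as free (0) or taken (1).
+fn main() {
+    let shared_state = AtomicU8::new(0);
+
+    // Handing out a `Slice` backed by slot 3: set bit 3 atomically.
+    shared_state.fetch_or(1 << 3, Ordering::SeqCst);
+    assert_eq!(shared_state.load(Ordering::SeqCst), 0b0000_1000);
+
+    // Dropping that `Slice`: clear bit 3 so the slot can be reused.
+    shared_state.fetch_and(!(1 << 3), Ordering::SeqCst);
+    assert_eq!(shared_state.load(Ordering::SeqCst), 0);
+}
+```
+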
+## Usage
 
-## Use case
-`BufferPool` has been developed to be used in proxies with thousand of connection, each connection
-must parse a particular data format via a `Decoder` each decoder use 1 or 2 `Buffer` for each received
-message. With `BufferPool` each connection can be instantiated with its own `BufferPool` and reuse
-the space freed by old messages for new ones.
+To include this crate in your project, run:
 
-## Unsafe
-There are 5 unsafes:
-buffer_pool/mod.rs 550
-slice.rs 8
-slice.rs 27
+```bash
+cargo add buffer_sv2
+```
+
+This crate can be built with the following feature flags:
+
+- `debug`: Provides additional tracking for debugging memory management issues.
+- `fuzz`: Enables support for fuzz testing.
+- `with_serde`: Builds the [`binary_sv2`](https://crates.io/crates/binary_sv2) and
+  [`buffer_sv2`](https://crates.io/crates/buffer_sv2) crates with `serde`-based encoding and
+  decoding. Note that this feature flag is only used for the Message Generator and is deprecated
+  for any other use. It will likely be fully deprecated in the future.
+
+### Unsafe Code
+There are four `unsafe` code blocks:
+
+- `buffer_pool/mod.rs`: `fn get_writable_(&mut self, len: usize, shared_state: u8, without_check: bool) -> &mut [u8] { .. }` in the `impl BufferPool`
+- `slice.rs`:
+  - `unsafe impl Send for Slice {}`
+  - `fn as_mut(&mut self) -> &mut [u8] { .. }` in the `impl AsMut<[u8]> for Slice`
+  - `fn as_ref(&self) -> &[u8] { .. }` in the `impl AsRef<[u8]> for Slice`
+
+### Examples
+
+This crate provides four examples demonstrating how the memory is managed:
+
+1. **[Basic Usage Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/basic_buffer_pool.rs)**:
+   Creates a buffer pool, writes to it, and retrieves the data from it.
 
-## Write
-Waiting for `Write` in `core` a compatible trait is used so that it can be replaced.
+2. **[Buffer Pool Exhaustion Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/buffer_pool_exhaustion.rs)**:
+   Demonstrates how data is added to a buffer pool and dynamically allocates directly to the heap
+   once the buffer pool's capacity has been exhausted.
 
-## Buffer
-The `Buffer` trait has been written to work with `codec_sv2::Decoder`.
+3. **[Variable Sized Messages Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/variable_sized_messages.rs)**:
+   Writes messages of variable sizes to the buffer pool.
 
-`codec_sv2::Decoder` works by:
-1. fill a buffer of the size of the header of the protocol that is decoding
-2. parse the filled bytes and compute the message length
-3. fill a buffer of the size of the message
-4. use the header and the message to construct a `frame_sv2::Frame`
+4. **[Multi Threaded Example](https://github.com/stratum-mining/stratum/blob/main/protocols/v2/codec-sv2/examples/multi_threaded_buffer_pool.rs)**:
+   Writes to the buffer pool in a multi-threaded context.
 
-To fill the buffer `Decoder` must pass a reference of the buffer to a filler. In order
-to construct a `Frame` the `Decoder` must pass the ownership of the buffer to `Frame`.
+## `Buffer` Trait
+
+The `Buffer` trait is designed to work with the
+[`codec_sv2`](https://docs.rs/codec_sv2/1.3.0/codec_sv2/index.html) decoders, which operate by the
+following steps (sketched in code right after the list):
+
+1. Filling a buffer of the size of the protocol header being decoded.
+2. Parsing the filled bytes to compute the message length.
+3. Filling a buffer of the size of the message.
+4. Using the header and message to construct a
+   [`framing_sv2::framing::Frame`](https://docs.rs/framing_sv2/2.0.0/framing_sv2/framing/enum.Frame.html).
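+
+A minimal sketch of those four steps, using the two `Buffer` methods documented in this section
+(`get_writable` and `get_data_owned`). The 2-byte length prefix is an invented toy format for
+illustration only (not the actual Sv2 framing), the `BufferPool::new(capacity)` constructor and
+crate-root imports are assumptions based on the crate docs, and error handling is omitted:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Assumed constructor: preallocates `capacity` bytes on the heap.
+    let mut pool = BufferPool::new(1024);
+
+    // Toy wire format: 2-byte big-endian length prefix, then the payload.
+    let incoming = [0u8, 3, 0xaa, 0xbb, 0xcc];
+
+    // 1. Fill a buffer of the size of the header.
+    pool.get_writable(2).copy_from_slice(&incoming[..2]);
+
+    // 2. Parse the filled bytes and compute the message length.
+    let msg_len = u16::from_be_bytes([incoming[0], incoming[1]]) as usize;
+
+    // 3. Fill a buffer of the size of the message.
+    pool.get_writable(msg_len).copy_from_slice(&incoming[2..2 + msg_len]);
+
+    // 4. Take ownership of header + payload (e.g. to build a frame); this resets the buffer.
+    let mut frame_bytes = pool.get_data_owned();
+    assert_eq!(frame_bytes.as_mut(), &incoming[..]);
+}
+```
+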
+To fill the buffer, the `codec_sv2` decoder must pass a reference of the buffer to a filler. To
+construct a `Frame`, the decoder must pass ownership of the buffer to the `Frame`.
 
 ```rust
-get_writable(&mut self, len: usize) -> &mut [u8]
+fn get_writable(&mut self, len: usize) -> &mut [u8];
 ```
 
-Return a mutable reference to the buffer, starting at buffer length and ending at buffer length + `len`.
-and set buffer len at previous len + len.
+The `get_writable` method returns a mutable reference to the buffer, starting at the current length
+and ending at the current length plus `len`, and sets the buffer length to the previous length plus
+`len`.
 
 ```rust
-get_data_owned(&mut self) -> Slice {
+fn get_data_owned(&mut self) -> Slice;
 ```
 
-It returns `Slice`: something that implements `AsMut[u8]` and `Send`, and sets the buffer len to 0.
+The `get_data_owned` method returns a `Slice` that implements `AsMut<[u8]>` and `Send`, and resets
+the buffer length to `0`.
+
+The `Buffer` trait is implemented for `BufferFromSystemMemory` and `BufferPool`. The crate also
+provides a `Write` trait to replace `std::io::Write` in `no_std` environments.
 
-## BufferFromSystemMemory
-Is the simplest implementation of a `Buffer`: each time that a new buffer is needed it create a new
-`Vec`.
+## `BufferFromSystemMemory`
 
-`get_writable(..)` returns mutable references to the inner vector.
+`BufferFromSystemMemory` is a simple implementation of the `Buffer` trait. Each time a new buffer is
+needed, it creates a new `Vec`.
 
-`get_data_owned(..)` returns the inner vector.
+- `get_writable(..)` returns mutable references to the inner vector.
+- `get_data_owned(..)` returns the inner vector.
 
-## BufferPool
-Usually `BufferFromSystemMemory` should be enough, but sometimes it is better to use something faster.
+## `BufferPool`
 
-For each Sv2 connection, there is a `Decoder` and for each decoder, there are 1 or 2 buffers.
-
-Proxies and pools with thousands of connections should use `Decoder` rather than
-`Decoder`
+While `BufferFromSystemMemory` is sufficient for many cases, `BufferPool` offers a more efficient
+solution for high-performance applications, such as proxies and pools with thousands of connections.
 
-`BufferPool` when created preallocate a user-defined capacity of bytes in the heap using a
-`Vec`, then when `get_data_owned(..)` is called it create a `Slice` that contains a view into
-the preallocated memory. `BufferPool` guarantees that slices never overlap and the uniqueness of
-the `Slice` ownership.
+When created, `BufferPool` preallocates a user-defined capacity of bytes in the heap using a
+`Vec`. When `get_data_owned(..)` is called, it creates a `Slice` that contains a view into the
+preallocated memory. `BufferPool` guarantees that slices never overlap and maintains unique
+ownership of each `Slice`.
 
-`Slice` implements `Drop` so that the view into the preallocated memory can be reused.
+`Slice` implements the `Drop` trait, allowing the view into the preallocated memory to be reused
+once the `Slice` is dropped.
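+
+A short sketch of that lifecycle, under the same assumed constructor and imports as the earlier
+example: data is written through `get_writable`, handed out as a `Slice` via `get_data_owned`, and
+the underlying preallocated memory becomes reusable as soon as the `Slice` is dropped:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Preallocates 128 bytes on the heap (constructor name assumed).
+    let mut pool = BufferPool::new(128);
+
+    // Fill the buffer, then take ownership of the written bytes as a `Slice`.
+    pool.get_writable(4).copy_from_slice(&[1, 2, 3, 4]);
+    let mut first = pool.get_data_owned();
+    assert_eq!(first.as_mut(), &[1u8, 2, 3, 4][..]);
+
+    // Dropping the `Slice` marks its slot as free again...
+    drop(first);
+
+    // ...so the next message can reuse the preallocated memory.
+    pool.get_writable(4).copy_from_slice(&[5, 6, 7, 8]);
+    let mut second = pool.get_data_owned();
+    assert_eq!(second.as_mut(), &[5u8, 6, 7, 8][..]);
+}
+```
+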
+### Buffer Management and Allocation
+
+`BufferPool` is useful for working with sequentially processed buffers, such as filling a buffer,
+retrieving it, and then reusing it as needed. `BufferPool` optimizes for memory reuse by providing
+preallocated memory that can be used in one of three modes:
 
-### Fragmentation overflow and optimization
-`BufferPool` can allocate a maximum of 8 `Slices` (cause it uses an `AtomicU8` to keep track of the
-used and freed slots) and at maximum `capacity` bytes. Whenever all the 8 slots are tacked or there
-is no more space on the preallocated memory `BufferPool` failover to a `BufferFromSystemMemory`.
+1. **Back Mode**: Default mode where allocations start from the back of the buffer.
+2. **Front Mode**: Used when slots at the back are full but memory can still be reused by moving to
+   the front.
+3. **Alloc Mode**: Falls back to system memory allocation (`BufferFromSystemMemory`) when both back
+   and front sections are full, providing additional capacity but with reduced performance.
 
-Usually, a message is decoded then a response is sent then a new message is decoded, etc.
-So `BufferPool` is optimized for use all the slots then check if the first slot has been dropped
-If so use it, then check if the second slot has been dropped, and so on.
-`BufferPool` is also optimized to drop all the slices.
-`BufferPool` is also optimized to drop the last slice.
+`BufferPool` can only become fragmented between the front and back sections, and between the back
+section and the end of the preallocated memory.
 
-Below a graphical representation of the most optimized cases:
-```
-A number [0f] means that the slot is taken, the minus symbol (-) means the slot is free
-There are 8 slot
+#### Fragmentation, Overflow, and Optimization
+
+`BufferPool` can allocate a maximum of `8` `Slice`s (as it uses an `AtomicU8` to track used and
+freed slots) and up to the defined capacity in bytes. If all `8` slots are taken or there is no more
+space in the preallocated memory, `BufferPool` falls back to `BufferFromSystemMemory`.
+
+Typically, `BufferPool` is used to process messages sequentially (decode a message, send a response,
+decode the next message, and so on). It is therefore optimized to check for freed slots starting
+from the beginning and to reuse them before allocating further. It also efficiently handles the
+common cases where all slices are dropped at once or where only the last slice is released, keeping
+memory fragmentation low.
+
+The following cases illustrate typical memory usage patterns within `BufferPool`:
+
+1. Slots fill from back to front, switching as each area reaches capacity.
+2. Pool resets upon full usage, then reuses back slots.
+3. After filling the back, front slots are used when they become available.
+
+Below is a graphical representation of the most optimized cases. A hexadecimal digit (`0`-`f`) means
+that the slot is taken; the minus symbol (`-`) means the slot is free. There are `8` slots.
+```
 CASE 1
 -------- BACK MODE
 1------- BACK MODE
@@ -147,30 +196,60 @@ CASE 3
 9a3456-- SWITCH TO BACK MODE
 9a3456b- BACK MODE
 9a3456bc BACK MODE
+```
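+
+The overflow behavior described above can be exercised directly: if every `Slice` is kept alive so
+no slot can be recycled, writes beyond the pool's capacity (or beyond `8` live slices) are served
+from system memory instead of failing. A sketch, under the same assumed constructor and imports as
+the earlier examples:
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+
+fn main() {
+    // Deliberately tiny pool: 16 preallocated bytes and at most 8 tracked slots.
+    let mut pool = BufferPool::new(16);
+    let mut held = Vec::new();
+
+    // Keep every `Slice` alive so neither back nor front slots can be reused.
+    for i in 0u8..10 {
+        pool.get_writable(4).copy_from_slice(&[i; 4]);
+        held.push(pool.get_data_owned());
+    }
+
+    // All writes succeeded: once the preallocated memory and the 8 slots were
+    // exhausted, the pool fell back to system memory (alloc mode) as described above.
+    for (i, slice) in held.iter_mut().enumerate() {
+        assert_eq!(slice.as_mut(), &[i as u8; 4][..]);
+    }
+}
+```
+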
+## Benchmarks and Performance
+
+To run the benchmarks, execute:
+```
+cargo bench --features criterion
+```
 
-`BufferPool` can operate in three modalities:
-1. Back: it allocates in the back of the inner vector
-2. Front: it allocates in the front of the inner vector
-3. Alloc: failover to `BufferFromSystemMemory`
 
-`BufferPool` can be fragmented only between front and back and between back and end.
+## Benchmark Comparisons
 
-### Performance
+`BufferPool` is benchmarked against `BufferFromSystemMemory` and two additional structures for
+reference: `PPool` (a hashmap-based pool) and `MaxEfficeincy` (a highly optimized but unrealistic
+control implementation written so that the benchmarks do not panic and the compiler does not
+complain). `BufferPool` generally provides better performance and lower latency than `PPool` and
+`BufferFromSystemMemory`.
 
-To run the benchmarks `cargo bench --features criterion`.
+**Note**: Both `PPool` and `MaxEfficeincy` are completely broken and are only useful as references
+for the benchmarks.
 
-To have an idea of the performance gains, `BufferPool` is benchmarked against
-`BufferFromSystemMemory` and two control structures `PPool` and `MaxEfficeincy`.
+### `BENCHES.md` Benchmarks
+
+`BufferPool` always outperforms `PPool` (the hashmap-based pool) and the solution without a pool.
 
-`PPool` is a buffer pool implemented with a hashmap and `MaxEfficeincy` is a `Buffer` implemented in the
-fastest possible way so that the benches do not panic and the compiler does not complain. Btw they are
-both completely broken, useful only as references for the benchmarks.
+Executed for 2,000 samples:
 
-The benchmarks are:
+```
+* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
+* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
+* single thread with `PPoll`: --------------------------------------- 32.593 ms
+* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
+* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
+* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
+* multi-thread with `PPoll`: --------------------------------------- 49.790 ms
+* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
+* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
+* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
+* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
+* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
+```
 
-#### Single thread
+### Single Thread Benchmarks
+
+If the buffer is not sent to another context, `BufferPool` is 1.4 times faster than no pool, 4.3
+times faster than `PPool`, and 5.7 times slower than max efficiency.
+
+Average times for 1,000 operations:
+
+- `BufferPool`: 7.5 ms
+- `BufferFromSystemMemory`: 10.27 ms
+- `PPool`: 32.59 ms
+- `MaxEfficiency`: 1.26 ms
 
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -178,9 +257,18 @@ for 0..1000:
     add random bytes to the buffer
     get the buffer
    drop the 2 buffer
-
-```
+```
+
+### Multi-Threaded Benchmarks (most similar to the actual use case)
+
+If the buffer is sent to other contexts, `BufferPool` is 4 times faster than no pool, 0.6 times
+faster than `PPool`, and 1.8 times slower than max efficiency.
+
+- `BufferPool`: 34.66 ms
+- `BufferFromSystemMemory`: 142.23 ms
+- `PPool`: 49.79 ms
+- `MaxEfficiency`: 18.20 ms
 
-#### Multi threads (this is the most similar to the actual use case IMHO)
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -189,9 +277,9 @@ for 0..1000:
     add random bytes to the buffer
     get the buffer
     send the buffer to another thread -> wait 1 ms and then drop it
-
-```
+```
 
-#### Multi threads 2
+### Multi-Threaded Benchmarks 2
 ```
 for 0..1000:
     add random bytes to the buffer
@@ -201,54 +289,35 @@ for 0..1000:
     get the buffer
     send the buffer to another thread -> wait 1 ms and then drop it
     wait for the 2 buffer to be dropped
-
 ```
 
-#### Test
-Some failing cases from fuzz.
 
-#### From the benchmark in BENCHES.md executed for 2000 samples:
-```
-* single thread with `BufferPool`: ---------------------------------- 7.5006 ms
-* single thread with `BufferFromSystemMemory`: ---------------------- 10.274 ms
-* single thread with `PPoll`: --------------------------------------- 32.593 ms
-* single thread with `MaxEfficeincy`: ------------------------------- 1.2618 ms
-* multi-thread with `BufferPool`: ---------------------------------- 34.660 ms
-* multi-thread with `BufferFromSystemMemory`: ---------------------- 142.23 ms
-* multi-thread with `PPoll`: --------------------------------------- 49.790 ms
-* multi-thread with `MaxEfficeincy`: ------------------------------- 18.201 ms
-* multi-thread 2 with `BufferPool`: ---------------------------------- 80.869 ms
-* multi-thread 2 with `BufferFromSystemMemory`: ---------------------- 192.24 ms
-* multi-thread 2 with `PPoll`: --------------------------------------- 101.75 ms
-* multi-thread 2 with `MaxEfficeincy`: ------------------------------- 66.972 ms
-```
-From the above numbers, it results that `BufferPool` always outperform the hashmap buffer pool and
-the solution without a pool:
 
-#### Single thread
-If the buffer is not sent to another context `BufferPool` is 1.4 times faster than no pool and 4.3 time
-faster than the hashmap pool and 5.7 times slower than max efficiency.
 
-#### Multi threads
-If the buffer is sent to other contexts `BufferPool` is 4 times faster than no pool, 0.6 times faster
-than the hashmap pool and 1.8 times slower than max efficiency.
+## Fuzz Testing
+
+Install `cargo-fuzz` with:
 
-### Fuzzy tests
-Install cargo fuzz with `cargo install cargo-fuzz`
+```bash
+cargo install cargo-fuzz
+```
 
-Then do `cd ./fuzz`
+Run the fuzz tests:
 
+```bash
+cd ./fuzz
+cargo fuzz run slower -- -rss_limit_mb=5000000000
+cargo fuzz run faster -- -rss_limit_mb=5000000000
+```
+
+The tests must be run with `-rss_limit_mb=5000000000` because they exercise `BufferPool` with
+capacities from `0` to `2^32`.
 
-Run them with `cargo fuzz run slower -- -rss_limit_mb=5000000000` and
-`cargo fuzz run faster -- -rss_limit_mb=5000000000`
+`BufferPool` is fuzz-tested to ensure memory reliability across different scenarios, including
+delayed memory release and cross-thread access. The tests check whether slices created by
+`BufferPool` still contain the same bytes they contained at creation time, after a random amount of
+time and after they have been sent to other threads.
 
-`BufferPool` is fuzzy tested with `cargo fuzzy`. The test checks if slices created by `BufferPool`
-still contain the same bytes contained at creation time after a random amount of time and after been
-sent to other threads. There are 2 fuzzy test, the first (faster) it map a smaller input space to
-test the most likely inputs, the second (slower) it have a bigger input space to pick "all" the
-corner case. The slower also forces the buffer to be sent to different cores. I run both for several
-hours without crashes.
+Two main fuzz tests are provided:
+
+1. **Faster**: Maps a smaller input space to test the most likely inputs.
+2. **Slower**: Has a bigger input space to explore "all" the edge cases. It also forces the buffer
+   to be sent to different cores.
+
+Both tests have been run for several hours without crashes.
 
-The test must be run with `-rss_limit_mb=5000000000` cause they check `BufferPool` with capacities
-from 0 to 2^32.
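+
+The property those fuzz targets exercise can also be illustrated with a tiny self-contained check,
+separate from the actual harnesses in `./fuzz` and under the same assumed constructor and imports
+as the earlier sketches: bytes written into a `Slice` stay intact after the `Slice` is sent to
+another thread and released later.
+
+```rust
+use buffer_sv2::{Buffer, BufferPool};
+use std::{thread, time::Duration};
+
+fn main() {
+    let mut pool = BufferPool::new(64);
+    pool.get_writable(4).copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);
+
+    // `Slice` implements `Send`, so it can be moved to another thread and dropped there.
+    let mut slice = pool.get_data_owned();
+    let handle = thread::spawn(move || {
+        // Simulate delayed release, then check that the bytes are unchanged.
+        thread::sleep(Duration::from_millis(1));
+        assert_eq!(slice.as_mut(), &[0xde_u8, 0xad, 0xbe, 0xef][..]);
+    });
+    handle.join().unwrap();
+
+    // Back on the main thread, the pool keeps serving new buffers as usual.
+    pool.get_writable(2).copy_from_slice(&[1, 2]);
+    let mut next = pool.get_data_owned();
+    assert_eq!(next.as_mut(), &[1u8, 2][..]);
+}
+```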