Merge pull request #65 from kube-rs/blog-memory
Add a blog post about memory performance improvements
clux authored Jun 12, 2024
2 parents 043ff26 + 7a97293 commit 8e8365e
Showing 5 changed files with 205 additions and 2 deletions.
6 changes: 6 additions & 0 deletions docs/blog/.authors.yml
@@ -0,0 +1,6 @@
authors:
  clux:
    name: Eirik
    description: Maintainer
    avatar: https://github.com/clux.png
    url: https://github.com/clux
1 change: 1 addition & 0 deletions docs/blog/index.md
@@ -0,0 +1 @@
# Blog
192 changes: 192 additions & 0 deletions docs/blog/posts/2024-06-11-reflector-memory.md
@@ -0,0 +1,192 @@
---
authors:
- clux
date: 2024-06-11
description: >
  Free memory optimizations available in 0.92.0 for users of kube::runtime
---

# Watcher Memory Improvements

In [0.92.0](https://github.com/kube-rs/kube/releases/tag/0.92.0), [watcher] dropped its internal buffering of state and now fully delegates any potential buffering to the associated [Store].

This has resulted in a pretty big memory improvement for direct users of [watcher], but also (somewhat unintuitively) for users of reflectors and stores.

Why does this change improve all cases? Why did we buffer in the first place?

<!-- more -->

## Runtime Memory Performance

The memory profile of any application using `kube::runtime` is often dominated by the buffers holding the Kubernetes objects being watched. The main offender is the [reflector], which hides a literal `type Cache<K> = Arc<RwLock<AHashMap<ObjectRef<K>, Arc<K>>>>` internally as the lookup used by [Store]s and [Controller]s.

We have lots of advice on how to __reduce the size of this cache__. The [[optimization]] guide shows how to:

- **minimize what you watch** :: by constraining watch parameters with selectors
- **minimize what you ask for** :: use [metadata_watcher] on watches that do not need the `.spec`
- **minimize what you store** :: by dropping fields before sending to stores

These are quick and easy steps to improve the memory profile and are worth checking out (the benefits of doing these will further increase in 0.92).
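
As a rough sketch, combining all three could look something like the following (assuming a recent `kube` with the `runtime` feature plus `k8s-openapi`, `futures`, `tokio`, and `anyhow`; adaptor names such as `modify` may vary between versions):

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{metadata_watcher, reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // minimize what you watch: constrain the watch with a label selector
    let cfg = watcher::Config::default().labels("app=my-app");

    // minimize what you ask for: metadata_watcher omits .spec and .status entirely
    let (reader, writer) = reflector::store();
    let stream = metadata_watcher(pods, cfg)
        // minimize what you store: strip bulky fields before objects reach the store
        .modify(|meta| {
            meta.metadata.managed_fields = None;
            meta.metadata.annotations = None;
        })
        .reflect(writer)
        .applied_objects();

    // drive the stream; in a real app `reader` would be handed to whatever needs cached lookups
    let mut stream = std::pin::pin!(stream);
    while let Some(obj) = stream.try_next().await? {
        println!("saw {:?} ({} objects cached)", obj.metadata.name, reader.len());
    }
    Ok(())
}
```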

Improving what is stored in the `Cache` above is important, but it is not the full picture...

## The Watch API

The Kubernetes watch API is an interesting beast. You have no guarantee you'll see every event, and you must be able to restart from a potentially new checkpoint without being told what changed during the downtime. This is mentioned briefly in [Kubernetes API concepts](https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes) as an implication of its `410 Gone` responses.

When a `410 Gone` response happens, we need to trigger a re-list and wait for all the data to come through before we are back in a "live watching" mode that is caught up with reality. This type of API consumption is problematic when you work with reflectors/caches, where you are generally storing __complete__ snapshots in memory for a worker task. Controllers are effectively forced to treat every event as a potential change, and chase [[reconciler#idempotency]] as a work-around for not having guaranteed delivery.

Let's focus on caches. To simplify these problems for users we have created certain guarantees in the abstractions of `kube::runtime`.

## Guarantees

Up until 0.92.0, [watcher] maintained a guarantee we have casually referred to as __watcher atomicity__:

!!! note "watcher atomicity < 0.92.0"

You only see a `Restarted` on re-lists once every object has been received through an `api.list`.
Watcher events will pause between a de-sync / restart and a `Restarted`. See [watcher::[email protected]](https://docs.rs/kube/0.91.0/kube/runtime/watcher/enum.Event.html).

This property meant that stores could in turn provide their own guarantee very easily:

!!! note "Store completeness"

[Store] always presents the full state once initialised. During a relist, previous state is presented.
There is no down-time for a store during relists, and its `Cache` is replaced __atomically__ in a single locked step.

This property is needed by Controllers, which rely on complete information, and it kicks in once the future from [Store::wait_until_ready] resolves.
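
For instance, a consumer relying on this guarantee might gate its work on that future (a small sketch, assuming a `Store<Pod>` obtained from `reflector::store()`):

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::runtime::reflector::Store;

// Sketch: block until the reflector has applied its first complete list,
// so that state() reflects a full snapshot rather than a partial one.
async fn dump_cache(reader: Store<Pod>) -> anyhow::Result<()> {
    reader.wait_until_ready().await?; // resolves once the initial list has landed
    for pod in reader.state() {
        println!("cached pod: {:?}", pod.metadata.name);
    }
    Ok(())
}
```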

## Consequences

If we do all the buffering on the `watcher` side, then achieving the store completeness guarantee is trivial.

Up until 0.91 this was handled in [`Store::[email protected]`](https://github.com/kube-rs/kube/blob/5dbae3a18c14a225d2d993b9effd16147fef420e/kube-runtime/src/reflector/store.rs#L96-L121) as:

```rust
// 0.91 source:
match event {
    watcher::Event::Applied(obj) => {
        let key = obj.to_object_ref(self.dyntype.clone());
        let obj = Arc::new(obj.clone());
        self.store.write().insert(key, obj);
    }
    watcher::Event::Deleted(obj) => {
        let key = obj.to_object_ref(self.dyntype.clone());
        self.store.write().remove(&key);
    }
    watcher::Event::Restarted(new_objs) => {
        let new_objs = new_objs
            .iter()
            .map(|obj| (obj.to_object_ref(self.dyntype.clone()), Arc::new(obj.clone())))
            .collect::<AHashMap<_, _>>();
        *self.store.write() = new_objs;
    }
}
```

Thus, on a relist/restart:

1. watcher pages were [buffered internally](https://github.com/kube-rs/kube/blob/5dbae3a18c14a225d2d993b9effd16147fef420e/kube-runtime/src/watcher.rs#L119-L124)
2. the `Restarted` arm was entered, where each object got cloned while creating `new_objs`
3. the store (containing the complete old data) was swapped at the very end

so there was a moment with **3x** potential peak memory use (**2x** should have been the max).

On top of that, the buffer in the `watcher` was not always released (quote from [discord](https://discord.com/channels/500028886025895936/1234736869317673022)):

> The default system allocator never returns the memory to the OS after the burst, even if the objects are dropped. Since the initial list fetch happens sporadically you get a higher RSS usage together with the memory spike. Solving the burst will solve this problem, and reflectors and watchers can be started in parallel without worrying of OOM killers.
> The allocator does not return the memory to the OS since it treats it as a cache. This is mitigated by using jemalloc with some tuning, however, you still get the memory burst so our solution was to use jemalloc + start the watchers sequentially. As you can imagine it's not ideal.

So in the end, memory use was actually holding on to between 2x and 3x the actual store size at all times.
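
For reference, the jemalloc mitigation mentioned in the quote is just a global allocator swap; a sketch using the third-party `tikv-jemallocator` crate (which you would add to `Cargo.toml` yourself):

```rust
// Not something kube does for you; this is the common workaround referenced above.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // the rest of the application now allocates through jemalloc
}
```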

!!! note "watcher guarantee was designed for the store guarantee"

If you were using `watcher` without `reflector`, you were the most affected by this excessive caching. You might not have needed __watcher atomicity__, as it was primarily designed to facilitate __store completeness__.

## Change in 0.92

The change in 0.92.0 is primarily to **stop buffering events in the `watcher`**, and to present __new watcher events__ that allow a store to achieve the store completeness guarantee.

As it stands, [`Store::[email protected]`](https://github.com/kube-rs/kube/blob/0ac1d07d073cc261af767c7f2b9bbf0629fca323/kube-runtime/src/reflector/store.rs#L99-L136) is now slightly smarter and achieves the same guarantee:

```rust
// 0.92 source
match event {
    watcher::Event::Apply(obj) => {
        let key = obj.to_object_ref(self.dyntype.clone());
        let obj = Arc::new(obj.clone());
        self.store.write().insert(key, obj);
    }
    watcher::Event::Delete(obj) => {
        let key = obj.to_object_ref(self.dyntype.clone());
        self.store.write().remove(&key);
    }
    watcher::Event::Init => {
        self.buffer = AHashMap::new();
    }
    watcher::Event::InitApply(obj) => {
        let key = obj.to_object_ref(self.dyntype.clone());
        let obj = Arc::new(obj.clone());
        self.buffer.insert(key, obj);
    }
    watcher::Event::InitDone => {
        let mut store = self.store.write();
        std::mem::swap(&mut *store, &mut self.buffer);
        self.buffer = AHashMap::new();
        // ...
    }
}
```

Thus, on a restart, objects are passed one-by-one up to the store and buffered there. When all objects have been received, the buffers are swapped (meaning you use at most 2x the data). The blank buffer re-assignment [also forces de-allocation](https://github.com/kube-rs/kube/pull/1494#discussion_r1602840218) of the temporary `self.buffer`.
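
The re-assignment matters because clearing a map keeps its backing allocation, whereas assigning a brand new map drops the old one entirely. A standalone illustration with a plain `HashMap` (not the actual store code):

```rust
use std::collections::HashMap;

fn main() {
    let mut buffer: HashMap<u32, String> = (0..10_000).map(|i| (i, i.to_string())).collect();

    // clear() keeps the backing allocation around as spare capacity...
    buffer.clear();
    assert!(buffer.capacity() > 0);

    // ...whereas re-assigning a fresh map drops the old map and releases its allocation
    buffer = HashMap::new();
    assert_eq!(buffer.capacity(), 0);
}
```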

!!! note "Preparing for StreamingLists"

Note that the new partial `InitApply` event only passes up __individual__ objects, not pages. This is to prepare for the [1.27 Alpha StreamingLists](https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists) feature, which also passes individual objects. Once this becomes available for even our minimum [[kubernetes-version]] we can make it the default - further reducing page buffers - and expose the literal API results rather than pages (of [default 500 objects](https://docs.rs/kube/0.91.0/kube/runtime/watcher/struct.Config.html#structfield.page_size)). In the meantime, we send pages through item-by-item to avoid a breaking change in the future (and also to avoid exposing the confusing concept of flattened/unflattened streams).
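
For the curious, opting into the streaming list strategy today looks roughly like this (a sketch; the `streaming_lists` builder on `watcher::Config` is how recent releases expose it, and the cluster must have the `WatchList` feature gate enabled):

```rust
use kube::runtime::watcher;

fn main() {
    // Opt in to the StreamingList init strategy instead of paginated lists.
    // Requires the WatchList feature gate (alpha in Kubernetes 1.27) server-side.
    let _cfg = watcher::Config::default().streaming_lists();
}
```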

## Results

The change saw **60% improvements** in [synthetic benchmarks](https://github.com/kube-rs/kube/pull/1494#issue-2292501600) when using stores, and **upwards of 80%** when not using stores (where there's nothing to cache), with further incremental improvements when using the `StreamingList` strategy.

I have seen [50% drops myself in real-world controllers](https://github.com/kube-rs/kube/pull/1494#issuecomment-2126694967). YMMV, particularly if you are doing a lot of other work, but please [reach out](https://discord.gg/tokio) with more results.

## Thoughts for the future

The fact that you can get >80% improvements when not using stores hints at a further future optimization: allowing users to opt out of the "store completeness" guarantee.

!!! note "Store Tradeoffs"

It is possible to build custom stores that avoid buffering objects on restarts by dropping the store completeness guarantee. This is not yet practical for `Controller` uses, due to requirements on the `Store` type, but perhaps this could be made generic/opt-out in the future.

As a step in the right direction, we would first like to get better visibility of our memory profile with some automated benchmarking. See [kube#1505](https://github.com/kube-rs/kube/issues/1505) for details.

## Breaking Change

Users who are not matching on `watcher::Event` or building custom stores should never need to interact with this, and should get the memory improvements for free.

If you are using a custom store, **please see the new [watcher::Event]** and make the following changes in your `match` arms:

- `Applied` -> `Apply`
- `Deleted` -> `Delete`
- `Restarted` -> replaced by three new events:
    * handle `Init` marking the start of a relist (allocate a temporary buffer)
    * buffer objects from `InitApply`
    * swap the store in `InitDone` and deallocate the old buffer

See the above `Store::apply_watcher_event` code for pointers.
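
As a hypothetical minimal custom store migrated to the new variants (the type and field names here are illustrative; only the `watcher::Event` variants and the `Lookup`/`ObjectRef` helpers come from kube):

```rust
use std::{collections::HashMap, sync::Arc};

use k8s_openapi::api::core::v1::Pod;
use kube::runtime::{
    reflector::{Lookup, ObjectRef},
    watcher,
};

// Illustrative custom store holding a live map plus a relist buffer.
#[derive(Default)]
struct MyStore {
    live: HashMap<ObjectRef<Pod>, Arc<Pod>>,
    buffer: HashMap<ObjectRef<Pod>, Arc<Pod>>,
}

impl MyStore {
    fn apply_watcher_event(&mut self, event: &watcher::Event<Pod>) {
        match event {
            watcher::Event::Apply(obj) => {
                self.live.insert(obj.to_object_ref(()), Arc::new(obj.clone()));
            }
            watcher::Event::Delete(obj) => {
                self.live.remove(&obj.to_object_ref(()));
            }
            watcher::Event::Init => {
                // relist starting: collect into a temporary buffer
                self.buffer = HashMap::new();
            }
            watcher::Event::InitApply(obj) => {
                self.buffer.insert(obj.to_object_ref(()), Arc::new(obj.clone()));
            }
            watcher::Event::InitDone => {
                // relist complete: promote the buffer and free the old map
                std::mem::swap(&mut self.live, &mut self.buffer);
                self.buffer = HashMap::new();
            }
        }
    }
}
```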

## Previous Improvements

Memory optimization is a continuing saga, and while the numbers herein are considerable, they build upon previous work:

1. Metadata API support in [0.79.0](https://github.com/kube-rs/kube/releases/tag/0.79.0)
2. Ability to pass minified streams into `Controller` in [0.81.0](https://github.com/kube-rs/kube/releases/tag/0.81.0) documented in [[streams]]
3. `Controller::owns` relation moved to lighter metadata watches in [0.84.0](https://github.com/kube-rs/kube/releases/tag/0.84.0)
4. Default pagination of watchers in [0.84.0](https://github.com/kube-rs/kube/releases/tag/0.84.0) via [#1249](https://github.com/kube-rs/kube/pull/1249)
5. Initial streaming list support in [0.86.0](https://github.com/kube-rs/kube/releases/tag/0.86.0)
6. Remove buffering in watcher in [0.92.0](https://github.com/kube-rs/kube/releases/tag/0.92.0) - today 🎉

Thanks to everyone who contributes to `kube`!


--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"
1 change: 1 addition & 0 deletions includes/links.md
@@ -29,6 +29,7 @@
[DynamicObject]: https://docs.rs/kube/latest/kube/core/struct.DynamicObject.html
[PartialObjectMeta]: https://docs.rs/kube/latest/kube/core/struct.PartialObjectMeta.html
[Store]: https://docs.rs/kube/latest/kube/runtime/reflector/struct.Store.html
[Store::wait_until_ready]: https://docs.rs/kube/latest/kube/runtime/reflector/struct.Store.html#method.wait_until_ready
[reflector]: https://docs.rs/kube/latest/kube/runtime/fn.reflector.html
[watcher]: https://docs.rs/kube/latest/kube/runtime/fn.watcher.html
[metadata_watcher]: https://docs.rs/kube/latest/kube/runtime/fn.metadata_watcher.html
7 changes: 5 additions & 2 deletions mkdocs.yml
@@ -110,6 +110,8 @@ nav:
- website.md
- tools.md
- architecture.md
- Blog:
    - blog/index.md

markdown_extensions:
- attr_list # https://squidfunk.github.io/mkdocs-material/reference/images/
@@ -132,8 +134,8 @@ markdown_extensions:
- pymdownx.betterem:
smart_enable: all
- pymdownx.emoji:
emoji_index: !!python/name:materialx.emoji.twemoji
emoji_generator: !!python/name:materialx.emoji.to_svg
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.caret
- pymdownx.critic
- pymdownx.details
@@ -155,6 +157,7 @@ plugins:
- search
- roamlinks
#- autolinks
- blog
- exclude:
glob:
- "*.tmp"
