diff --git a/docs/blog/.authors.yml b/docs/blog/.authors.yml
new file mode 100644
index 0000000..1c4d875
--- /dev/null
+++ b/docs/blog/.authors.yml
@@ -0,0 +1,6 @@
authors:
  clux:
    name: Eirik
    description: Maintainer
    avatar: https://github.com/clux.png
    url: https://github.com/clux
diff --git a/docs/blog/index.md b/docs/blog/index.md
new file mode 100644
index 0000000..05761ac
--- /dev/null
+++ b/docs/blog/index.md
@@ -0,0 +1 @@
# Blog
diff --git a/docs/blog/posts/2024-06-11-reflector-memory.md b/docs/blog/posts/2024-06-11-reflector-memory.md
new file mode 100644
index 0000000..729b6c9
--- /dev/null
+++ b/docs/blog/posts/2024-06-11-reflector-memory.md
@@ -0,0 +1,192 @@
---
authors:
  - clux
date: 2024-06-11
description: >
  Free memory optimizations available in 0.92.0 for users of kube::runtime
---

# Watcher Memory Improvements

In [0.92.0](https://github.com/kube-rs/kube/releases/tag/0.92.0), [watcher] dropped its internal buffering of state and now fully delegates any potential buffering to the associated [Store].

This has resulted in a pretty big memory improvement for direct users of [watcher], but also (somewhat unintuitively) for users of reflectors and stores.

Why does this change improve all cases? Why did we buffer in the first place?

## Runtime Memory Performance

The memory profile of any application using `kube::runtime` is often dominated by buffers of the Kubernetes objects it needs to watch. The main offender is the [reflector], with a literal `type Cache<K> = Arc<RwLock<AHashMap<ObjectRef<K>, Arc<K>>>>` hiding internally as the lookup used by [Store]s and [Controller]s.

We have lots of advice on how to __reduce the size of this cache__. The [[optimization]] guide shows how to:

- **minimize what you watch** :: by constraining watch parameters with selectors
- **minimize what you ask for** :: by using [metadata_watcher] on watches that do not need the `.spec`
- **minimize what you store** :: by dropping fields before sending objects to stores

These are quick and easy steps to improve the memory profile and are worth checking out; a rough sketch combining them is shown below (the benefits of doing these will further increase in 0.92).
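For a sense of what this looks like in practice, here is a minimal sketch combining all three tips. This is only an illustration: the `Pod` type, the `app=example` selector, the specific fields dropped, and the `tokio`/`futures`/`anyhow` scaffolding are placeholder choices, not recommendations.

```rust
use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{metadata_watcher, reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);
    let (reader, writer) = reflector::store();

    // 1. minimize what you watch: constrain the watch with a label selector
    let cfg = watcher::Config::default().labels("app=example");
    // 2. minimize what you ask for: metadata_watcher skips .spec and .status entirely
    let mut stream = metadata_watcher(pods, cfg)
        // 3. minimize what you store: strip bulky fields before they reach the store
        .modify(|obj| {
            obj.metadata.managed_fields = None;
            obj.metadata.annotations = None;
        })
        .reflect(writer)
        .applied_objects()
        .boxed();

    // Drive the stream; `reader` serves the slimmed-down cache to other tasks.
    while let Some(obj) = stream.try_next().await? {
        println!("saw {:?} (cache size: {})", obj.metadata.name, reader.state().len());
    }
    Ok(())
}
```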
Improving what is stored in the `Cache` above is important, but it is not the full picture...

## The Watch API

The Kubernetes watch API is an interesting beast. You have no guarantees you'll get every event, and you must be able to restart from a potentially newer checkpoint without being told what changed during the downtime. This is mentioned briefly in [Kubernetes API concepts](https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes) as an implication of its `410 Gone` responses.

When `410 Gone` responses happen we need to trigger a re-list, and wait for all the data to come through before we are back in a "live watching" mode that is caught up with reality. This type of API consumption is problematic when you need to work with reflectors/caches, where you are generally storing __complete__ snapshots in memory for a worker task. Controllers are effectively forced to treat every event as a potential change, and chase [[reconciler#idempotency]] as a workaround for not having guaranteed delivery.

Let's focus on caches. To simplify these problems for users we have built certain guarantees into the abstractions of `kube::runtime`.

## Guarantees

The [watcher] up until 0.92.0 maintained a guarantee we have casually referred to as __watcher atomicity__:

!!! note "watcher atomicity < 0.92.0"

    You only see a `Restarted` event on re-lists once every object has been received through `api.list` calls.
    Watcher events will pause between a de-sync / restart and the eventual `Restarted`. See [watcher::Event@0.91](https://docs.rs/kube/0.91.0/kube/runtime/watcher/enum.Event.html).

This property meant that stores could in turn provide their own guarantee very easily:

!!! note "Store completeness"

    [Store] always presents the full state once initialised. During a relist, the previous state is presented.
    There is no downtime for a store during relists, and its `Cache` is replaced __atomically__ in a single locked step.

This property is needed by Controllers that rely on complete information, and it kicks in once the future from [Store::wait_until_ready] resolves.

## Consequences

If all the buffering is done on the `watcher` side, then achieving the store completeness guarantee is a trivial task.

Up until 0.91 this was handled in [`Store::apply_watcher_event@0.91`](https://github.com/kube-rs/kube/blob/5dbae3a18c14a225d2d993b9effd16147fef420e/kube-runtime/src/reflector/store.rs#L96-L121) as:

```rust
// 0.91 source:
    match event {
        watcher::Event::Applied(obj) => {
            let key = obj.to_object_ref(self.dyntype.clone());
            let obj = Arc::new(obj.clone());
            self.store.write().insert(key, obj);
        }
        watcher::Event::Deleted(obj) => {
            let key = obj.to_object_ref(self.dyntype.clone());
            self.store.write().remove(&key);
        }
        watcher::Event::Restarted(new_objs) => {
            let new_objs = new_objs
                .iter()
                .map(|obj| (obj.to_object_ref(self.dyntype.clone()), Arc::new(obj.clone())))
                .collect::<AHashMap<_, _>>();
            *self.store.write() = new_objs;
        }
    }
```

Thus, on a relist/restart:

1. watcher pages were [buffered internally](https://github.com/kube-rs/kube/blob/5dbae3a18c14a225d2d993b9effd16147fef420e/kube-runtime/src/watcher.rs#L119-L124)
2. the `Restarted` arm was entered, where each object was cloned while building `new_objs`
3. the store (containing the complete old data) was swapped at the very end

so there was a moment with **3x** potential peak memory use (**2x** should have been the max).

On top of that, the buffer in the `watcher` was not always released (quote from [discord](https://discord.com/channels/500028886025895936/1234736869317673022)):

> The default system allocator never returns the memory to the OS after the burst, even if the objects are dropped. Since the initial list fetch happens sporadically you get a higher RSS usage together with the memory spike. Solving the burst will solve this problem, and reflectors and watchers can be started in parallel without worrying of OOM killers.
> The allocator does not return the memory to the OS since it treats it as a cache. This is mitigated by using jemalloc with some tuning, however, you still get the memory burst so our solution was to use jemalloc + start the watchers sequentially. As you can imagine it's not ideal.

So in the end you had memory usage that was actually holding on to between 2x and 3x the actual store size at all times.

!!! note "watcher guarantee was designed for the store guarantee"

    If you were using `watcher` without `reflector`, you were the most affected by this excessive caching. You might not have needed __watcher atomicity__, as it was primarily designed to facilitate __store completeness__.
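To make that concrete, here is a minimal store-less consumer (a sketch only; the `Pod` type and `anyhow` error handling are illustrative). A consumer like this never needed the relist buffering, yet still paid for it before 0.92:

```rust
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // No reflector and no Store: events are handled as they arrive,
    // yet before 0.92 the watcher still buffered whole relist pages internally.
    watcher(pods, watcher::Config::default())
        .applied_objects()
        .try_for_each(|pod| async move {
            println!("saw {:?}", pod.metadata.name);
            Ok(())
        })
        .await?;
    Ok(())
}
```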
## Change in 0.92

The change in 0.92.0 is primarily to **stop buffering events in the `watcher`**, and to present __new watcher events__ that allow a store to uphold the store completeness guarantee.

As it stands, [`Store::apply_watcher_event@0.92`](https://github.com/kube-rs/kube/blob/0ac1d07d073cc261af767c7f2b9bbf0629fca323/kube-runtime/src/reflector/store.rs#L99-L136) is now slightly smarter and achieves the same guarantee:

```rust
// 0.92 source
    match event {
        watcher::Event::Apply(obj) => {
            let key = obj.to_object_ref(self.dyntype.clone());
            let obj = Arc::new(obj.clone());
            self.store.write().insert(key, obj);
        }
        watcher::Event::Delete(obj) => {
            let key = obj.to_object_ref(self.dyntype.clone());
            self.store.write().remove(&key);
        }
        watcher::Event::Init => {
            self.buffer = AHashMap::new();
        }
        watcher::Event::InitApply(obj) => {
            let key = obj.to_object_ref(self.dyntype.clone());
            let obj = Arc::new(obj.clone());
            self.buffer.insert(key, obj);
        }
        watcher::Event::InitDone => {
            let mut store = self.store.write();
            std::mem::swap(&mut *store, &mut self.buffer);
            self.buffer = AHashMap::new();
            // ...
        }
    }
```

Thus, on a restart, objects are passed one-by-one up to the store and buffered there. When all objects have been received, the buffers are swapped (meaning you use at most 2x the data). The blank buffer re-assignment [also forces de-allocation](https://github.com/kube-rs/kube/pull/1494#discussion_r1602840218) of the temporary `self.buffer`.

!!! note "Preparing for StreamingLists"

    Note that the new partial `InitApply` event only passes up __individual__ objects, not pages. This is to prepare for the [1.27 Alpha StreamingLists](https://kubernetes.io/docs/reference/using-api/api-concepts/#streaming-lists) feature, which also sends individual objects. Once this becomes available for even our minimum supported [[kubernetes-version]], we can make it the default, reducing page buffers further by exposing the literal API results rather than pages (of [500 objects by default](https://docs.rs/kube/0.91.0/kube/runtime/watcher/struct.Config.html#structfield.page_size)). In the meantime, we send pages through item-by-item to avoid a breaking change in the future (and also to avoid exposing the confusing concept of flattened/unflattened streams).

## Results

The initial setup saw **60% improvements** in [synthetic benchmarks](https://github.com/kube-rs/kube/pull/1494#issue-2292501600) when using stores, and **upwards of 80%** when not using stores (when there's nothing to cache), with further incremental improvements when using the `StreamingList` strategy.

I have seen [50% drops myself in real-world controllers](https://github.com/kube-rs/kube/pull/1494#issuecomment-2126694967). YMMV, particularly if you are doing a lot of other work, but please [reach out](https://discord.gg/tokio) with more results.

## Thoughts for the future

The fact that you can get >80% improvements from not using stores hints at a further future optimization: allowing users to opt out of the "store completeness" guarantee.

!!! note "Store Tradeoffs"

    It is possible to build custom stores that avoid buffering objects on restarts by dropping the store completeness guarantee. This is not practical yet for `Controller` uses, due to requirements on the `Store` types, but perhaps this could be made generic/opt-out in the future.
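As a rough illustration of that tradeoff, a hypothetical write-through cache (not something kube-rs ships; the name and structure are made up for this sketch) could apply the new events directly, at the cost of readers observing partial state during a relist:

```rust
use std::{collections::HashMap, sync::Arc};
use k8s_openapi::api::core::v1::Pod;
use kube::runtime::{
    reflector::{Lookup, ObjectRef},
    watcher::Event,
};

/// Hypothetical store that trades completeness for memory: objects are written
/// straight into the live map, so no secondary buffer is held during relists,
/// but readers may observe partial state until the relist finishes.
#[derive(Default)]
struct WriteThroughCache {
    store: HashMap<ObjectRef<Pod>, Arc<Pod>>,
}

impl WriteThroughCache {
    fn apply_watcher_event(&mut self, event: &Event<Pod>) {
        match event {
            Event::Apply(obj) | Event::InitApply(obj) => {
                // Write through immediately instead of buffering until InitDone.
                let key = obj.to_object_ref(());
                self.store.insert(key, Arc::new(obj.clone()));
            }
            Event::Delete(obj) => {
                let key = obj.to_object_ref(());
                self.store.remove(&key);
            }
            // Without a buffer there is nothing to swap atomically here; objects
            // deleted during the watcher's downtime would need a separate sweep.
            Event::Init | Event::InitDone => {}
        }
    }
}
```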
As a step in the right direction, we would first like to get better visibility into our memory profile with some automated benchmarking. See [kube#1505](https://github.com/kube-rs/kube/issues/1505) for details.

## Breaking Change

Users who are not matching on `watcher::Event` and are not building custom stores should not need to interact with this change at all, and will get the memory improvements for free.

If you are using a custom store, **please see the new [watcher::Event]** and make the following changes to your `match` arms:

- `Applied` -> `Apply`
- `Deleted` -> `Delete`
- `Restarted` -> replaced by three new arms:
    * `Init` marks the start of a relist (allocate a temporary buffer)
    * `InitApply` carries individual objects (buffer them)
    * `InitDone` marks the end (swap the store in and deallocate the old buffer)

See the above `Store::apply_watcher_event` code for pointers.

## Previous Improvements

Memory optimization is a continuing saga, and while the numbers herein are considerable, they build upon previous work:

1. Metadata API support in [0.79.0](https://github.com/kube-rs/kube/releases/tag/0.79.0)
2. Ability to pass minified streams into `Controller` in [0.81.0](https://github.com/kube-rs/kube/releases/tag/0.81.0), documented in [[streams]]
3. `Controller::owns` relation moved to lighter metadata watches in [0.84.0](https://github.com/kube-rs/kube/releases/tag/0.84.0)
4. Default pagination of watchers in [0.84.0](https://github.com/kube-rs/kube/releases/tag/0.84.0) via [#1249](https://github.com/kube-rs/kube/pull/1249)
5. Initial streaming list support in [0.86.0](https://github.com/kube-rs/kube/releases/tag/0.86.0)
6. Removal of buffering in watcher in [0.92.0](https://github.com/kube-rs/kube/releases/tag/0.92.0) - today 🎉

Thanks to everyone who contributes to `kube`!

--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"
diff --git a/includes/links.md b/includes/links.md
index 00ffbba..f367af3 100644
--- a/includes/links.md
+++ b/includes/links.md
@@ -29,6 +29,7 @@
 [DynamicObject]: https://docs.rs/kube/latest/kube/core/struct.DynamicObject.html
 [PartialObjectMeta]: https://docs.rs/kube/latest/kube/core/struct.PartialObjectMeta.html
 [Store]: https://docs.rs/kube/latest/kube/runtime/reflector/struct.Store.html
+[Store::wait_until_ready]: https://docs.rs/kube/latest/kube/runtime/reflector/struct.Store.html#method.wait_until_ready
 [reflector]: https://docs.rs/kube/latest/kube/runtime/fn.reflector.html
 [watcher]: https://docs.rs/kube/latest/kube/runtime/fn.watcher.html
 [metadata_watcher]: https://docs.rs/kube/latest/kube/runtime/fn.metadata_watcher.html
diff --git a/mkdocs.yml b/mkdocs.yml
index 9896eeb..2033d5a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -110,6 +110,8 @@ nav:
 - website.md
 - tools.md
 - architecture.md
+- Blog:
+  - blog/index.md
 
 markdown_extensions:
 - attr_list # https://squidfunk.github.io/mkdocs-material/reference/images/
@@ -132,8 +134,8 @@ markdown_extensions:
 - pymdownx.betterem:
     smart_enable: all
 - pymdownx.emoji:
-    emoji_index: !!python/name:materialx.emoji.twemoji
-    emoji_generator: !!python/name:materialx.emoji.to_svg
+    emoji_index: !!python/name:material.extensions.emoji.twemoji
+    emoji_generator: !!python/name:material.extensions.emoji.to_svg
 - pymdownx.caret
 - pymdownx.critic
 - pymdownx.details
@@ -155,6 +157,7 @@ plugins:
 - search
 - roamlinks
 #- autolinks
+- blog
 - exclude:
     glob:
     - "*.tmp"