[Discuss] Design for Tiered File Cache & Block Level fetch #9987
Comments
@shwetathareja @biesps @andrross
Thanks @ankitkala, this is great! A few questions:
Thanks @ankitkala
You might want to consider ARC for eviction strategy.
How do we ensure Tier 1 and Tier 2 caches are correctly accounted for? Thrashing will cause the mmapped file access to be moved to disk, unless you can pin/lock the file explicitly. Another important thing is: can we design it to be tier agnostic and think of the block as the basic unit of cache?
For prefetching, it might be worth fetching the top-k blocks in advance for better performance. Essentially, read the required data block plus an additional amount of data that varies depending on the access pattern. We can start by prefetching one extra block and keep doubling it on every subsequent read, up to a certain max threshold.
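A rough sketch of that doubling read-ahead heuristic (the `PrefetchPolicy` class, its method names, and the cap of 64 blocks are illustrative assumptions, not part of the proposal):

```java
// Sketch of the read-ahead doubling heuristic described above.
// The cap and the reset trigger are assumptions for illustration.
final class PrefetchPolicy {
    private static final int MAX_PREFETCH_BLOCKS = 64; // illustrative max threshold
    private int nextPrefetch = 1;                       // start with one extra block

    /** Number of extra blocks to fetch for the current read; doubles for the next one. */
    int blocksToPrefetch() {
        int current = nextPrefetch;
        nextPrefetch = Math.min(nextPrefetch * 2, MAX_PREFETCH_BLOCKS);
        return current;
    }

    /** Reset the window, e.g. when a non-sequential access breaks the pattern. */
    void reset() {
        nextPrefetch = 1;
    }
}
```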
The intent was to have different flavors of blocked index inputs (e.g. opened against a file in the remote store vs. on local disk).
Correct. In case the warm shards and hot shards are on the same node, we can definitely run into this.
Sure. Like you called out, we can definitely explore getting rid of the segmented cache in favor of a finer-grained locking mechanism.
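For illustration, one possible shape of that finer-grained locking (a sketch only, not the proposed implementation): instead of locking a whole cache segment, each load can be serialized per key via `ConcurrentHashMap.computeIfAbsent`, so concurrent loads of different blocks never contend.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: per-key locking instead of segment-wide locks.
final class FineGrainedCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();

    /** Loads each key at most once; only callers racing on the same key serialize. */
    V getOrLoad(K key, Function<K, V> loader) {
        return entries.computeIfAbsent(key, loader);
    }
}
```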
Interesting!! Let me explore this route as well.
Yeah, total usage in Tier 1 & Tier 2 is not going to be a true indicator of the actual usage.
Wanted to understand a bit more on the tier-agnostic part. IndexInputs are the basic unit of cache today. A block as the basic unit makes sense, but the downside is that we will need to store the data on disk as blocks as well.
Ack.
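As a small illustration of treating the block (rather than the whole file/IndexInput) as the cache unit, a tier-agnostic key could be derived from the file name plus a block index; the block size and field names below are assumptions, not values from the design:

```java
// Sketch of a tier-agnostic cache key where the block, not the file, is the unit.
// The 8 MB block size is an illustrative assumption.
record BlockKey(String fileName, long blockIndex) {
    static final long BLOCK_SIZE = 8L * 1024 * 1024;

    /** Maps an absolute file offset to the block that contains it. */
    static BlockKey forOffset(String fileName, long fileOffset) {
        return new BlockKey(fileName, fileOffset / BLOCK_SIZE);
    }
}
```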
If files are opened for read, there is no isolation between tier 1 and tier 2; tier 2 files can still thrash the blocks of tier 1. To minimise this we should:
Have you also evaluated the cache eviction policy, comparing a clock-based cache replacement policy against LRU? LRU may have certain deficiencies in cases like a burst of references to infrequently used blocks, such as sequential scans through large files, which may cause the replacement of frequently referenced blocks in the cache. An effective replacement algorithm would be able to prevent hot blocks from being evicted by cold blocks.
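For reference, a minimal sketch of the clock (second-chance) idea: every cached slot carries a reference bit that is set on access, the eviction hand clears bits as it sweeps, and only slots whose bit is already clear get evicted, so a one-off sequential scan is less likely to flush frequently referenced blocks than with plain LRU. The class below is an illustration, not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal clock (second-chance) eviction sketch over a list of slots.
final class ClockEviction<E> {
    private final List<E> slots = new ArrayList<>();
    private final List<Boolean> referenced = new ArrayList<>();
    private int hand = 0;

    void add(E entry) {
        slots.add(entry);
        referenced.add(Boolean.TRUE);
    }

    /** Call on a cache hit so the sweep gives this slot a second chance. */
    void touch(int slot) {
        referenced.set(slot, Boolean.TRUE);
    }

    /** Sweep the hand, clearing reference bits, and evict the first unreferenced slot. */
    E evictOne() {
        while (true) {
            if (referenced.get(hand)) {
                referenced.set(hand, Boolean.FALSE); // second chance
                hand = (hand + 1) % slots.size();
            } else {
                E victim = slots.remove(hand);
                referenced.remove(hand);
                if (hand >= slots.size()) {
                    hand = 0;
                }
                return victim;
            }
        }
    }
}
```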
I've added this as an enhancement item: #12809
Coming from the RFC for the Multi-tiered File Cache, this issue captures the low-level components and the flows.
The Multi-tiered File Cache works along with the Directory abstraction (for storing Lucene segment files), which allows Lucene indices (a.k.a. shards) to be agnostic of the data locality (memory, local disk, remote store). Internally it implements a multi-tiered cache (Tier 1: memory mapped, Tier 2: disk, Tier 3: remote store) and takes care of managing the data across all tiers as well as its movement.
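As a rough illustration of the fall-through this implies (a sketch; the interfaces and method names are assumptions, not the actual OpenSearch API), a block read consults the memory tier first, falls back to local disk, and finally fetches from the remote store while promoting the block back up:

```java
// Sketch of the tier fall-through described above; types and methods are illustrative.
interface Tier {
    byte[] get(String file, long blockIndex);            // null on miss
    void put(String file, long blockIndex, byte[] data); // store/promote into this tier
}

final class TieredFileCache {
    private final Tier memoryTier;   // Tier 1: memory mapped blocks
    private final Tier diskTier;     // Tier 2: local disk blocks
    private final Tier remoteStore;  // Tier 3: remote store, source of truth

    TieredFileCache(Tier memoryTier, Tier diskTier, Tier remoteStore) {
        this.memoryTier = memoryTier;
        this.diskTier = diskTier;
        this.remoteStore = remoteStore;
    }

    byte[] readBlock(String file, long blockIndex) {
        byte[] block = memoryTier.get(file, blockIndex);
        if (block != null) {
            return block;
        }
        block = diskTier.get(file, blockIndex);
        if (block == null) {
            // Miss in both local tiers: fetch from the remote store and persist locally.
            block = remoteStore.get(file, blockIndex);
            diskTier.put(file, blockIndex, block);
        }
        // Promote into the memory tier for subsequent reads.
        memoryTier.put(file, blockIndex, block);
        return block;
    }
}
```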
Use case:
Ideally, all types of indices (local / warm / remote-backed hot) can be managed with the File Cache, where the tiers and their promotion criteria can differ depending on the shard type. For the current scope, however, we intend to enable this support for remote-backed warm indices, as they gain the maximum benefit from this abstraction.
This component will allow us to lazy load the shard's data for reads and writes on an on-demand basis. This is helpful as it'll allow us to write data to a warm shard without loading the entire data set, thereby improving the time to recovery. Similarly, we can also explore enabling support for warm shards with 0 replicas (for non-critical indices). Another benefit (with block-level fetch) is the lower memory footprint of the shard on the node.
This component will also allow us to introduce a working set model for every shard, where the working set for a shard is defined as the set of files (or blocks of files) which the shard is currently using (for reads/writes/merges, etc.). With the working set model, shards will have the capability to lazy load the data they need into the shard's working set. Since the working set of a shard is going to be a function of time, it is expected that the files (or blocks of files) that are part of the working set will be evicted from and added to the file cache over time on an on-demand basis.
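For illustration, a working set could be tracked per shard as the set of blocks touched within a recent window; the window length and types below are assumptions for the sketch, not values from the design:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-shard working set: blocks touched within a recent time window.
final class ShardWorkingSet {
    private static final long WINDOW_MILLIS = 5 * 60 * 1000L; // assumed 5-minute window
    private final Map<String, Long> lastAccess = new ConcurrentHashMap<>(); // block id -> last access

    /** Record that a block was just read or written by the shard. */
    void touch(String blockId) {
        lastAccess.put(blockId, System.currentTimeMillis());
    }

    /** Blocks not touched within the window fall out of the working set (and become evictable). */
    void expireOld() {
        long cutoff = System.currentTimeMillis() - WINDOW_MILLIS;
        lastAccess.values().removeIf(ts -> ts < cutoff);
    }

    int size() {
        return lastAccess.size();
    }
}
```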
Caveats:
Tier 3 is the source of truth for the shard. As soon as new data gets uploaded to the remote store, the File Cache starts tracking the new files.
How it works
User Flows:
Movement across Tiers:
Future Enhancements: