[RFC] Proposal for a Disk-based Tiered Caching Mechanism in OpenSearch #9001
Comments
Thanks for the proposal!
@Bukhtawar An off-heap tier does make sense, but it is still constrained by memory for larger datasets, whereas a disk tier is not constrained by that, the trade-off being latency. That said, as part of this work we are also considering offering an off-heap tier as an option.
I am not opposed to the disk tier; the point I am suggesting is that we use a tiered approach, heap -> off-heap -> disk, based on access patterns and space/memory constraints.
@Bukhtawar Makes sense!
I'm writing to propose a new caching approach for OpenSearch that could significantly enhance its performance.
OpenSearch is used primarily for two purposes: search use cases and log analytics.
When dealing with log analytics, there is a consistent pattern: the indexed documents are time-bound and progress forward in time. For instance, if a query computes the count of 4xx errors between two timestamps T1 and T2, with T2 in the past, the result will invariably remain the same. This attribute presents an opportunity to cache the computed result, allowing for faster retrieval and a reduced processing load.
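To make the idea concrete, here is a minimal sketch (not OpenSearch API; the query record and helper names are hypothetical) of why such a query is cacheable: once the whole time window lies in the past, the result is immutable and can be stored under a key derived from the query and its window.

```java
import java.time.Instant;

// Sketch only: a query over a fixed time window whose end lies in the past
// produces the same result on every execution, so its result can be cached
// under a stable key.
public final class TimeBoundQueryKey {

    // Hypothetical representation of "count of 4xx errors between T1 and T2".
    record StatusCountQuery(String index, int statusClass, Instant from, Instant to) {

        // The result becomes immutable only once the whole window is in the past.
        boolean isCacheable(Instant now) {
            return to.isBefore(now);
        }

        // A stable cache key; a real implementation would hash the rewritten
        // query plus shard/index identity rather than concatenating strings.
        String cacheKey() {
            return index + "|" + statusClass + "xx|" + from.toEpochMilli() + "-" + to.toEpochMilli();
        }
    }

    public static void main(String[] args) {
        var q = new StatusCountQuery("web-logs", 4,
                Instant.parse("2023-07-01T00:00:00Z"),
                Instant.parse("2023-07-02T00:00:00Z"));
        System.out.println(q.isCacheable(Instant.now()) + " -> " + q.cacheKey());
    }
}
```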
Presently, OpenSearch incorporates three types of in-memory, bounded caches: the request cache, the query cache, and the field data cache.
These caches are limited in size and subject to eviction as new or more frequent search requests demand cache space. However, this eviction mechanism may force recomputation of certain queries, adding to the system's overall overhead.
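The eviction behaviour described above can be illustrated with a small sketch (not the actual OpenSearch cache classes): an in-memory cache bounded by entry count that evicts the least-recently-used entry when full. An evicted query result must be recomputed on the next request, which is exactly the overhead this proposal aims to reduce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal illustration of a bounded, evicting in-memory cache.
public class BoundedQueryResultCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedQueryResultCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: gets refresh an entry's recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the LRU entry once the bound is hit
    }
}
```

For example, a `BoundedQueryResultCache` created with `maxEntries = 1000` silently drops the least-recently-used result on the 1001st insert; the dropped query must then be recomputed from the shard data if it is requested again.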
Given the limitations of the current caching strategy, I propose implementing an optional disk-based caching tier. This tier could leverage a remote data store (such as Amazon S3 or Azure Blob Storage), the disk on the node where the shard lives, or a combination of both.
The rationale for this proposal stems from our hypothesis that the cost of recomputation exceeds the expense of a disk seek or a call to an external store. Introducing a disk-based cache tier would significantly reduce the need for such recomputations, leading to more efficient query processing and improved system performance.
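A rough sketch of the tiered lookup, using hypothetical tier interfaces (none of these names are existing OpenSearch classes): look in the on-heap cache first, fall back to the disk tier (local disk or a remote store) on a miss, and only recompute when both tiers miss. On a heap miss the cost paid is a disk or network read instead of a full recomputation.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical tier abstraction used only for this sketch.
interface CacheTier<K, V> {
    Optional<V> get(K key);
    void put(K key, V value);
}

class TieredQueryCache<K, V> {
    private final CacheTier<K, V> heapTier;
    private final CacheTier<K, V> diskTier;

    TieredQueryCache(CacheTier<K, V> heapTier, CacheTier<K, V> diskTier) {
        this.heapTier = heapTier;
        this.diskTier = diskTier;
    }

    V computeIfAbsent(K key, Function<K, V> recompute) {
        return heapTier.get(key)
                // disk hit: promote the entry back into the heap tier
                .or(() -> diskTier.get(key).map(v -> { heapTier.put(key, v); return v; }))
                .orElseGet(() -> {
                    V value = recompute.apply(key); // both tiers missed: pay the recomputation once
                    heapTier.put(key, value);
                    diskTier.put(key, value);
                    return value;
                });
    }
}
```

Whether the disk tier backs onto the local node disk or a remote object store is a deployment choice behind the same interface, which is what allows the "combination of both" option mentioned above.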
Kindly review the proposal and provide feedback. We believe that this approach to caching would enhance OpenSearch's performance, especially in scenarios where high data throughput and fast query processing are of paramount importance.