-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Cost-effective materialized view for high cardinality data #765
Labels
Comments
dai-chen
added
enhancement
New feature or request
untriaged
Core:MV
and removed
untriaged
labels
Oct 10, 2024
Proof of Concept: Approximate Aggregation ApproachGoalsVerify the feasibility of the Approximate Aggregation MV approach and evaluate its impact on storage, performance and cost, specifically including:
DesignSyntax: CREATE MATERIALIZED VIEW vpc_flow_log_mv AS SELECT window.start AS startTime, activity, APPROX_TOP_COUNT(src_endpoint.ip, 100) AS top_k_src_ip_by_count, APPROX_TOP_COUNT(dst_endpoint.ip, 100) AS top_k_dst_ip_by_count, APPROX_TOP_SUM(src_endpoint.ip, 100) AS top_k_src_ip_by_sum, APPROX_TOP_SUM(dst_endpoint.ip, 100) AS top_k_dst_ip_by_sum, APPROX_TOP_COUNT(ARRAY(src_endpoint.ip, dst_endpoint.ip), 100) AS top_k_src_dst_ip_by_count, COUNT(*) AS total_count, SUM(traffic.bytes) AS total_bytes, SUM(traffic.packets) AS total_packets FROM vpc_flow_logs GROUP BY TUMBLE(eventTime, '1 Day'), activity_name Materialized view data: "_source": { "startTime": "2024-10-01 12:00:00", "activity": "Traffic", "top_k_src_ip_by_count": [ { "ip": "192.168.0.100", "count": 23205 }, ... ] "top_k_dst_ip_by_count": [ { "ip": "127.0.01", "count": 238 } ... ] }, OpenSearch DSL query: POST /vpc_flow_log_approx_mv/_search { "size": 0, "aggs": { "top_ips": { "nested": { "path": "top_k_src_ip_by_count" }, "aggs": { "ip_buckets": { "terms": { "field": "top_k_src_ip_by_count.ip", "size": 100, "order": { "total_count": "desc" } }, "aggs": { "total_count": { "sum": { "field": "top_k_src_ip_by_count.count" } } } } } } } } Implementation Tasks
Testing Tasks
|
Proof of Concept: OpenSearch Index OptimizationsTODO: A quick test showed a 58% reduction in index size by disabling _source and unnecessary inverted indexes. health status index uuid pri rep docs.count docs.deleted store.size pri.store.size vpc_flow_logs_mv_3 oG5_m5IB0HCHA9l1-54T 1000000 0 82.3mb 82.3mb vpc_flow_logs_mv_5 o26Em5IB0HCHA9l1F56g 1000000 0 59.2mb 59.2mb vpc_flow_logs_mv_1 _xJzm5IB1Vv8TQuVfOXQ 1000000 0 140.9mb 140.9mb # Current schema (baseline) PUT vpc_flow_logs_mv_1 { "settings": { "number_of_shards": 1, "number_of_replicas": 2 }, "mappings": { "properties": { "startTime": { "type": "date" }, "activity": { "type": "keyword" }, "src_ip": { "type": "keyword" }, "dest_ip": { "type": "keyword" }, "total_count": { "type": "long" }, "total_bytes": { "type": "long" }, "total_packets": { "type": "long" } } } } # Disable source PUT vpc_flow_logs_mv_3 { "settings": { "number_of_shards": 1, "number_of_replicas": 2 }, "mappings": { "_source": { "enabled": false }, "properties": { "startTime": { "type": "date" }, "activity": { "type": "keyword" }, "src_ip": { "type": "keyword" }, "dest_ip": { "type": "keyword" }, "total_count": { "type": "long" }, "total_bytes": { "type": "long" }, "total_packets": { "type": "long" } } } } # Combine all PUT vpc_flow_logs_mv_5 { "settings": { "number_of_shards": 1, "number_of_replicas": 2 }, "mappings": { "_source": { "enabled": false }, "properties": { "startTime": { "type": "date" }, "activity": { "type": "keyword" }, "src_ip": { "type": "ip" }, "dest_ip": { "type": "ip" }, "total_count": { "type": "long", "index": false }, "total_bytes": { "type": "long", "index": false }, "total_packets": { "type": "long", "index": false } } } } |
dai-chen
changed the title
[FEATURE] Efficient storage of high cardinality data in materialized view
[FEATURE] Cost-effective materialized view for high cardinality data
Nov 4, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Is your feature request related to a problem?
When dealing with high cardinality data, Materialized Views (MVs) can become excessively large and inefficient, leading to significant performance and cost challenges. Take VPC Flow dashboard as an example, high cardinality fields like source and destination IP pairs create significant storage challenges when using MVs. At the terabyte (TB) scale, each 1-minute window can result in hundreds of millions of rows after grouping.
Materialized view definition for VPC flow dashboard:
Cost Breakdown
The cost of maintaining MVs for high-cardinality data primarily arises from three areas:
What solution would you like?
Solution Overview
To address the challenges associated with high-cardinality data and dashboard visualizations, we are considering three primary solutions: Aggregate MV, Approximate MV, and Direct Querying. The table below provides a baseline comparison of the cost implications for each solution without additional optimizations.
Design Goals
Our primary objective is to reduce the overall cost of maintaining MVs by optimizing these three aspects — creation, storage, and query — specifically for high-cardinality fields and dashboard visualizations.
Beyond cost reduction, we aim for a balanced solution that effectively manages the trade-offs between cost, latency, and accuracy. This involves addressing the specific challenges of each approach, such as reducing storage requirements in Aggregate MV, enhancing accuracy in Approximate MV, and improving query latency in Direct Querying.
Proposed Path Forward [TBD]
This recommended path forward includes both short-term enhancements to the existing aggregate MV and longer-term solutions that better align with product goals. For more details on each approach, refer to the following Additional Context section.
Short-Term (Aggregate MV)
Long-Term
Long-Term (Product Alignment Required)
Note: The following section provides detailed technical context. Continue if you’re interested in specifics ...
Do you have any additional context?
(I) Aggregate Materialized View
The most straightforward approach follows the core concept of an aggregate materialized view, with further optimizations reducing its size and cost while maintaining accuracy.
a) Partial Materialized View with Backfilling
User Workflow
Cons
Technical Challenges
b) OpenSearch Index Optimizations
Regardless of the chosen approach, additional optimizations can be applied within the OpenSearch index to further reduce both indexing computation and storage requirements. It's essential to validate these optimizations through real testing to ensure they translate into measurable cost reductions.
Disable
_source
: Disabling the_source
field can reduce the index size by approximately 40%, though it comes with side effects such as the loss of certain search functionalities like highlighting. This approach sacrifices some flexibility for better storage efficiency.Disable
docvalues
for non-aggregated fields: For fields used only for filtering on the dashboard,docvalues
can be disabled to save storage, as they are not required for aggregation.Disable inverted index on non-dimension fields: If the index is primarily used to serve pre-canned dashboards, you can consider disabling the inverted index on non-dimension fields to save storage. This is particularly useful if these fields do not need to be used in filters, though they can still be used in aggregations.
Store by the most appropriate field type: Storing fields using the most fitting data types can significantly reduce the storage size. For example, instead of storing as a
keyword
field, using theip
type for IPv4 addresses which are very common in VPC flow log, CloudTrail, WAF dataset.Change compression rate: Adjusting the compression settings in OpenSearch from the default to the best compression option can further reduce the index size, at the cost of slightly slower indexing speeds.
The last item can be configured in
index_settings
option while the first two will be configurable once support for #772 is implemented.c) Materialized View on Cost-Effective Object Store
Cons
Technical Challenges
(II) Approximate Materialized View
The basic idea is to create MVs that store approximate or partial data, significantly reducing storage requirements by capturing only the most essential information while accepting trade-offs in accuracy or completeness.
Cons
Technical Challenges
(III) Direct Querying Dashboard
Perform direct queries on the source data for real-time analysis, similar to Grafana and CloudWatch Log Insights experience, without relying on pre-computed MV.
Pros
Cons
Technical Challenges
The text was updated successfully, but these errors were encountered: