[RFC] Accelerating Spark Workloads by With Materialized Views #886

lucasvj2 · 2024-11-11T08:18:20Z

Problem Overview
In modern database engines, query optimization is key to delivering fast, efficient results. Traditional approaches, such as rule-based or cost-based optimizers and distributed execution engines, work to identify the most efficient execution plan. However, these methods have limitations, particularly in complex, high-volume workloads. To address this, we propose an alternative solution: accelerating query execution through materialized views, leveraging a time-space tradeoff to improve performance.

Proposed Solution
Our approach centers around an architecture that introduces a feedback loop between query plan telemetry and workload-driven optimizations. The solution uses materialized views to precompute and store intermediate query results, which can be reused by future queries to reduce execution time. This methodology can be applied in multiple database contexts, including OpenSearch and MPP/Data Lake engines like Spark, Presto, and Trino.

Key Components of the Architecture:

Workload:
Collect telemetry data, such as execution plans, cardinality, runtime, and memory usage during query execution. These metrics will feed into the view selection process.
Telemetry Preprocessing:
Transform collected telemetry into a uniform representation to standardize data across and optimize the view selection process.
Heuristic-Based View Selection:
Analyze the workload and automatically identify sub-queries that benefit from materialization. We propose a simple heuristic algorithm that selects views based on estimated performance gains and materialization costs
Materialized View Creation and Maintenance:
Materialized views are created and maintained to ensure data consistency. These views store intermediate results (e.g., summary aggregates, sub-query results) that can be reused to speed up query execution.
Implementation Plan:

Focus on VPC Flow Logs: We plan to use VPC flow logs as test data, inserting them into an AWS OpenSearch cluster and generate workloads to test the architecture. We'll collect telemetry by listening to plan execution and gather metrics on cardinality, time, and memory usage. This data will be used to create a workload representation.

Heuristic-Based View Selection: We aim to implement a simple, heuristic-based algorithm for view selection. This algorithm will automatically select views that offer significant performance boosts with minimal materialization costs. By focusing on simplicity, we can ensure rapid implementation and real-time optimization during query execution.

lucasvj2 added enhancement New feature or request untriaged labels Nov 11, 2024

lucasvj2 changed the title ~~[RFC]~~ [RFC] Accelerating Spark Workloads Nov 11, 2024

lucasvj2 changed the title ~~[RFC] Accelerating Spark Workloads~~ [RFC] Accelerating Spark Workloads by Materializing Views Nov 11, 2024

lucasvj2 changed the title ~~[RFC] Accelerating Spark Workloads by Materializing Views~~ [RFC] Accelerating Spark Workloads by With Materialized Views Nov 11, 2024

anasalkouz removed the untriaged label Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Accelerating Spark Workloads by With Materialized Views #886

[RFC] Accelerating Spark Workloads by With Materialized Views #886

lucasvj2 commented Nov 11, 2024 •

edited

Loading

[RFC] Accelerating Spark Workloads by With Materialized Views #886

[RFC] Accelerating Spark Workloads by With Materialized Views #886

Comments

lucasvj2 commented Nov 11, 2024 • edited Loading

lucasvj2 commented Nov 11, 2024 •

edited

Loading