Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Accelerating Spark Workloads by With Materialized Views #886

Open
lucasvj2 opened this issue Nov 11, 2024 · 0 comments
Open

[RFC] Accelerating Spark Workloads by With Materialized Views #886

lucasvj2 opened this issue Nov 11, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@lucasvj2
Copy link

lucasvj2 commented Nov 11, 2024

Problem Overview
In modern database engines, query optimization is key to delivering fast, efficient results. Traditional approaches, such as rule-based or cost-based optimizers and distributed execution engines, work to identify the most efficient execution plan. However, these methods have limitations, particularly in complex, high-volume workloads. To address this, we propose an alternative solution: accelerating query execution through materialized views, leveraging a time-space tradeoff to improve performance.

Proposed Solution
Our approach centers around an architecture that introduces a feedback loop between query plan telemetry and workload-driven optimizations. The solution uses materialized views to precompute and store intermediate query results, which can be reused by future queries to reduce execution time. This methodology can be applied in multiple database contexts, including OpenSearch and MPP/Data Lake engines like Spark, Presto, and Trino.

Key Components of the Architecture:

  1. Workload:
    Collect telemetry data, such as execution plans, cardinality, runtime, and memory usage during query execution. These metrics will feed into the view selection process.
  2. Telemetry Preprocessing:
    Transform collected telemetry into a uniform representation to standardize data across and optimize the view selection process.
  3. Heuristic-Based View Selection:
    Analyze the workload and automatically identify sub-queries that benefit from materialization. We propose a simple heuristic algorithm that selects views based on estimated performance gains and materialization costs
  4. Materialized View Creation and Maintenance:
    Materialized views are created and maintained to ensure data consistency. These views store intermediate results (e.g., summary aggregates, sub-query results) that can be reused to speed up query execution.
    Implementation Plan:

Focus on VPC Flow Logs: We plan to use VPC flow logs as test data, inserting them into an AWS OpenSearch cluster and generate workloads to test the architecture. We'll collect telemetry by listening to plan execution and gather metrics on cardinality, time, and memory usage. This data will be used to create a workload representation.

Heuristic-Based View Selection: We aim to implement a simple, heuristic-based algorithm for view selection. This algorithm will automatically select views that offer significant performance boosts with minimal materialization costs. By focusing on simplicity, we can ensure rapid implementation and real-time optimization during query execution.

@lucasvj2 lucasvj2 added enhancement New feature or request untriaged labels Nov 11, 2024
@lucasvj2 lucasvj2 changed the title [RFC] [RFC] Accelerating Spark Workloads Nov 11, 2024
@lucasvj2 lucasvj2 changed the title [RFC] Accelerating Spark Workloads [RFC] Accelerating Spark Workloads by Materializing Views Nov 11, 2024
@lucasvj2 lucasvj2 changed the title [RFC] Accelerating Spark Workloads by Materializing Views [RFC] Accelerating Spark Workloads by With Materialized Views Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants