Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SL-1756] [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL #1040

Open
3 tasks done
Jstein77 opened this issue Feb 21, 2024 · 1 comment
Open
3 tasks done
Labels
enhancement New feature or request linear triage Tasks that need to be triaged

Comments

@Jstein77
Copy link
Contributor

Jstein77 commented Feb 21, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, we render components of the data flow plan as sub-queries. For example, to query a simple metric called revenue, we'll first create an inner query to select the measure and dimension columns, then create an outer query to perform the metric aggregation.

  SELECT
    metric_time__day
    , SUM(revenue) AS revenue
  FROM (
    SELECT
      DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
      , product_price AS revenue
    FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
  ) subq_119
  GROUP BY
    metric_time__day

This SQL can get more complex when additional joins are needed - for example, to include a dimension, when multiple metrics are requested in a query, or when constructing derived metrics. Issue #422 has an example of a derived metric query with additional layers of nesting. We could instead use CTEs in the SQL to calculate revenue:

with subq_119 as (
     SELECT
      DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
      , product_price AS revenue
    FROM ANALYTICS.dbt_jstein.order_items order_item_src_78
  ) subq_119

SELECT
  metric_time__day
  , COALESCE(revenue, 0) AS revenue
FROM subq_119
  GROUP BY
    metric_time__day

This provides a few key benefits.

  1. Increases legibility of the SQL generated, especially for complex cases
  2. Can improve the efficiency of the query for derived metrics or queries with multiple metrics. For example, in Issue Derived & component metrics in same mf query results in complex/unnecessary joins #422, we could render the following SQL which would avoid multiple table scans of order_items:
(
with base_metrics as (
  SELECT
    DATE_TRUNC('month', ordered_at) AS metric_time__month
    , location_name
    , SUM(order_item_total) AS revenue
    , SUM(product_cost) AS costs
    , SUM(order_item_tax_paid) AS taxes
  FROM dbt_test.dbt_metrics.order_items order_items_src_0
  GROUP BY
    DATE_TRUNC('month', ordered_at)
    , location_name
)

select
 revenue
, costs
, taxes
revenue - (costs + taxes) AS net_profit
from base_metrics

Risks:

  • We would need to test how the query optimizers on different data platforms handle CTEs. There could be cases where we see query performance decline. However, in the case of derived metrics and multiple metrics queries, we should see performance gains.

Describe alternatives you've considered

Keep the current behavior. This comes with a real performance hit for companies querying multiple metrics and derived metrics. We will also be generating less legible SQL.

Who will this benefit?

This will make it easier for analytics engineers to reason about the SQL we're generating, and improve performance for complex queries.

Are you interested in contributing this feature?

No response

Anything else?

No response

SL-1755

SL-1756

@Jstein77 Jstein77 added enhancement New feature or request triage Tasks that need to be triaged labels Feb 21, 2024
@Jstein77 Jstein77 changed the title [Feature] Use CTEs instead of sub queries in generated SQL [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL Feb 21, 2024
@Jstein77 Jstein77 changed the title [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL [SL-1756] [SL-1755] [Feature] Use CTEs instead of sub queries in generated SQL Feb 21, 2024
@soham-dasgupta
Copy link

soham-dasgupta commented Sep 26, 2024

This looks to be a feasible solution to the problem of generating complicated FULL OUTER JOIN queries. Do we have eyes on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request linear triage Tasks that need to be triaged
Projects
None yet
Development

No branches or pull requests

2 participants