[GLUTEN-3719][VL] Introduce VeloxIntermediateData to adjust velox agg func intermediate data #3721

Merged: 4 commits, Nov 21, 2023

@liujiayi771 liujiayi771 commented Nov 15, 2023

What changes were proposed in this pull request?

When the order or types of an aggregation function's intermediate data differ between Velox and Spark, special handling is required. This PR introduces a VeloxIntermediateData object to handle those cases. For functions without such differences, we continue to use the function's aggBufferAttributes with no special matching. Future PRs will move the remaining handling for these aggregation functions into VeloxIntermediateData.

The applyExtractStruct function contained a lot of code that matched the intermediate data output by Velox to the columns of Spark's agg buffer. That code relied on many hard-coded index reorderings, which made it hard to read and hard to see why a given order was necessary. In the following example, the significance of the order [1, 4, 5, 0, 2, 3] is difficult to understand:

case _: Corr =>
  // Select count from Velox struct with count casted from LongType into DoubleType.
  expressionNodes.add(
    ExpressionBuilder
      .makeCast(
        ConverterUtils.getTypeNode(DoubleType, nullable = false),
        ExpressionBuilder.makeSelection(colIdx, 1),
        SQLConf.get.ansiEnabled))
  expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 4))
  expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 5))
  expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 0))
  expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 2))
  expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 3))

This PR replaces all of these hard-coded code segments with logic based on VeloxIntermediateData.
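As a rough illustration of where an order such as [1, 4, 5, 0, 2, 3] can come from, the sketch below derives it by looking up each Spark buffer field name in the Velox field order. Both name lists are assumptions chosen for illustration and are not quoted from this PR; `CorrOrderSketch` and `adjustedOrder` are hypothetical names.

```scala
object CorrOrderSketch {
  // Assumed field order of Spark's Corr aggregation buffer (illustrative only).
  val sparkOrder: Seq[String] = Seq("n", "xAvg", "yAvg", "ck", "xMk", "yMk")

  // Assumed field order of the Velox intermediate struct (illustrative only).
  val veloxOrder: Seq[String] = Seq("ck", "n", "xMk", "yMk", "xAvg", "yAvg")

  // For each Spark field, find the position of the same-named field in Velox.
  def adjustedOrder(spark: Seq[String], velox: Seq[String]): Seq[Int] =
    spark.map(name => velox.indexOf(name))

  def main(args: Array[String]): Unit =
    // Prints [1, 4, 5, 0, 2, 3] for the two assumed orders above.
    println(adjustedOrder(sparkOrder, veloxOrder).mkString("[", ", ", "]"))
}
```

With the order written as two named lists, the reason for each index is visible from the field names instead of being buried in literal selections.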

How was this patch tested?

Existing CI.

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

Run Gluten Clickhouse CI

@liujiayi771 liujiayi771 changed the title from "[WIP][GLUTEN-3719][VL] Remove case match in getIntermediateTypeNode" to "[GLUTEN-3719][VL] Introduce VeloxIntermediateData to adjust velox agg func intermediate data" on Nov 16, 2023
#3719

@liujiayi771
Contributor Author

@rui-mo @PHILO-HE Could you help review this PR?

@liujiayi771 liujiayi771 marked this pull request as ready for review November 16, 2023 06:10
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 2))
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 3))
expressionNodes.add(ExpressionBuilder.makeSelection(colIdx, 0))
case _ @VeloxIntermediateData.Type(veloxTypes: Seq[DataType]) =>
Contributor

I'm confused here. Why can we match aggregateFunction against VeloxIntermediateData.Type?

Contributor Author

The pattern calls the VeloxIntermediateData.Type.unapply method to extract veloxTypes from aggFunc. It is roughly equivalent to calling VeloxIntermediateData.Type.unapply(aggFunc) and, if the call succeeds, binding the extracted value to veloxTypes.
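A minimal sketch of how such an extractor pattern works in Scala may help here. It is not the code from this PR: `AggFunc`, `Corr`, `Sum`, `IntermediateType`, and the placeholder type names are hypothetical stand-ins, used only to show that any object with a suitable `unapply` method can appear in a `case` pattern.

```scala
object ExtractorSketch {
  // Hypothetical stand-ins for aggregate function classes.
  sealed trait AggFunc
  case object Corr extends AggFunc
  case object Sum extends AggFunc

  // Hypothetical extractor object: it is not an AggFunc itself.
  object IntermediateType {
    // Some(...) makes the pattern match and binds the value;
    // None makes the match fall through to the next case.
    def unapply(f: AggFunc): Option[Seq[String]] = f match {
      case Corr => Some(Seq("typeA", "typeB")) // placeholder type names
      case _    => None
    }
  }

  def describe(f: AggFunc): String = f match {
    // The compiler rewrites this pattern into a call to IntermediateType.unapply(f).
    case IntermediateType(types) => s"special: ${types.length} fields"
    case _                       => "default"
  }
}
```

In this sketch, `ExtractorSketch.describe(ExtractorSketch.Corr)` takes the first branch and `ExtractorSketch.describe(ExtractorSketch.Sum)` falls through to the default, even though neither value is an instance of `IntermediateType`.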

val (sparkOrders, sparkTypes) =
aggFunc.aggBufferAttributes.map(attr => (attr.name, attr.dataType)).unzip
val veloxOrders = VeloxIntermediateData.veloxIntermediateDataOrder(aggFunc)
val adjustedOrders = sparkOrders.map(veloxOrders.indexOf(_))
Contributor

Is it enough to decide the order based on name equality? E.g., if attr.name contains an exprId suffix, would it fail to match the string in veloxOrders?
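The concern can be made concrete with a small sketch. `Seq.indexOf` returns -1 when no element is equal to its argument, so a name that differs in any way would silently produce an invalid index. The helper below is hypothetical (it is not part of this PR) and reports unmatched names instead; the suffixed name in the usage note is an invented example, not an observed value of attr.name.

```scala
object NameMatchSketch {
  // Returns Right(indices) when every Spark name is found in the Velox order,
  // otherwise Left with the names that could not be matched.
  def adjustedOrder(
      spark: Seq[String],
      velox: Seq[String]): Either[Seq[String], Seq[Int]] = {
    val missing = spark.filterNot(name => velox.contains(name))
    if (missing.nonEmpty) Left(missing)
    else Right(spark.map(name => velox.indexOf(name)))
  }
}
```

For example, `NameMatchSketch.adjustedOrder(Seq("n#12", "ck"), Seq("ck", "n"))` yields `Left(Seq("n#12"))`, whereas a plain `indexOf` would have returned -1 for the first name.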

Contributor Author

@liujiayi771 liujiayi771 Nov 21, 2023

@@ -133,56 +123,29 @@ case class HashAggregateExecTransformer(
case _ =>
throw new UnsupportedOperationException(s"${expr.mode} not supported.")
}
val aggFunc = expr.aggregateFunction
expr.aggregateFunction match {
Contributor Author

Use aggFunc defined in the previous line.

Contributor

@rui-mo rui-mo left a comment

Thanks for providing these details. Will merge after internal CI passes.

Contributor

@PHILO-HE PHILO-HE left a comment

Looks good!

@rui-mo rui-mo merged commit f29077e into apache:main Nov 21, 2023
17 checks passed
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

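| query | log/native_3721_time.csv | log/native_master_11_20_2023_60fc2a02e_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 33.26 | 34.37 | 1.114 | 103.35% |
| q2 | 24.72 | 24.70 | -0.017 | 99.93% |
| q3 | 37.66 | 35.40 | -2.258 | 94.00% |
| q4 | 36.77 | 36.40 | -0.378 | 98.97% |
| q5 | 70.70 | 68.86 | -1.838 | 97.40% |
| q6 | 6.80 | 7.01 | 0.209 | 103.07% |
| q7 | 83.25 | 85.39 | 2.142 | 102.57% |
| q8 | 86.21 | 87.44 | 1.224 | 101.42% |
| q9 | 124.48 | 124.93 | 0.452 | 100.36% |
| q10 | 44.95 | 46.83 | 1.877 | 104.18% |
| q11 | 19.35 | 19.99 | 0.643 | 103.32% |
| q12 | 24.96 | 24.57 | -0.390 | 98.44% |
| q13 | 46.07 | 44.85 | -1.221 | 97.35% |
| q14 | 15.31 | 18.77 | 3.466 | 122.65% |
| q15 | 27.70 | 27.13 | -0.566 | 97.96% |
| q16 | 15.51 | 15.41 | -0.096 | 99.38% |
| q17 | 100.30 | 102.21 | 1.913 | 101.91% |
| q18 | 145.89 | 149.83 | 3.943 | 102.70% |
| q19 | 13.02 | 13.48 | 0.460 | 103.54% |
| q20 | 27.35 | 28.10 | 0.756 | 102.76% |
| q21 | 221.79 | 220.09 | -1.694 | 99.24% |
| q22 | 13.23 | 12.89 | -0.341 | 97.42% |
| total | 1219.25 | 1228.65 | 9.399 | 100.77% |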
