
[GLUTEN-3378][CORE] Datasource V2 data lake read support #3843

Merged: 5 commits merged into apache:main on Dec 5, 2023

Conversation

liujiayi771 (Contributor) commented Nov 24, 2023

What changes were proposed in this pull request?

  • Implement Datasource V2 data lake read support based on [VL] Unified design for data lake read support in Gluten + Velox #3378.
  • Introduce a gluten-iceberg module that defines IcebergScanTransformer, which extends BatchScanExecTransformer. Although IcebergLocalFilesNode is currently identical to LocalFilesNode, it is added in preparation for a future implementation of the "delete file" functionality.
  • ScanTransformerFactory constructs the various types of ScanTransformer. For Iceberg, IcebergScanTransformer is constructed through a service loader so that gluten-core does not depend on gluten-iceberg (see the sketch after this list).
  • The logic that pushes down Filter conditions to BatchScan's runtimeFilters has been removed from the original code. It was unnecessary because BatchScan only supports DPP's runtime filter for partition filtering.
  • The logic for executing subqueries has been moved into the transformDynamicPruningExpr method, so that BatchScan and FileSourceScan can share it, as BatchScan also requires it.
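
A minimal sketch of the ScanTransformerFactory dispatch mentioned in the third bullet. The object name, the builder function parameters, and the Iceberg scan class name are assumptions for illustration only; the real factory returns Gluten's BatchScanExecTransformer / IcebergScanTransformer rather than a plain SparkPlan.

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

object ScanDispatchSketch {
  // Assumed value: the class name of Iceberg's DataSource V2 scan implementation.
  private val IcebergScanClassName =
    "org.apache.iceberg.spark.source.SparkBatchQueryScan"

  // gluten-core only inspects the scan's class name, so it needs no compile-time
  // dependency on gluten-iceberg; the Iceberg branch delegates to a builder
  // provided by the service-loaded plug-in.
  def transform(
      batchScan: BatchScanExec,
      buildIcebergTransformer: BatchScanExec => SparkPlan,
      buildDefaultTransformer: BatchScanExec => SparkPlan): SparkPlan =
    batchScan.scan match {
      case scan if scan.getClass.getName == IcebergScanClassName =>
        buildIcebergTransformer(batchScan)
      case _ =>
        buildDefaultTransformer(batchScan)
    }
}
```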

How was this patch tested?

  • Add VeloxTPCHIcebergSuite and VeloxIcebergSuite.
  • The modifications to the interfaces can be tested using the existing CI.


Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Could you also rename the commit message and pull request title to follow this format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Run Gluten Clickhouse CI

4 similar comments

liujiayi771 (Contributor, Author)

@YannByron @yma11 @rui-mo Could you help review?


Run Gluten Clickhouse CI

}
val scan = batchScanExec.scan
scan match {
case _ if scan.getClass.getName == IcebergScanClassName =>
Contributor

IIUC, supporting a new Scan type (e.g. adding a non-Iceberg data source) will require modifying this match code, as well as the supportedBatchScan() method below.
Would it please be possible to allow other plugins to register themselves, so that adding a new format will not require changing ScanTransformerFactory code?

Contributor Author

@rz-vastdata We can achieve this using the Service Loader, similar to DataSourceRegister in Spark. However, the current ScanTransformer in Gluten extends Spark's ScanExec and requires a constructor with parameters. This makes it a bit difficult to handle with Service Loader. I will think about whether there is another way to use the Service Loader.
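
For reference, the Spark mechanism being alluded to is org.apache.spark.sql.sources.DataSourceRegister: implementations are listed in a META-INF/services file and resolved by short name. Because java.util.ServiceLoader instantiates registered classes through a public no-argument constructor, a transformer whose constructor takes the scan's fields cannot be loaded directly; the service-loaded class has to act as a no-arg factory. A rough sketch with a hypothetical register class:

```scala
import java.util.ServiceLoader

import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister

// Hypothetical register, for illustration only. ServiceLoader can instantiate it
// because it has a no-arg constructor; a real setup would also list this class in
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
class IcebergRegisterSketch extends DataSourceRegister {
  override def shortName(): String = "iceberg"
}

object RegisterLookupSketch {
  // Mirrors how Spark resolves a format name to a registered data source.
  def lookup(format: String): Option[DataSourceRegister] =
    ServiceLoader
      .load(classOf[DataSourceRegister])
      .asScala
      .find(_.shortName().equalsIgnoreCase(format))
}
```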


public class IcebergLocalFilesNode extends LocalFilesNode {

class DeleteFile {
Contributor

Not sure if it's used by this PR...

Contributor Author

I didn't use it. This is a TODO.

YannByron (Contributor)

My fault in the design. I noticed that there are many modifications that rename XXXExecTransformer to XXXTransformer. Maybe we can use XXXExecTransformer directly as these class names to avoid that.

yma11 (Contributor) commented Nov 27, 2023

My fault in the design. I noticed that there are many modifications that rename XXXExecTransformer to XXXTransformer. Maybe we can use XXXExecTransformer directly as these class names to avoid that.

+1

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.sources.BaseRelation

trait DatasourceScanTransformer extends BaseScanTransformer {
Contributor

Is this added just to keep consistent with vanilla Spark? If so, I think we can drop this level of the hierarchy so the inheritance is not so complicated.

Contributor Author

Delta will extend this trait.
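
For illustration of why the extra level might be kept: both the existing file-source transformer and a future Delta transformer could share these members. The member list below is guessed from the imports in the snippet above and from vanilla Spark's DataSourceScanExec; it is not taken from this PR.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.sources.BaseRelation

// Stand-in for Gluten's BaseScanTransformer, whose members are not shown in this thread.
trait BaseScanTransformerSketch

// Shared parent for DataSource V1 scans, mirroring Spark's DataSourceScanExec;
// a future DeltaScanTransformer would extend it alongside the file-source transformer.
trait DatasourceScanTransformerSketch extends BaseScanTransformerSketch {
  def relation: BaseRelation
  def tableIdentifier: Option[TableIdentifier]
}
```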

@@ -52,7 +52,7 @@ class HiveTableScanExecTransformer(
relation: HiveTableRelation,
partitionPruningPred: Seq[Expression])(session: SparkSession)
extends HiveTableScanExec(requestedAttributes, relation, partitionPruningPred)(session)
-  with BasicScanExecTransformer {
+  with BaseScanTransformer {
Contributor

Better to still use BasicScanExecTransformer as this change is not necessary.

Contributor Author

OK. I will revert these changes first.

import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.types.StructType

trait BaseDataSource extends SupportFormat {
Contributor

It seems BasicScanTransformer is the only implementation of SupportFormat; maybe you can add the fields directly?

Contributor Author

Add this interface to BaseDataSource?

Contributor

I mean SupportFormat.


// TODO: Add delete file support for MOR iceberg table

IcebergLocalFilesNode(
yma11 (Contributor) commented Nov 27, 2023

Will we also need to use the service loader for this LocalFilesNode serialization? I am not sure how much IcebergLocalFilesNode and DeltaLocalFilesNode will have in common.

Contributor Author

No, it is not used by gluten-core. The incremental parts of the Delta and Iceberg definitions of MOR tables are expected to differ significantly.

Contributor

There should be a toProtobuf() method for IcebergLocalFilesNode, and it will be called in gluten-core/substrait/../PlanNode for serialization. Otherwise, how do you plan to pass these newly added fields?

Contributor Author

The current PR does not involve delete files. To support delete files in the future, we will need to implement a specific toProtobuf method and modify the proto file.


Run Gluten Clickhouse CI

@@ -70,11 +70,11 @@ class HiveTableScanExecTransformer(

override def getPartitions: Seq[InputPartition] = partitions

override def getPartitionSchemas: StructType = relation.tableMeta.partitionSchema
Contributor Author

@yma11 I think the names of these interfaces need to be changed. Here, you can refer to the corresponding interfaces in Spark, and we should not use plurals.

Run Gluten Clickhouse CI

liujiayi771 (Contributor, Author)

@yma11 @YannByron @rz-vastdata I have replaced the reflection-based approach with the service loader. Could you please help review?

Run Gluten Clickhouse CI

liujiayi771 marked this pull request as ready for review on November 29, 2023 12:22
* their data source v2 transformer. This allows users to give the data source v2 transformer alias
* as the format type over the fully qualified class name.
*/
trait DataSourceV2TransformerRegister {
YannByron (Contributor) commented Nov 30, 2023

For now, it's OK. When we support other data sources based on DataSource V1, I'll update this.

liujiayi771 (Contributor, Author) commented Nov 30, 2023

Yes. DataSource V1 has different interface parameters.
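
To make the registration idea concrete, here is a sketch of a register trait plus a small registry that caches the service-loaded plug-in per format alias, in the spirit of the dataSourceV2TransformerMap quoted below. The member names are assumptions; the PR's actual DataSourceV2TransformerRegister may declare different ones.

```scala
import java.util.ServiceLoader
import java.util.concurrent.ConcurrentHashMap

import scala.collection.JavaConverters._

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Placeholder for the PR's DataSourceV2TransformerRegister. The alias returned by
// shortName() is the "format type" mentioned in the scaladoc above; a gluten-iceberg
// implementation would be listed in a META-INF/services resource file.
trait DataSourceV2TransformerRegisterSketch {
  def shortName(): String
  def createDataSourceV2Transformer(batchScan: BatchScanExec): SparkPlan
}

object DataSourceV2TransformerRegistrySketch {
  // Caches the resolved register per alias so the service-loader scan runs once per format.
  private val registry =
    new ConcurrentHashMap[String, DataSourceV2TransformerRegisterSketch]()

  def lookup(format: String): Option[DataSourceV2TransformerRegisterSketch] =
    Option(
      registry.computeIfAbsent(
        format.toLowerCase,
        _ =>
          ServiceLoader
            .load(classOf[DataSourceV2TransformerRegisterSketch])
            .asScala
            .find(_.shortName().equalsIgnoreCase(format))
            .orNull))
}
```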


private val dataSourceV2TransformerMap = new ConcurrentHashMap[String, Class[_]]()

def createFileSourceScanTransformer(
Contributor

Can we combine the two createFileSourceScanTransformer methods?

Contributor Author

They were initially combined, but combining them requires adding some optional Option parameters.


Run Gluten Clickhouse CI

liujiayi771 changed the title from [WIP][GLUTEN-3378][CORE] Datasource V2 data lake read support to [GLUTEN-3378][CORE] Datasource V2 data lake read support on Nov 30, 2023

#3378


Run Gluten Clickhouse CI

rui-mo requested a review from zzcclp on December 1, 2023 03:02
ScanTransformerFactory.createFileSourceScanTransformer(
fileSourceScan,
reuseSubquery,
extraFilters = leftFilters)
case batchScan: BatchScanExec =>
rui-mo (Contributor) commented Dec 1, 2023

Thanks for your fix. Can we avoid the extra filter pushdown for BatchScan by not calling applyFilterPushdownToScan for it?

https://github.com/oap-project/gluten/blob/c531abd94045db71a8f8ef692e5c5a80cbcd118f/gluten-core/src/main/scala/io/glutenproject/extension/ColumnarOverrides.scala#L140-L145

liujiayi771 (Contributor, Author) commented Dec 2, 2023

Fixed this in d879fd0.
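
The shape of that fix, sketched with placeholder builder functions (the real code lives in ColumnarOverrides and ScanTransformerFactory): only FileSourceScanExec goes through the extra filter pushdown, while BatchScanExec is transformed directly since it only supports DPP runtime filters for partition pruning.

```scala
import org.apache.spark.sql.execution.{FileSourceScanExec, SparkPlan}
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

object FilterPushdownSketch {
  // pushdownAndTransform stands in for applyFilterPushdownToScan followed by the
  // FileSourceScan transformer; transformBatchScan builds the BatchScan transformer
  // without any extra filter pushdown.
  def replaceScan(
      plan: SparkPlan,
      pushdownAndTransform: FileSourceScanExec => SparkPlan,
      transformBatchScan: BatchScanExec => SparkPlan): SparkPlan =
    plan match {
      case fs: FileSourceScanExec => pushdownAndTransform(fs)
      case bs: BatchScanExec => transformBatchScan(bs)
      case other => other
    }
}
```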

github-actions bot commented Dec 1, 2023

Run Gluten Clickhouse CI

yma11 (Contributor) left a comment

Hi @liujiayi771, thanks for your update!

yma11 merged commit a462434 into apache:main on Dec 5, 2023
17 checks passed
GlutenPerfBot (Contributor)

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_master_12_05_2023_time.csv log/native_master_12_04_2023_94c91c55d_time.csv difference percentage
q1 34.72 34.65 -0.068 99.80%
q2 25.00 24.91 -0.099 99.61%
q3 37.98 36.38 -1.594 95.80%
q4 38.25 37.37 -0.886 97.69%
q5 72.06 72.63 0.567 100.79%
q6 5.37 6.87 1.504 128.03%
q7 82.30 85.44 3.137 103.81%
q8 86.96 87.86 0.910 101.05%
q9 126.92 124.52 -2.405 98.10%
q10 45.43 46.04 0.613 101.35%
q11 20.31 20.12 -0.183 99.10%
q12 27.02 26.71 -0.313 98.84%
q13 47.15 46.45 -0.698 98.52%
q14 19.01 14.55 -4.464 76.52%
q15 29.59 28.16 -1.428 95.17%
q16 15.81 15.75 -0.059 99.63%
q17 103.87 103.10 -0.768 99.26%
q18 150.70 150.63 -0.066 99.96%
q19 14.53 12.90 -1.626 88.80%
q20 28.18 27.73 -0.452 98.40%
q21 223.72 222.80 -0.925 99.59%
q22 13.17 13.06 -0.113 99.14%
total 1248.05 1238.63 -9.417 99.25%
