Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-3553][CH] Support bucket scan for ch backend #3618

Merged
merged 2 commits into from
Nov 8, 2023

Conversation

zzcclp
Copy link
Contributor

@zzcclp zzcclp commented Nov 5, 2023

What changes were proposed in this pull request?

Support bucket scan for ch backend, including parquet format and mergetree format.

Close #3553.

(Fixes: #3553)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Copy link

github-actions bot commented Nov 5, 2023

#3553

Copy link

github-actions bot commented Nov 5, 2023

Run Gluten Clickhouse CI

Comment on lines 233 to 235
// Make sure create a new read relId for the stream side first
// before the one of the build side, when there is not shuffle on the build side
(readRel, plan.output, true, substraitContext.nextRelId())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo @PHILO-HE please help to check, this change will make sure that it creates a new relId for the stream side before the build side, when the stream side is reading data from iterator and the build side is no shuffle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it for an existing bug? Could you explain a bit what happens with currect logic?

Nit: when there is not shuffle on the build sidd: not => no

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is a shuffle in the stream side and no shuffle in the build side, the operation orders passed to the backend are: 0: stream side read from iter, 1:build side scan, 2: build side filter, 4: join, but the orders created in the join transformer are: 0:build side scan, 1: build side filter, 2: stream side read from iter, 4: join, because stream side read from iter will be registed after creating the build side, so they are different.

Comment on lines -37 to -48
private val bucketedScan: Boolean = {
if (
relation.sparkSession.sessionState.conf.bucketingEnabled && relation.bucketSpec.isDefined
&& !disableBucketedScan
) {
val spec = relation.bucketSpec.get
val bucketColumns = spec.bucketColumnNames.flatMap(n => toAttribute(n))
bucketColumns.size == spec.bucketColumnNames.size
} else {
false
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WangGuangxin please help to review, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Nov 5, 2023

I note some code in gluten-core is changed, but CI jobs for velox backend are not triggered. It may be github action's bug that can happen for PR with many new files.

@zzcclp
Copy link
Contributor Author

zzcclp commented Nov 5, 2023

I note some code in gluten-core is changed, but CI jobs for velox backend are not triggered. It may be github action's bug that can happen for PR with many new files.

How to fix this problem, or how to trigger velox ci manually ?

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Nov 5, 2023

Copy link

github-actions bot commented Nov 5, 2023

Run Gluten Clickhouse CI

@zzcclp
Copy link
Contributor Author

zzcclp commented Nov 5, 2023

Maybe, you can try to temporarily remove the paths filter:

https://github.com/oap-project/gluten/blob/16e2874502d0f794d3a79bf44875fcc2ca1c3c0d/.github/workflows/velox_be.yml#L20

It works, thanks. After passed, I will revert back

Copy link

github-actions bot commented Nov 5, 2023

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Nov 5, 2023

Run Gluten Clickhouse CI

Comment on lines 20 to 38
paths:
- '.github/**'
- 'pom.xml'
- 'backends-velox/**'
- 'gluten-celeborn/**'
- 'gluten-core/**'
- 'gluten-data/**'
- 'gluten-ut/**'
- 'shims/**'
- 'tools/gluten-it/**'
- 'tools/gluten-te/**'
- 'ep/build-arrow/**'
- 'ep/build-velox/**'
- 'cpp/*'
- 'cpp/CMake/**'
- 'cpp/velox/**'
- 'cpp/core/**'
- 'dev**'
# - 'substrait/substrait-spark/**'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will revert after ci passed

Copy link

github-actions bot commented Nov 5, 2023

Run Gluten Clickhouse CI

@zhouyuan
Copy link
Contributor

zhouyuan commented Nov 6, 2023

May be due to the old github action checkout used: actions/checkout@v2. Will update this in following commits.

I note some code in gluten-core is changed, but CI jobs for velox backend are not triggered. It may be github action's bug that can happen for PR with many new files.

How to fix this problem, or how to trigger velox ci manually ?

Copy link

github-actions bot commented Nov 7, 2023

Run Gluten Clickhouse CI

@zzcclp
Copy link
Contributor Author

zzcclp commented Nov 7, 2023

@PHILO-HE @rui-mo please help to check why all velox ci were failed.

Copy link

github-actions bot commented Nov 8, 2023

Run Gluten Clickhouse CI

PHILO-HE
PHILO-HE previously approved these changes Nov 8, 2023
Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks this PR can increase ~70MB repo size due to the introduced binary files for test. Maybe, in the future, we can build a dedicated test repo for gluten and let it keep such binary files. cc @FelixYBW @zhouyuan
It's OK to me to merge this PR firstly. Thanks!

- 'cpp/velox/**'
- 'cpp/core/**'
- 'dev**'
# - 'substrait/substrait-spark/**'
Copy link
Contributor

@rui-mo rui-mo Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to revert these changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to revert these changes?

yes, it needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Support bucket scan for ch backend, including parquet format and mergetree format.

Close apache#3553.
Copy link

github-actions bot commented Nov 8, 2023

Run Gluten Clickhouse CI

Copy link
Contributor

@baibaichen baibaichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@baibaichen baibaichen merged commit 3ec7bed into apache:main Nov 8, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH] Support bucket scan for ch backend
6 participants