[GLUTEN-7749][VL] Trim ISOControl characters in string for casting to integral type #7806

Merged: 7 commits, Nov 6, 2024

@wForget wForget commented Nov 4, 2024

What changes were proposed in this pull request?

Fixes #7749. ISO control characters are now trimmed from a string before it is cast to an integral type, to align with the Spark change introduced by apache/spark#41535.

How was this patch tested?

Added a unit test.

@github-actions github-actions bot added the CORE and VELOX labels Nov 4, 2024

github-actions bot commented Nov 4, 2024

#7749

github-actions bot commented Nov 4, 2024

Run Gluten Clickhouse CI

@PHILO-HE PHILO-HE self-requested a review November 5, 2024 07:08

github-actions bot commented Nov 5, 2024

Run Gluten Clickhouse CI on x86

wForget commented Nov 5, 2024

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch left a comment

Basically looks good to me. @PHILO-HE, any more comments?

@PHILO-HE PHILO-HE left a comment

@wForget, thanks for your fix! Could you give me some details about where these control chars are skipped when casting to integral types in Spark? I only found white space characters are checked and skipped.

@@ -723,30 +723,29 @@ class VeloxSparkPlanExecApi extends SparkPlanExecApi {
val trimParaSepStr = "\u2029"
// Needs to be trimmed for casting to float/double/decimal
val trimSpaceStr = ('\u0000' to '\u0020').toList.mkString
// ISOControl characters, refer java.lang.Character.isISOControl(int)
val isoControlControlStr = (('\u0000' to '\u001F') ++ ('\u007F' to '\u009F')).toList.mkString
Contributor

isoControlControlStr->isoControlStr, a typo?

Member Author

thanks, fixed
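
The character set built in the hunk above can be checked against `java.lang.Character.isISOControl(int)`, which the code comment cites. The following is a minimal sketch, assuming the corrected name `isoControlStr` from this review thread; `IsoControlCheck` and `matchesJdk` are hypothetical names used only for illustration and are not part of the patch.

```scala
object IsoControlCheck {
  // Same ranges as in the diff: C0 controls (U+0000 to U+001F)
  // plus DEL and the C1 controls (U+007F to U+009F).
  val isoControlStr: String =
    (('\u0000' to '\u001F') ++ ('\u007F' to '\u009F')).mkString

  // True when the string holds exactly the BMP characters that the JDK
  // classifies as ISO control, in ascending code point order.
  def matchesJdk: Boolean = {
    val fromJdk = (0 to 0xFFFF)
      .filter(cp => Character.isISOControl(cp))
      .map(_.toChar)
      .mkString
    fromJdk == isoControlStr
  }
}
```

If the two ranges are right, `IsoControlCheck.matchesJdk` should hold and the string should contain 65 characters (32 C0 controls plus 33 from DEL through U+009F).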

wForget commented Nov 6, 2024

> @wForget, thanks for your fix! Could you give me some details about where these control chars are skipped when casting to integral types in Spark? I only found white space characters are checked and skipped.

This behavior was introduced in apache/spark#41535; the relevant calls are:

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L785

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1617
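
As a rough illustration of the behavior being aligned, and not Spark's or Velox's actual code, the sketch below skips leading and trailing characters that are whitespace or ISO control before parsing the digits. `CastSketch`, `isTrimmable`, `trimForIntegralCast`, and `castToInt` are hypothetical names. The sketch covers only the trimming step: the implementations in the linked files work on UTF-8 bytes and apply further cast rules that are not modeled here.

```scala
import scala.util.Try

object CastSketch {
  // Assumption: a character is trimmable if it is whitespace or an ISO control character.
  private def isTrimmable(c: Char): Boolean =
    Character.isWhitespace(c) || Character.isISOControl(c)

  // Drop trimmable characters from both ends; characters in the middle are left untouched.
  def trimForIntegralCast(s: String): String =
    s.dropWhile(isTrimmable).reverse.dropWhile(isTrimmable).reverse

  // Returns None where the input is not a plain integer after trimming
  // (standing in for the NULL result of a failed non-ANSI cast).
  def castToInt(s: String): Option[Int] =
    Try(trimForIntegralCast(s).toInt).toOption
}
```

For example, `CastSketch.castToInt("\u0000 42\u007F")` yields `Some(42)`, while `CastSketch.castToInt("4\u00002")` yields `None` because a control character in the middle of the string is not trimmed.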

github-actions bot commented Nov 6, 2024

Run Gluten Clickhouse CI on x86

PHILO-HE commented Nov 6, 2024

@wForget, I see. What I checked was Spark 3.3.1, which doesn't include that change.

Let's note that this behavior does not apply to all supported Spark versions, but I think that may be acceptable to end users.

@PHILO-HE PHILO-HE merged commit 3099799 into apache:main Nov 6, 2024
46 of 47 checks passed

Successfully merging this pull request may close these issues.

[VL][1.2][Result mismatch] Cast string to integral type does not ignore ISO control characters