Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-5826] Add FIND_IN_SET function (enabled in Hive and Spark libraries) #3317

Merged
merged 1 commit into from
Nov 1, 2023

Conversation

herunkang2018
Copy link
Contributor

No description provided.

@sonarcloud
Copy link

sonarcloud bot commented Jul 14, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

98.0% 98.0% Coverage
0.0% 0.0% Duplication

@@ -1076,6 +1076,32 @@ public static int levenshtein(String string1, String string2) {
return LEVENSHTEIN_DISTANCE.apply(string1, string2);
}

/** SQL FIND_IN_SET(string, stringArray) function. */
public static int findInSet(String matchStr, String textStr) {
if (matchStr.contains(",")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you add this check? Could you give the code comments to explain?

Copy link
Contributor Author

@herunkang2018 herunkang2018 Oct 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is from the function specification, which refers to Spark's doc. I will add function annotation to address the meaning of this function.

@@ -1076,6 +1076,32 @@ public static int levenshtein(String string1, String string2) {
return LEVENSHTEIN_DISTANCE.apply(string1, string2);
}

/** SQL FIND_IN_SET(string, stringArray) function. */
public static int findInSet(String matchStr, String textStr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this method is public, I think we should handle the cases where matchStr and textStr are null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For null check issue, we can define the null policy of the function, and null policy will handle null properly in code generation phase, so I think there is no need to implement null check in each function implementation additionally.

Copy link
Contributor

@JiajunBernoulli JiajunBernoulli Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @LakeShen.

Check null for public method always valuable.

And your unit test need have null parameter case. @herunkang2018

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the null check. Thanks for the review.

if (matchStr.contains(",")) {
return 0;
}
int textStrLen = textStr.length();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final int textStrLen = textStr.length();

Copy link
Contributor Author

@herunkang2018 herunkang2018 Oct 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, I will improve it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

return 0;
}
int textStrLen = textStr.length();
int matchStrLen = matchStr.length();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final int matchStrLen = matchStr.length();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

int textStrLen = textStr.length();
int matchStrLen = matchStr.length();
int n = 1;
int lastComma = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastComma -> lastCommaIndex or lastCommaPosition maybe better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

int n = 1;
int lastComma = -1;
for (int i = 0; i < textStrLen; i++) {
if (textStr.charAt(i) == ',') {
Copy link
Contributor

@LakeShen LakeShen Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now ',' is hard coding,are there some character constants that represent this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Updated.

}
}
if (textStrLen - (lastComma + 1) == matchStrLen
&& textStr.substring(lastComma + 1, textStrLen).equals(matchStr)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are more code comments would help people understand better,I think you could give more code comments in this method:)

Copy link
Contributor Author

@herunkang2018 herunkang2018 Oct 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function annotation is added.

@LakeShen
Copy link
Contributor

@herunkang2018 ,thank you for opening this PR, I left a couple of comments,overall looks good.

@herunkang2018
Copy link
Contributor Author

@LakeShen Thanks a lot for the review. All comments resolved, please help to review again if you have time.

@LakeShen
Copy link
Contributor

LakeShen commented Oct 2, 2023

@herunkang2018 ,LGTM.

And need the second people to review it:)

@@ -2729,6 +2729,7 @@ BigQuery's type system uses confusingly different names for types and functions:
| o | EXTRACT(xml, xpath, [, namespaces ]) | Returns the XML fragment of the element or elements matched by the XPath expression. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression
| o | EXISTSNODE(xml, xpath, [, namespaces ]) | Determines whether traversal of a XML document using a specified xpath results in any nodes. Returns 0 if no nodes remain after applying the XPath traversal on the document fragment of the element or elements matched by the XPath expression. Returns 1 if any nodes remain. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression.
| m | EXTRACTVALUE(xml, xpathExpr)) | Returns the text of the first text node which is a child of the element or elements matched by the XPath expression.
| h s | FIND_IN_SET(string, stringArray) | Returns the index (1-based) of the given *string* in the comma-delimited *stringArray*. Returns 0, if the given *string* was not found or if *string* contains a comma
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From this description I infer that stringArray is of type ARRAY STRING, however it seems to be just a string which may contain commas. (I also think that the function name is not a very good choice, but that cannot be changed.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good advice, the parameter name can be improved to a more concise one, like matchStr and textStr, to avoid ambiguity.

Copy link
Contributor Author

@herunkang2018 herunkang2018 Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add an example to clarify it, like FIND_IN_SET('bc', 'a,bc,def') returns 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -2729,6 +2729,7 @@ BigQuery's type system uses confusingly different names for types and functions:
| o | EXTRACT(xml, xpath, [, namespaces ]) | Returns the XML fragment of the element or elements matched by the XPath expression. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression
| o | EXISTSNODE(xml, xpath, [, namespaces ]) | Determines whether traversal of a XML document using a specified xpath results in any nodes. Returns 0 if no nodes remain after applying the XPath traversal on the document fragment of the element or elements matched by the XPath expression. Returns 1 if any nodes remain. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression.
| m | EXTRACTVALUE(xml, xpathExpr)) | Returns the text of the first text node which is a child of the element or elements matched by the XPath expression.
| h s | FIND_IN_SET(string, stringArray) | Returns the index (1-based) of the given *string* in the comma-delimited *stringArray*. Returns 0, if the given *string* was not found or if *string* contains a comma. For example, FIND_IN_SET('bc', 'a,bc,def') returns 2
Copy link
Contributor

@JiajunBernoulli JiajunBernoulli Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*stringArray* is string not array. Perhaps it's confusing.

How about *strings*, *mkStrings* or *joinStrings*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice. Updated.

@herunkang2018 herunkang2018 changed the title [CALCITE-5826] Add FIND_IN_SET function (enabled in Hive and Spark library) [CALCITE-5826] Add FIND_IN_SET function (enabled in Hive and Spark libraries) Oct 29, 2023
int n = 1;
int lastCommaIndex = -1;
for (int i = 0; i < textStrLen; i++) {
if (textStr.charAt(i) == COMMA_DELIMITER) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use textStr.split(",") ?
It will be easier to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice. Current implementation is consistent with Spark's. For better code readability, change to use split is good, but some performance may be reduced since for loop can early exit when matched and split cannot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@sonarcloud
Copy link

sonarcloud bot commented Oct 31, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

92.1% 92.1% Coverage
0.0% 0.0% Duplication

@libenchao
Copy link
Member

This PR has received 1 approval from @mihaibudiu . Do you have more concerns @LakeShen @JiajunBernoulli ? Let's try to get it in 1.36.0

@LakeShen
Copy link
Contributor

LakeShen commented Nov 1, 2023

This PR has received 1 approval from @mihaibudiu . Do you have more concerns @LakeShen @JiajunBernoulli ? Let's try to get it in 1.36.0

LGTM.

@JiajunBernoulli JiajunBernoulli merged commit 2ddc605 into apache:main Nov 1, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants