-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CALCITE-5826] Add FIND_IN_SET function (enabled in Hive and Spark libraries) #3317
Conversation
Kudos, SonarCloud Quality Gate passed! |
15006fe
to
1866952
Compare
@@ -1076,6 +1076,32 @@ public static int levenshtein(String string1, String string2) { | |||
return LEVENSHTEIN_DISTANCE.apply(string1, string2); | |||
} | |||
|
|||
/** SQL FIND_IN_SET(string, stringArray) function. */ | |||
public static int findInSet(String matchStr, String textStr) { | |||
if (matchStr.contains(",")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you add this check? Could you give the code comments to explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is from the function specification, which refers to Spark's doc. I will add function annotation to address the meaning of this function.
@@ -1076,6 +1076,32 @@ public static int levenshtein(String string1, String string2) { | |||
return LEVENSHTEIN_DISTANCE.apply(string1, string2); | |||
} | |||
|
|||
/** SQL FIND_IN_SET(string, stringArray) function. */ | |||
public static int findInSet(String matchStr, String textStr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this method is public, I think we should handle the cases where matchStr and textStr are null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For null check issue, we can define the null policy of the function, and null policy will handle null properly in code generation phase, so I think there is no need to implement null check in each function implementation additionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @LakeShen.
Check null for public method always valuable.
And your unit test need have null parameter case. @herunkang2018
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the null check. Thanks for the review.
if (matchStr.contains(",")) { | ||
return 0; | ||
} | ||
int textStrLen = textStr.length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final int textStrLen = textStr.length();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this, I will improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
return 0; | ||
} | ||
int textStrLen = textStr.length(); | ||
int matchStrLen = matchStr.length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final int matchStrLen = matchStr.length();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
int textStrLen = textStr.length(); | ||
int matchStrLen = matchStr.length(); | ||
int n = 1; | ||
int lastComma = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastComma -> lastCommaIndex or lastCommaPosition maybe better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
int n = 1; | ||
int lastComma = -1; | ||
for (int i = 0; i < textStrLen; i++) { | ||
if (textStr.charAt(i) == ',') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now ',' is hard coding,are there some character constants that represent this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Updated.
} | ||
} | ||
if (textStrLen - (lastComma + 1) == matchStrLen | ||
&& textStr.substring(lastComma + 1, textStrLen).equals(matchStr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are more code comments would help people understand better,I think you could give more code comments in this method:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function annotation is added.
@herunkang2018 ,thank you for opening this PR, I left a couple of comments,overall looks good. |
@LakeShen Thanks a lot for the review. All comments resolved, please help to review again if you have time. |
@herunkang2018 ,LGTM. And need the second people to review it:) |
site/_docs/reference.md
Outdated
@@ -2729,6 +2729,7 @@ BigQuery's type system uses confusingly different names for types and functions: | |||
| o | EXTRACT(xml, xpath, [, namespaces ]) | Returns the XML fragment of the element or elements matched by the XPath expression. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression | |||
| o | EXISTSNODE(xml, xpath, [, namespaces ]) | Determines whether traversal of a XML document using a specified xpath results in any nodes. Returns 0 if no nodes remain after applying the XPath traversal on the document fragment of the element or elements matched by the XPath expression. Returns 1 if any nodes remain. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression. | |||
| m | EXTRACTVALUE(xml, xpathExpr)) | Returns the text of the first text node which is a child of the element or elements matched by the XPath expression. | |||
| h s | FIND_IN_SET(string, stringArray) | Returns the index (1-based) of the given *string* in the comma-delimited *stringArray*. Returns 0, if the given *string* was not found or if *string* contains a comma |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this description I infer that stringArray is of type ARRAY STRING
, however it seems to be just a string which may contain commas. (I also think that the function name is not a very good choice, but that cannot be changed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good advice, the parameter name can be improved to a more concise one, like matchStr
and textStr
, to avoid ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add an example to clarify it, like FIND_IN_SET('bc', 'a,bc,def')
returns 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
site/_docs/reference.md
Outdated
@@ -2729,6 +2729,7 @@ BigQuery's type system uses confusingly different names for types and functions: | |||
| o | EXTRACT(xml, xpath, [, namespaces ]) | Returns the XML fragment of the element or elements matched by the XPath expression. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression | |||
| o | EXISTSNODE(xml, xpath, [, namespaces ]) | Determines whether traversal of a XML document using a specified xpath results in any nodes. Returns 0 if no nodes remain after applying the XPath traversal on the document fragment of the element or elements matched by the XPath expression. Returns 1 if any nodes remain. The optional namespace value that specifies a default mapping or namespace mapping for prefixes, which is used when evaluating the XPath expression. | |||
| m | EXTRACTVALUE(xml, xpathExpr)) | Returns the text of the first text node which is a child of the element or elements matched by the XPath expression. | |||
| h s | FIND_IN_SET(string, stringArray) | Returns the index (1-based) of the given *string* in the comma-delimited *stringArray*. Returns 0, if the given *string* was not found or if *string* contains a comma. For example, FIND_IN_SET('bc', 'a,bc,def') returns 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*stringArray*
is string not array. Perhaps it's confusing.
How about *strings*
, *mkStrings*
or *joinStrings*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice. Updated.
f01dbfe
to
c4c593d
Compare
int n = 1; | ||
int lastCommaIndex = -1; | ||
for (int i = 0; i < textStrLen; i++) { | ||
if (textStr.charAt(i) == COMMA_DELIMITER) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use textStr.split(",")
?
It will be easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice. Current implementation is consistent with Spark's. For better code readability, change to use split
is good, but some performance may be reduced since for loop can early exit when matched and split
cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
c4c593d
to
e81547f
Compare
Kudos, SonarCloud Quality Gate passed! |
This PR has received 1 approval from @mihaibudiu . Do you have more concerns @LakeShen @JiajunBernoulli ? Let's try to get it in 1.36.0 |
LGTM. |
No description provided.