-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-28866: [Java] Java Dataset API ScanOptions expansion #41646
Conversation
|
Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace |
*/ | ||
public ByteBuffer serialize() { | ||
Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(), | ||
parseOptions.entrySet().stream()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert all the options to a map because it is a easy implement, and now we don't have same option name in CPP parse_options and read_options, but to further extend, we may need to serialize more accurately. I'm open to here if you think we should serialize each option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe having a better serialize option for each would be better. But I see your point, maybe we could do it in a follow up PR.
CC @vibhatha |
cpp/thirdparty/versions.txt
Outdated
@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca | |||
ARROW_S2N_TLS_BUILD_VERSION=v1.3.35 | |||
ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d | |||
ARROW_THRIFT_BUILD_VERSION=0.16.0 | |||
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209 | |||
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Occasional change, for my local environment, will remove it
cpp/src/arrow/dataset/file_csv.cc
Outdated
column_types[field->name()] = field->type(); | ||
} | ||
} else { | ||
return Status::Invalid("Not support this config " + it.first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
return Status::Invalid("Not support this config " + it.first); | |
return Status::Invalid("Config " + it.first + " is not supported."); |
} | ||
|
||
if (!literal.has_map()) { | ||
return Status::Invalid("Literal does not have map"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
return Status::Invalid("Literal does not have map"); | |
return Status::Invalid("Literal does not have a map"); |
#endif | ||
default: | ||
std::string error_message = | ||
"illegal file format id: " + std::to_string(file_format_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
"illegal file format id: " + std::to_string(file_format_id); | |
"Illegal file format id: " + std::to_string(file_format_id); |
* @param config config map | ||
* @return bufer to jni call argument, should be DirectByteBuffer | ||
*/ | ||
default ByteBuffer serializeMap(Map<String, String> config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function just written to pass a Java Map to C++ via JNI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put it as a private static helper somewhere? No need to expose it publicly as an instance method
cpp/src/arrow/dataset/file_csv.cc
Outdated
} else if (key == "quoting") { | ||
options->parse_options.quoting = parseBool(value); | ||
} else if (key == "column_type") { | ||
int64_t schema_address = std::stol(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check for possible -1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it in Java side to not add invalid schema address
|
||
import io.substrait.proto.Expression; | ||
|
||
public class StringMapNode implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at the functionality, I think what we have here is a util class which converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could come under substrait.util
package. And the toProtobuf
could be mapToExpressionLiteral()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have doubts about having a separate class for this purpose though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added a few comments. But I am going to go through the content once more.
@github-actions crossbow submit java-jars |
Revision: 04e4390 Submitted crossbow builds: ursacomputing/crossbow @ actions-4da86e64a2
|
Sorry, do you mind rebasing again so we can validate the JNI job? |
@github-actions crossbow submit java-jars |
Revision: 98a3beb Submitted crossbow builds: ursacomputing/crossbow @ actions-f053a0e3dd
|
Sigh, well, another update broke that pipeline again.. |
Hmm, seems like that. I will take a look. |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fd69e5e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 24 possible false positives for unstable benchmarks that are known to sometimes produce them. |
I find arrow-dataset tests don't run in github Action, is it right? only arrow-c-data test runs. |
Can you make a PR? |
Sure I can. |
Thanks @jinchengchenghh |
I have created a PR: #43503 |
Rationale for this change
What changes are included in this PR?
Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions
Are these changes tested?
new added UT.
Are there any user-facing changes?
No.