Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-28866: [Java] Java Dataset API ScanOptions expansion #41646

Merged
merged 6 commits into from
Jul 30, 2024

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented May 14, 2024

Rationale for this change

What changes are included in this PR?

Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions

Are these changes tested?

new added UT.

Are there any user-facing changes?

No.

@jinchengchenghh jinchengchenghh marked this pull request as draft May 14, 2024 08:15
Copy link

⚠️ GitHub issue #28866 has been automatically assigned in GitHub to PR creator.

@jinchengchenghh
Copy link
Contributor Author

Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace

*/
public ByteBuffer serialize() {
Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(),
parseOptions.entrySet().stream()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert all the options to a map because it is a easy implement, and now we don't have same option name in CPP parse_options and read_options, but to further extend, we may need to serialize more accurately. I'm open to here if you think we should serialize each option

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe having a better serialize option for each would be better. But I see your point, maybe we could do it in a follow up PR.

cc @lidavidm @westonpace

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 16, 2024
@lidavidm
Copy link
Member

CC @vibhatha

@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca
ARROW_S2N_TLS_BUILD_VERSION=v1.3.35
ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d
ARROW_THRIFT_BUILD_VERSION=0.16.0
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Occasional change, for my local environment, will remove it

column_types[field->name()] = field->type();
}
} else {
return Status::Invalid("Not support this config " + it.first);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
return Status::Invalid("Not support this config " + it.first);
return Status::Invalid("Config " + it.first + " is not supported.");

}

if (!literal.has_map()) {
return Status::Invalid("Literal does not have map");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
return Status::Invalid("Literal does not have map");
return Status::Invalid("Literal does not have a map");

#endif
default:
std::string error_message =
"illegal file format id: " + std::to_string(file_format_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"illegal file format id: " + std::to_string(file_format_id);
"Illegal file format id: " + std::to_string(file_format_id);

* @param config config map
* @return bufer to jni call argument, should be DirectByteBuffer
*/
default ByteBuffer serializeMap(Map<String, String> config) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function just written to pass a Java Map to C++ via JNI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put it as a private static helper somewhere? No need to expose it publicly as an instance method

} else if (key == "quoting") {
options->parse_options.quoting = parseBool(value);
} else if (key == "column_type") {
int64_t schema_address = std::stol(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check for possible -1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in Java side to not add invalid schema address


import io.substrait.proto.Expression;

public class StringMapNode implements Serializable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at the functionality, I think what we have here is a util class which converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could come under substrait.util package. And the toProtobuf could be mapToExpressionLiteral() ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have doubts about having a separate class for this purpose though.

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added a few comments. But I am going to go through the content once more.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels May 22, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jul 29, 2024
@lidavidm
Copy link
Member

@github-actions crossbow submit java-jars

Copy link

Revision: 04e4390

Submitted crossbow builds: ursacomputing/crossbow @ actions-4da86e64a2

Task Status
java-jars GitHub Actions

@lidavidm
Copy link
Member

Sorry, do you mind rebasing again so we can validate the JNI job?

@lidavidm
Copy link
Member

@github-actions crossbow submit java-jars

Copy link

Revision: 98a3beb

Submitted crossbow builds: ursacomputing/crossbow @ actions-f053a0e3dd

Task Status
java-jars GitHub Actions

@lidavidm
Copy link
Member

Sigh, well, another update broke that pipeline again..

@vibhatha
Copy link
Collaborator

Sigh, well, another update broke that pipeline again..

Hmm, seems like that. I will take a look.

@lidavidm lidavidm merged commit fd69e5e into apache:main Jul 30, 2024
14 checks passed
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Jul 30, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fd69e5e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 24 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jinchengchenghh
Copy link
Contributor Author

I find arrow-dataset tests don't run in github Action, is it right? only arrow-c-data test runs.
because environment ARROW_DATASET not set. @lidavidm @vibhatha
https://github.com/apache/arrow/blob/main/ci/scripts/java_test.sh#L41
https://github.com/apache/arrow/blob/main/ci/docker/java-jni-manylinux-201x.dockerfile#L53

@vibhatha
Copy link
Collaborator

@lidavidm this seems to be wrong 🤔

@lidavidm
Copy link
Member

Can you make a PR?

@vibhatha
Copy link
Collaborator

Sure I can.

@jinchengchenghh
Copy link
Contributor Author

I create a issue #43502 @vibhatha
CC @lidavidm

@vibhatha
Copy link
Collaborator

vibhatha commented Aug 1, 2024

Thanks @jinchengchenghh

@vibhatha
Copy link
Collaborator

vibhatha commented Aug 1, 2024

I have created a PR: #43503

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants