Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Java Dataset API ScanOptions expansion #28866

Closed
asfimport opened this issue Jun 24, 2021 · 9 comments
Closed

[Java] Java Dataset API ScanOptions expansion #28866

asfimport opened this issue Jun 24, 2021 · 9 comments

Comments

@asfimport
Copy link
Collaborator

Currently, there are very few scanning options which we can set in the Java Dataset API [1].

Additionally, the options that exist now always must be set from Java, without the possibility to use sensible default values from core Arrow.

For my use-case, I want to be able to set the fragment_readahead option from the Java-side.

 

It would be great if:
 + ScanOptions.java would be expanded to allow us to set more, potentially all options related to scanner creation.
 + Java users can omit options to use the default values, e.g. [2].

It would be good to know what others think, and whether a PR for this is useful.

[1]https://github.com/apache/arrow/blob/master/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java
[2]

constexpr int64_t kDefaultBatchSize = 1 << 20;
constexpr int32_t kDefaultBatchReadahead = 32;
constexpr int32_t kDefaultFragmentReadahead = 8;

Reporter: Sebastiaan Alvarez Rodriguez

Note: This issue was originally created as ARROW-13166. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Hongze Zhang / @zhztheplayer:
Thanks Sebastiaan! As long as the option is defined in C++ side I think it would be no problem to add it to Java side. Probably we should add a builder API to Java ScanOptions when we add many new parameters.

@asfimport
Copy link
Collaborator Author

Sebastiaan Alvarez Rodriguez:
Hey, I later on heard that the scanning API is going to change. It might be easier to hold off a bit, before we find ourselves in a position were we have to reimplement stuff again.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
Can you expand a bit more on what changes you think are coming up?  I think if you use the ScannerBuilder to set the options then I wouldn't expect any changes to the API coming up soon.  There is https://issues.apache.org/jira/browse/ARROW-12311 but it is more about simplifying the user API (so there aren't two ways, ScanOptions and ScannerBuilder, to do the same thing) than it is about removing or changing functionality.

@asfimport
Copy link
Collaborator Author

Sebastiaan Alvarez Rodriguez:
Oh nevermind, this change I was thinking about is actually about the syncscanner, asyncscanner, not of the scanner builder/options (https://issues.apache.org/jira/browse/ARROW-13161?focusedCommentId=17368831&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17368831)

I think it is good to go then.
I already have an implementation around that uses a builder, but don't have time to develop it further. Still, it might be a good base. Shall I open a PR with the base code?

@jinchengchenghh
Copy link
Contributor

I also have the same requirement. I use dataset scan CSV, but cannot set the parseOptions and readOptions by java ScanOptions, I find ScanBuilder can set CsvFragmentScanOptions in C++ side, so the question is how can we serialize and deserialize CsvFragmentScanOptions.

I have 2 proposals:

  1. Like the projection and filter serialization, we could use substrait ExtendedExpression::AdvancedExtension to serialize CsvFragmentScanOptions
  2. Use protobuf message to serialize directly

We can do as this PR does.
https://github.com/apache/incubator-gluten/pull/3393/files

The first proposal will be ambiguous, because user does not know how to generate the serialized binary, they should be familliar with the AdvancedExtension detail implementation, I would prefer the second.

I can help to implement this feature, can you help to confirm the proposal? @westonpace

@westonpace
Copy link
Member

Is the JNI -> dataset API using substrait today? If so, then I think 1 is preferred. However, the user should not provide AdvancedExtension, this should be taken care of internally.

For example, the user Java API for "scan CSV" could take in HashMap<String, String>. The Java implementation can convert this into a AdvancedExtension similar to the one described in incubator-gluten. Acero could be updated to detect this and configure the fragment options appropriately.

I will add, for consideration, a third proposal, which is that we could add CSV as a read type in Substrait (e.g. make a PR on the Substrait repo). There was some work started here but it was never finished.

That third option will be slower though. It would require Substrait alignment on what CSV options are common. This will take time and energy. It would be the most useful long term option.

@jinchengchenghh
Copy link
Contributor

No, now it just uses the binary serialized substrait as jni input. https://github.com/apache/arrow/blob/main/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java#L176

HashMap is OK for most of the options, but the CsvFragmentScanOptions include other fields, I'm not sure if the ConvertOptions may have same config name with ReadOptions in the future. And ConvertOptions
has complex config value like std::unordered_map<std::string, std::shared_ptr<DataType>> column_types; it should also be serialized, but we can support the basic option by HashMap, I'm not sure if their is requirement for the complex options in the future. What do you think? @westonpace
https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/file_csv.h#L85

@westonpace
Copy link
Member

No, now it just uses the binary serialized substrait as jni input.

If it is using serialized substrait as input then I don't understand how the C++ consumer would be able to apply the CSV options onto the plan.

Would this be a second pass that transforms the plan once it is deserialized? E.g. something like:

def run_plan(substrait_plan, extra_options):
  plan = deserialize_plan(substrait_plan)
  plan = transform_plan(plan, extra_options)
  run_deserialized_plan(plan)

No, now it just uses the binary serialized substrait as jni input

I still think approach 1 is preferred. The extra options can be supplied as part of ReadRel::advanced_extension.

And ConvertOptions
has complex config value like std::unordered_map<std::string, std::shared_ptr> column_types; it should also be serialized, but we can support the basic option by HashMap, I'm not sure if their is requirement for the complex options in the future.

I think HashMap should be sufficient. Passing shared_ptr<DataType> across the JNI boundary is going to be error-prone. If we need advanced configuration like that then it should be done with a more complicated Substrait mesasge (e.g. use CsvOptions substrait message instead of a simple map. There are Substrait messages for types already).

lidavidm pushed a commit that referenced this issue Jul 30, 2024
### Rationale for this change

### What changes are included in this PR?

Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions

### Are these changes tested?
new added UT.

### Are there any user-facing changes?
No.

* GitHub Issue: #28866

Authored-by: Chengcheng Jin <[email protected]>
Signed-off-by: David Li <[email protected]>
@lidavidm lidavidm added this to the 18.0.0 milestone Jul 30, 2024
@lidavidm
Copy link
Member

Issue resolved by pull request 41646
#41646

jinchengchenghh added a commit to jinchengchenghh/arrow that referenced this issue Jul 30, 2024
jinchengchenghh added a commit to jinchengchenghh/arrow that referenced this issue Jul 30, 2024
lidavidm pushed a commit that referenced this issue Aug 6, 2024
…#43482)

### Rationale for this change

Support more CSV fragment scan options

### What changes are included in this PR?

Implement nearly all cpp code supported CsvFragmentScanOptions

### Are these changes tested?
Yes, new added test and exists UT.

### Are there any user-facing changes?
No.

* GitHub Issue: #28866
* GitHub Issue: #43483

Authored-by: Chengcheng Jin <[email protected]>
Signed-off-by: David Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants