-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Java] Java Dataset API ScanOptions expansion #28866
Comments
Hongze Zhang / @zhztheplayer: |
Sebastiaan Alvarez Rodriguez: |
Weston Pace / @westonpace: |
Sebastiaan Alvarez Rodriguez: I think it is good to go then. |
I also have the same requirement. I use dataset scan CSV, but cannot set the parseOptions and readOptions by java ScanOptions, I find ScanBuilder can set CsvFragmentScanOptions in C++ side, so the question is how can we serialize and deserialize CsvFragmentScanOptions. I have 2 proposals:
We can do as this PR does. The first proposal will be ambiguous, because user does not know how to generate the serialized binary, they should be familliar with the AdvancedExtension detail implementation, I would prefer the second. I can help to implement this feature, can you help to confirm the proposal? @westonpace |
Is the JNI -> dataset API using substrait today? If so, then I think 1 is preferred. However, the user should not provide For example, the user Java API for "scan CSV" could take in I will add, for consideration, a third proposal, which is that we could add CSV as a read type in Substrait (e.g. make a PR on the Substrait repo). There was some work started here but it was never finished. That third option will be slower though. It would require Substrait alignment on what CSV options are common. This will take time and energy. It would be the most useful long term option. |
No, now it just uses the binary serialized substrait as jni input. https://github.com/apache/arrow/blob/main/java/dataset/src/test/java/org/apache/arrow/dataset/substrait/TestAceroSubstraitConsumer.java#L176 HashMap is OK for most of the options, but the |
If it is using serialized substrait as input then I don't understand how the C++ consumer would be able to apply the CSV options onto the plan. Would this be a second pass that transforms the plan once it is deserialized? E.g. something like:
I still think approach 1 is preferred. The extra options can be supplied as part of
I think |
### Rationale for this change ### What changes are included in this PR? Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions ### Are these changes tested? new added UT. ### Are there any user-facing changes? No. * GitHub Issue: #28866 Authored-by: Chengcheng Jin <[email protected]> Signed-off-by: David Li <[email protected]>
Issue resolved by pull request 41646 |
…#43482) ### Rationale for this change Support more CSV fragment scan options ### What changes are included in this PR? Implement nearly all cpp code supported CsvFragmentScanOptions ### Are these changes tested? Yes, new added test and exists UT. ### Are there any user-facing changes? No. * GitHub Issue: #28866 * GitHub Issue: #43483 Authored-by: Chengcheng Jin <[email protected]> Signed-off-by: David Li <[email protected]>
Currently, there are very few scanning options which we can set in the Java Dataset API [1].
Additionally, the options that exist now always must be set from Java, without the possibility to use sensible default values from core Arrow.
For my use-case, I want to be able to set the
fragment_readahead
option from the Java-side.It would be great if:
+
ScanOptions.java
would be expanded to allow us to set more, potentially all options related to scanner creation.+ Java users can omit options to use the default values, e.g. [2].
It would be good to know what others think, and whether a PR for this is useful.
[1]https://github.com/apache/arrow/blob/master/java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java
[2]
arrow/cpp/src/arrow/dataset/scanner.h
Lines 51 to 53 in ad5dc82
Reporter: Sebastiaan Alvarez Rodriguez
Note: This issue was originally created as ARROW-13166. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: