GH-28866: [Java] Java Dataset API ScanOptions expansion #41646

jinchengchenghh · 2024-05-14T08:15:51Z

Rationale for this change

What changes are included in this PR?

Add support for passing an ArrowSchema to specify the C++ CsvFragmentScanOptions.convert_options.column_types.
Also use a Map to set the config for CsvFragmentScanOptions: the map is serialized in Java and deserialized in C++.
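
As a rough illustration of the "serialize in Java, deserialize in C++" idea, here is a minimal sketch that packs a Map<String, String> into a direct ByteBuffer and reads it back. Judging by the Substrait files touched elsewhere in this PR, the actual change appears to encode the map as a Substrait literal; the length-prefixed layout below is a simpler stand-in chosen for this sketch. MapBufferCodec, encode and decode are hypothetical names, not Arrow APIs, and decode only plays the role that the C++ side would have.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Arrow API. */
final class MapBufferCodec {
  private MapBufferCodec() {}

  /** Layout (assumed for this sketch): entry count, then length-prefixed UTF-8 key and value. */
  static ByteBuffer encode(Map<String, String> config) {
    List<byte[]> fields = new ArrayList<>();
    int size = Integer.BYTES; // room for the entry count
    for (Map.Entry<String, String> e : config.entrySet()) {
      byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
      byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
      fields.add(key);
      fields.add(value);
      size += 2 * Integer.BYTES + key.length + value.length;
    }
    // A direct buffer lives outside the Java heap, so native code can read it in place.
    // The default byte order is big-endian; a native reader would have to agree on it.
    ByteBuffer buf = ByteBuffer.allocateDirect(size);
    buf.putInt(config.size());
    for (byte[] field : fields) {
      buf.putInt(field.length).put(field);
    }
    buf.flip();
    return buf;
  }

  /** Stand-in for the native side: rebuilds the map without moving the caller's position. */
  static Map<String, String> decode(ByteBuffer buf) {
    ByteBuffer in = buf.duplicate();
    int entries = in.getInt();
    Map<String, String> out = new LinkedHashMap<>();
    for (int i = 0; i < entries; i++) {
      String key = readString(in);
      String value = readString(in);
      out.put(key, value);
    }
    return out;
  }

  private static String readString(ByteBuffer in) {
    byte[] bytes = new byte[in.getInt()];
    in.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

Keeping every value as a string keeps the wire format trivial, at the cost of pushing type parsing and validation to the receiving side.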

Are these changes tested?

Yes, with newly added unit tests.

Are there any user-facing changes?

No.

GitHub Issue: [Java] Java Dataset API ScanOptions expansion #28866

github-actions · 2024-05-14T08:16:15Z

⚠️ GitHub issue #28866 has been automatically assigned in GitHub to PR creator.

jinchengchenghh · 2024-05-16T08:45:10Z

Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace

jinchengchenghh · 2024-05-16T08:48:05Z

java/dataset/src/main/java/org/apache/arrow/dataset/scanner/csv/CsvFragmentScanOptions.java

+   */
+  public ByteBuffer serialize() {
+    Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(),
+            parseOptions.entrySet().stream()),


All the options are inserted into a single map because that is easy to implement, and at the moment no option name is shared between the C++ parse_options and read_options. To extend this further, we may need to serialize more precisely. I'm open to changing this if you think we should serialize each option separately.

I believe a more precise serialization for each option would be better. But I see your point; maybe we could do it in a follow-up PR.
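
The trade-off discussed here can be sketched in a few lines. The first method mirrors the flat merge but fails fast if two option groups ever share a name; the second prefixes each key with its group so that a collision cannot happen. OptionMerging, both method names and the "read_options." / "parse_options." prefixes are assumptions made for this sketch, not code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Arrow API. */
final class OptionMerging {
  private OptionMerging() {}

  /** Flat merge: rejects a key that appears in both groups instead of silently overwriting it. */
  static Map<String, String> mergeFlat(
      Map<String, String> readOptions, Map<String, String> parseOptions) {
    Map<String, String> merged = new LinkedHashMap<>(readOptions);
    for (Map.Entry<String, String> e : parseOptions.entrySet()) {
      // putIfAbsent returns the existing value when the key is already mapped.
      if (merged.putIfAbsent(e.getKey(), e.getValue()) != null) {
        throw new IllegalArgumentException("Duplicate option name: " + e.getKey());
      }
    }
    return merged;
  }

  /** Namespaced merge: the group prefix keeps identical option names apart. */
  static Map<String, String> mergeNamespaced(
      Map<String, String> readOptions, Map<String, String> parseOptions) {
    Map<String, String> merged = new LinkedHashMap<>();
    readOptions.forEach((key, value) -> merged.put("read_options." + key, value));
    parseOptions.forEach((key, value) -> merged.put("parse_options." + key, value));
    return merged;
  }
}
```

With namespaced keys the receiving side can route each entry to the right options struct by prefix, which is one possible shape for the follow-up mentioned above.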

cc @lidavidm @westonpace

lidavidm · 2024-05-16T10:06:46Z

CC @vibhatha

jinchengchenghh · 2024-05-17T01:47:30Z

cpp/thirdparty/versions.txt

@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca
 ARROW_S2N_TLS_BUILD_VERSION=v1.3.35
 ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d
 ARROW_THRIFT_BUILD_VERSION=0.16.0
-ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209
+ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b


Accidental change from my local environment; I will remove it.

vibhatha · 2024-05-17T22:39:41Z

cpp/src/arrow/dataset/file_csv.cc

+        column_types[field->name()] = field->type();
+      }
+    } else {
+      return Status::Invalid("Not support this config " + it.first);


Maybe:

Suggested change

return Status::Invalid("Not support this config " + it.first);

return Status::Invalid("Config " + it.first + " is not supported.");
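
For readers following along from the Java side, the key dispatch that this error message belongs to can be pictured roughly as below. The real code is C++ and returns a Status; this Java rendering is only an analogy. The "quoting" and "column_type" keys are taken from the diff hunks quoted in this thread, while CsvConfigReader, Parsed and read are hypothetical names.

```java
import java.util.Map;

/** Hypothetical Java analogy of the C++ option dispatch; not part of the Arrow API. */
final class CsvConfigReader {
  private CsvConfigReader() {}

  /** Holds the recognised options; a null field means the key was absent. */
  static final class Parsed {
    Boolean quoting;
    Long schemaAddress;
  }

  static Parsed read(Map<String, String> config) {
    Parsed parsed = new Parsed();
    for (Map.Entry<String, String> e : config.entrySet()) {
      switch (e.getKey()) {
        case "quoting":
          parsed.quoting = Boolean.parseBoolean(e.getValue());
          break;
        case "column_type":
          // The value carries a native schema address encoded as a decimal string.
          parsed.schemaAddress = Long.parseLong(e.getValue());
          break;
        default:
          // Name the offending key so the caller can see which entry was rejected.
          throw new IllegalArgumentException("Config " + e.getKey() + " is not supported.");
      }
    }
    return parsed;
  }
}
```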

vibhatha · 2024-05-17T22:40:31Z

cpp/src/arrow/engine/substrait/extension_internal.cc

+  }
+
+  if (!literal.has_map()) {
+    return Status::Invalid("Literal does not have map");


nit:

Suggested change

return Status::Invalid("Literal does not have map");

return Status::Invalid("Literal does not have a map");

vibhatha · 2024-05-17T22:42:28Z

java/dataset/src/main/cpp/jni_wrapper.cc

+#endif
+    default:
+      std::string error_message =
+          "illegal file format id: " + std::to_string(file_format_id);


nit:

Suggested change

"illegal file format id: " + std::to_string(file_format_id);

"Illegal file format id: " + std::to_string(file_format_id);

vibhatha · 2024-05-17T22:56:16Z

java/dataset/src/main/java/org/apache/arrow/dataset/scanner/FragmentScanOptions.java

+   * @param config config map
+   * @return bufer to jni call argument, should be DirectByteBuffer
+   */
+  default ByteBuffer serializeMap(Map<String, String> config) {


Is this function just written to pass a Java Map to C++ via JNI?

Can we put it as a private static helper somewhere? No need to expose it publicly as an instance method

vibhatha · 2024-05-17T23:04:11Z

cpp/src/arrow/dataset/file_csv.cc

+    } else if (key == "quoting") {
+      options->parse_options.quoting = parseBool(value);
+    } else if (key == "column_type") {
+      int64_t schema_address = std::stol(value);


Should we check for a possible -1?

I changed the Java side so that it does not add an invalid schema address.
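
One way to read that reply is that a missing schema is signalled by leaving the key out rather than by sending a sentinel such as -1. A minimal sketch of that idea follows. The "column_type" key is the one shown in the diff hunk above; ColumnTypeConfig, build, and the rule that non-positive addresses are invalid are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical helper for illustration; not part of the Arrow API. */
final class ColumnTypeConfig {
  static final String COLUMN_TYPE_KEY = "column_type";

  private ColumnTypeConfig() {}

  /** Copies the base options and adds the schema address only when one was provided. */
  static Map<String, String> build(Map<String, String> baseOptions, OptionalLong schemaAddress) {
    Map<String, String> config = new LinkedHashMap<>(baseOptions);
    if (schemaAddress.isPresent()) {
      long address = schemaAddress.getAsLong();
      // Assumption of this sketch: zero and negative values never denote a usable address.
      if (address <= 0) {
        throw new IllegalArgumentException("Invalid schema address: " + address);
      }
      config.put(COLUMN_TYPE_KEY, Long.toString(address));
    }
    return config;
  }
}
```

With this shape the native side never sees a sentinel value, although a defensive check there would still guard against other callers.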

vibhatha · 2024-05-17T23:17:16Z

java/dataset/src/main/java/org/apache/arrow/dataset/substrait/StringMapNode.java

+
+import io.substrait.proto.Expression;
+
+public class StringMapNode implements Serializable {


Just looking at the functionality, I think what we have here is a util class that converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could go under the substrait.util package. And toProtobuf could be named mapToExpressionLiteral()?

I also have doubts about having a separate class for this purpose though.

vibhatha

I only added a few comments. But I am going to go through the content once more.

lidavidm · 2024-07-29T06:34:20Z

@github-actions crossbow submit java-jars

github-actions · 2024-07-29T06:36:39Z

Revision: 04e4390

Submitted crossbow builds: ursacomputing/crossbow @ actions-4da86e64a2

Task	Status
java-jars

lidavidm · 2024-07-29T07:19:27Z

Sorry, do you mind rebasing again so we can validate the JNI job?

lidavidm · 2024-07-29T07:26:19Z

@github-actions crossbow submit java-jars

github-actions · 2024-07-29T07:28:38Z

Revision: 98a3beb

Submitted crossbow builds: ursacomputing/crossbow @ actions-f053a0e3dd

Task	Status
java-jars

lidavidm · 2024-07-29T08:21:16Z

Sigh, well, another update broke that pipeline again..

vibhatha · 2024-07-29T09:44:55Z

Sigh, well, another update broke that pipeline again..

Hmm, seems like that. I will take a look.

conbench-apache-arrow · 2024-07-30T15:34:36Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fd69e5e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 24 possible false positives for unstable benchmarks that are known to sometimes produce them.

jinchengchenghh · 2024-07-31T07:28:33Z

I found that the arrow-dataset tests don't run in GitHub Actions; is that right? Only the arrow-c-data tests run, because the environment variable ARROW_DATASET is not set. @lidavidm @vibhatha
https://github.com/apache/arrow/blob/main/ci/scripts/java_test.sh#L41
https://github.com/apache/arrow/blob/main/ci/docker/java-jni-manylinux-201x.dockerfile#L53

vibhatha · 2024-07-31T07:32:17Z

@lidavidm this seems to be wrong 🤔

lidavidm · 2024-07-31T13:57:33Z

Can you make a PR?

vibhatha · 2024-07-31T15:28:57Z

Sure I can.

jinchengchenghh · 2024-08-01T01:05:56Z

I created an issue: #43502 @vibhatha
CC @lidavidm

vibhatha · 2024-08-01T01:09:45Z

Thanks @jinchengchenghh

vibhatha · 2024-08-01T01:29:24Z

I have created a PR: #43503

jinchengchenghh requested a review from lidavidm as a code owner May 14, 2024 08:15

jinchengchenghh marked this pull request as draft May 14, 2024 08:15

github-actions bot added the Component: Java and awaiting review labels May 14, 2024

jinchengchenghh force-pushed the option branch from 883417a to 5423c7e May 14, 2024 23:28

github-actions bot added the Component: C++ label May 15, 2024

jinchengchenghh force-pushed the option branch from c90f078 to df5705e May 15, 2024 22:37

jinchengchenghh marked this pull request as ready for review May 16, 2024 08:43

jinchengchenghh requested a review from westonpace as a code owner May 16, 2024 08:43

jinchengchenghh commented May 16, 2024

github-actions bot added the awaiting committer review label and removed the awaiting review label May 16, 2024

jinchengchenghh commented May 17, 2024

jinchengchenghh force-pushed the option branch from a48638b to 1223f36 May 17, 2024 08:16

vibhatha reviewed May 17, 2024

jinchengchenghh force-pushed the option branch from 6d97d44 to 608f568 May 20, 2024 10:00

lidavidm reviewed May 22, 2024

cpp/src/arrow/dataset/file_csv.cc (outdated, resolved)

github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting committer review and awaiting changes labels May 22, 2024

lidavidm approved these changes Jul 29, 2024

github-actions bot added the awaiting merge label and removed the awaiting change review label Jul 29, 2024

jinchengchenghh force-pushed the option branch from 04e4390 to 98a3beb July 29, 2024 07:24

jinchengchenghh added 6 commits July 29, 2024 15:02

support csv option

5dcb72a

address comments

826d442

minor

e96a087

minor

78a27db

address comments

a3f817a

fix compile

98a3beb

lidavidm merged commit fd69e5e into apache:main Jul 30, 2024
14 checks passed

lidavidm removed the awaiting merge label Jul 30, 2024

lidavidm mentioned this pull request Jul 30, 2024

[Java] Java Dataset API ScanOptions expansion #28866

Closed

jinchengchenghh mentioned this pull request Aug 1, 2024

[Java] Fix Java JNI / AMD64 manylinux2014 Java JNI test not test dataset module #43502

Closed

jinchengchenghh mentioned this pull request Aug 1, 2024

GH-43502: [Java] Fix Java JNI / AMD64 manylinux2014 Java JNI test not test dataset module #43503

Merged
