Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TableSchema to JSON conversion. #28274

Conversation

damondouglas
Copy link
Contributor

@damondouglas damondouglas commented Aug 31, 2023

This PR addresses #28080 by refactoring the static BigQueryIO.read(SerializableFunction) to not need a try/catch when invoking the DatumReaderFactory.

The existing code first converted the TableSchema to a JSON string using BigQueryIO.JSON_FACTORY's toString method. This throws an IOException that the original code logged as a warning. A GenericDatumTransformer's constructor received that JSON string that was used to set a private Suppler property, called tableSchema. This Supplier property, when invoked, would convert the JSON string back to a TableSchema object.

This PR simplifies the code by creating a new GenericDatumTransformer constructor that receives the TableSchema directly and sets the internal TableSchema property, removing the need for a Supplier and the need to re-parse the JSON string back to a TableSchema.

Since the existing GenericDatumTransformer constructor that received a JSON string representation of the TableSchema was public, I did not remove it.

To validate these changes, I ran:

./gradlew :sdks:java:io:google-cloud-platform:check

and

./gradlew :sdks:java:io:google-cloud-platform:integrationTest


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
    Update CHANGES.md with noteworthy changes.
    If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@damondouglas damondouglas marked this pull request as ready for review September 1, 2023 16:11
@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2023

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@damondouglas
Copy link
Contributor Author

assign set of reviewers

@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2023

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Copy link
Contributor

Abacn commented Sep 6, 2023

It is not obvious to me wether it was intended / this would introduce breaking change. Have you tried injecting an error in the try block and see when and where the exception will throw?

@damondouglas
Copy link
Contributor Author

@Abacn Thank you for reviewing. This was challenging to replicate. Due to the use of a static final BigQueryIO.JSON_FACTORY. However, when I executed the tests in debug mode, I observed this lambda was called first. We should prevent downstream processing of data if the tableschema cannot be parsed which would present bigger problems later.

@Abacn
Copy link
Contributor

Abacn commented Sep 7, 2023

I mean we can just change the current PR to

.setDatumReaderFactory(
            (SerializableFunction<TableSchema, AvroSource.DatumReaderFactory<T>>)
                input -> {throw new IllegalStateException("test exception");}

or throw the Exception by chance, e.g.

private static Random = new Random();

...
.setDatumReaderFactory(
            (SerializableFunction<TableSchema, AvroSource.DatumReaderFactory<T>>)
                input -> {
                  TableSchema safeInput = checkStateNotNull(input);
                  if (random.nextDouble() < _some_probability) throw new IllegalStateException("test exception");
                  try {
                    String jsonTableSchema = BigQueryIO.JSON_FACTORY.toString(safeInput);
                    return (AvroSource.DatumReaderFactory<T>)
                        (writer, reader) ->
                            new GenericDatumTransformer<>(parseFn, jsonTableSchema, writer);
                  } catch (IOException e) {
                    throw new IllegalStateException(
                        String.format(
                            "error converting TableSchema to JSON: %s, error: %s", safeInput, e));
                  }
                })

and see how the tests behaves. In general for changes that add exception to throw we can always tests like this

String jsonTableSchema = BigQueryIO.JSON_FACTORY.toString(input);
return (AvroSource.DatumReaderFactory<T>)
(writer, reader) ->
new GenericDatumTransformer<>(parseFn, jsonTableSchema, writer);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyzing the code further, I saw that the original GenericDatumTransformer's constructor received a JSON string so that later it would re-parse that JSON into a TableSchema again but via a Supplier function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was because TableSchema is not serializable

}
TableSchema safeInput = checkStateNotNull(input);
return (AvroSource.DatumReaderFactory<T>)
(writer, reader) -> new GenericDatumTransformer<>(parseFn, safeInput, writer);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the new GenericDatumTransformer constructor that just takes the TableSchema input directly instead of needing to convert to JSON and then back again to the TableSchema later.

@@ -689,7 +696,7 @@ public void setSchema(org.apache.avro.Schema schema) {
@Override
public T read(T reuse, Decoder in) throws IOException {
GenericRecord record = (GenericRecord) this.reader.read(reuse, in);
return parseFn.apply(new SchemaAndRecord(record, this.tableSchema.get()));
return parseFn.apply(new SchemaAndRecord(record, this.tableSchema));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removes the use of a Supplier function to store the TableSchema and simply uses a class property to hold the TableSchema.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I digged into this was added back to 7fde976 (https://issues.apache.org/jira/browse/BEAM-2532) then refactored into BigQueryIO.java by #22718. Is BEAM-2532 no longer an issue?

Copy link
Contributor

@Abacn Abacn Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(answer myself) I think after the change of #22718, the supplier is not needed, as it no longer involves preserving a TableSchema in SerializableFunction. It is now a named static class GenericDatumTransformer not implementing Serializable.

@damondouglas damondouglas changed the title Rethrow error converting TableSchema to JSON Remove TableSchema to JSON conversion. Sep 13, 2023
@damondouglas
Copy link
Contributor Author

R: @johnjcasey :-) Thank you

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@damondouglas
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

this.tableSchema =
Suppliers.memoize(
Suppliers.compose(new TableSchemaFunction(), Suppliers.ofInstance(tableSchema)));
this.tableSchema = new TableSchemaFunction().apply(tableSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove TableSchemaFunction as well as this first constructor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this suggestion. It was needed to convert the JSON string constructor parameter into a TableSchema. I felt that since the method was public, we needed to still support it and thus needed this conversion using TableSchemaFunction. @Abacn in light of this context, do you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableSchemaFunction is private and GenericDatumTransformer is package private. The public constructor just mean it can be called within the package

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, as of this PR itself the current change suffices and LGTM. This could be cleaned up / or leave it as is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to removing TableSchemaFunction and the constructor, it can't be accessed by users anyways. We can get the tableschema from json directly with BigQueryHelpers.fromJsonString(tableschema, TableSchema.class)

@damondouglas
Copy link
Contributor Author

@johnjcasey / @Abacn Java_GCP_IO_Direct keeps failing for a reason seemingly independent of this PR's code changes. I was able to validate these changes on my own running the two gradle commands in the description.

@Abacn
Copy link
Contributor

Abacn commented Sep 14, 2023

@johnjcasey / @Abacn Java_GCP_IO_Direct keeps failing for a reason seemingly independent of this PR's code changes. I was able to validate these changes on my own running the two gradle commands in the description.

This was a test recently added, as part of effort for test coverage of BigQuery IO. It should be irrelevant (storage api write) CC: @ahmedabu98

@ahmedabu98
Copy link
Contributor

Run Java_GCP_IO_Direct PreCommit

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment but otherwise looks good, lets see if this test passes

this.tableSchema =
Suppliers.memoize(
Suppliers.compose(new TableSchemaFunction(), Suppliers.ofInstance(tableSchema)));
this.tableSchema = new TableSchemaFunction().apply(tableSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to removing TableSchemaFunction and the constructor, it can't be accessed by users anyways. We can get the tableschema from json directly with BigQueryHelpers.fromJsonString(tableschema, TableSchema.class)

@github-actions
Copy link
Contributor

github-actions bot commented Sep 16, 2023

Test Results

1 905 tests   1 731 ✔️  1h 55m 5s ⏱️
   202 suites     172 💤
   202 files           2

For more details on these failures, see this check.

Results for commit e8a738a.

♻️ This comment has been updated with latest results.

@damondouglas damondouglas merged commit 7e83059 into apache:master Sep 18, 2023
16 of 18 checks passed
@damondouglas damondouglas deleted the i-28080/java/BigQueryIO/datumReaderFactory branch September 18, 2023 16:10
Abacn added a commit that referenced this pull request Sep 19, 2023
Abacn added a commit that referenced this pull request Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants