RecordBatchTransformer: Handle schema migration and column re-ordering in table scans #602

Open. @sdd wants to merge 10 commits into main from record-batch-evolution-processor.
Conversation

@sdd (Contributor) commented on Sep 4, 2024

Addresses parts 2 and 3 of #405.

Add support for type promotion and default values to the read pipeline.

  • When the scan includes fields that have undergone type promotion since some of the underlying Parquet files were written, those fields are promoted to the newer type before being returned to the user. For example, suppose a table contained a field "a" of type Float and some rows were written, and the table's schema was later changed so that "a" is a Double. When the table is scanned, record batches coming from files written before the type promotion are dynamically converted so that field "a" is of type Double, matching the batches returned from files written after the schema change.
  • When the scan includes fields that have been added since some of the files were written, record batches from those files are dynamically converted, as above, to contain the selected fields that did not exist when the file was written. Such fields are filled with null if the column has no initial-default-value and is not required; if the table schema does specify an initial-default-value for the field, every row receives that value instead. (See the sketch after this list.)
  • If any fields have been renamed, the record batch schemas for batches written before the rename are rewritten to use the new field name.
  • If projected_field_ids is provided, the columns in the returned batches are re-ordered to match the order of the projection.
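
For illustration, here is a minimal sketch of the first two cases using the arrow sub-crates directly; the schema, field names, and helper function are hypothetical and this is not the PR's actual RecordBatchTransformer code. Field "a" is promoted from Float32 to Float64, and a later-added optional field "b" with no initial-default-value is filled with nulls:

```rust
use std::sync::Arc;

use arrow_array::{new_null_array, ArrayRef, Float32Array, RecordBatch};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType, Field, Schema};

/// Migrate a batch written under an old schema (a: Float32) to the current
/// table schema (a: Float64, plus a later-added optional column b: Int64).
fn migrate_batch(old: &RecordBatch) -> Result<RecordBatch, ArrowError> {
    let target_schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Float64, true),
        Field::new("b", DataType::Int64, true),
    ]));

    // Type promotion: cast the old Float32 column "a" up to Float64.
    let promoted_a: ArrayRef = cast(old.column(0), &DataType::Float64)?;

    // Added column: no initial-default-value in the table schema, so fill with nulls.
    let new_b: ArrayRef = new_null_array(&DataType::Int64, old.num_rows());

    RecordBatch::try_new(target_schema, vec![promoted_a, new_b])
}

fn main() -> Result<(), ArrowError> {
    let old_schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Float32, true)]));
    let old_batch = RecordBatch::try_new(
        old_schema,
        vec![Arc::new(Float32Array::from(vec![1.0_f32, 2.5])) as ArrayRef],
    )?;

    let migrated = migrate_batch(&old_batch)?;
    assert_eq!(migrated.schema().field(0).data_type(), &DataType::Float64);
    assert_eq!(migrated.column(1).null_count(), migrated.num_rows());
    Ok(())
}
```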

@sdd (Contributor, Author) commented on Sep 6, 2024

I need to go back to the drawing board on this. The current implementation breaks when not all columns in the file are in the list of projected fields.

@sdd (Contributor, Author) commented on Sep 9, 2024

OK, I've addressed the problem with projections other than the equivalent of SELECT *. All existing tests pass, and the new tests have been extended to cover these cases.

@sdd marked this pull request as ready for review on September 9, 2024, 18:03
@sdd force-pushed the record-batch-evolution-processor branch from 73c6724 to bf6d2ef on September 9, 2024, 18:08
@sdd (Contributor, Author) commented on Sep 9, 2024

@liurenjie1024 and @Xuanwo - ready for review when you get a chance.

@sdd mentioned this pull request on Sep 12, 2024
@sdd force-pushed the record-batch-evolution-processor branch from 69e2f95 to 36a00fc on September 23, 2024, 06:08
@sdd changed the title from "Schema Evolution RecordBatch processor" to "RecordBatchTransformer: Handle schema migration and column re-ordering in table scans" on Sep 23, 2024
@sdd force-pushed the record-batch-evolution-processor branch from 53a1267 to ef3a9fc on September 23, 2024, 07:14
@sdd (Contributor, Author) commented on Sep 23, 2024

@Xuanwo and @liurenjie1024: PTAL, I've rebased and refactored this to better handle the pass-through case. It would be great to get this merged. Without it I'm running into some annoying issues: when a table has had new columns added, pre-existing data lacks the added column, and new data has since been written under the new schema, a table scan streams back record batches that can have different schemas within the same stream, which is very difficult to deal with.

@sdd force-pushed the record-batch-evolution-processor branch from ef3a9fc to e0a1ac8 on September 23, 2024, 08:03
@Xuanwo (Member) left a comment

Thanks a lot for working on this!

@Xuanwo (Member) commented on Sep 23, 2024

Waiting for @liurenjie1024 to take another look.

@liurenjie1024 (Collaborator) left a comment

Thanks @sdd for this high-quality PR, it looks great to me! I believe the current approach works well without nested types, and we can add support for nested types later. Just one minor point about the arrow dependency; everything else looks great!

Cargo.toml (review thread outdated, resolved)
crates/iceberg/Cargo.toml (review thread outdated, resolved)
@sdd force-pushed the record-batch-evolution-processor branch from e0a1ac8 to 3c9bbd1 on October 3, 2024, 21:23
@sdd force-pushed the record-batch-evolution-processor branch from 3c9bbd1 to 61d4bdc on October 3, 2024, 21:25
@sdd (Contributor, Author) commented on Oct 3, 2024

Hi @liurenjie1024 - I addressed your comment around not importing the whole of Arrow.

Additionally, I realised there was an optimisation to be made when only the column names differ between the source and target schemas, so I've added code for that case.

Also, the schemas can contain metadata which, where present, would cause the equality comparison to fail, so I changed the equality check to compare only the aspects of the schema that we're interested in: data type, nullability, and column name.
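
For reference, here is a minimal sketch of such a metadata-insensitive comparison; the helper name and exact shape are illustrative rather than the code in this PR:

```rust
use arrow_schema::Schema;

/// Compare two Arrow schemas only on the aspects we care about: column name,
/// data type, and nullability. Field and schema metadata are deliberately
/// ignored so that otherwise-identical schemas still compare equal.
fn schemas_equivalent(source: &Schema, target: &Schema) -> bool {
    source.fields().len() == target.fields().len()
        && source
            .fields()
            .iter()
            .zip(target.fields().iter())
            .all(|(s, t)| {
                s.name() == t.name()
                    && s.data_type() == t.data_type()
                    && s.is_nullable() == t.is_nullable()
            })
}
```

In the rename-only fast path mentioned above, the column arrays themselves can be reused unchanged and only the batch's schema swapped, since Arrow arrays are reference-counted.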

This is now ready for re-review (FAO @Xuanwo for this also).

Thanks! :-)
