Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support to read partitioned parquet files from S3 #5206

Merged
merged 41 commits into from
Apr 12, 2024

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Feb 28, 2024

Closes #5065

Breaking Change: Renamed KeyValuePartitionLayout to FileKeyValuePartitionLayout.
Reason: KeyValuePartitionLayout is now more generic and has common functionalities which are used by both FileKeyValuePartitionLayout and a new URIStreamKeyValuePartitionLayout. As the name suggestes, FileKeyValuePartitionLayout is used to process key-value partitioned data accessed through java File objects, whereas URIStreamKeyValuePartitionLayout is used to process URIs.

@malhotrashivam malhotrashivam added parquet Related to the Parquet integration DocumentationNeeded ReleaseNotesNeeded Release notes are needed s3 labels Feb 28, 2024
@malhotrashivam malhotrashivam self-assigned this Feb 28, 2024
@malhotrashivam malhotrashivam changed the title [WIP] Added support to read flat partitioned parquet files from S3 [WIP] Added support to read flat partitioned directory of parquet files from S3 Feb 28, 2024
Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python LGTM

Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python LGTM

@malhotrashivam malhotrashivam changed the title [WIP] Added support to read flat partitioned directory of parquet files from S3 [WIP] Added support to read partitioned parquet files from S3 Mar 19, 2024
@@ -849,7 +849,11 @@ private static Pair<TableDefinition, ParquetInstructions> infer(
allColumns.add(ColumnDefinition.fromGenericType(partitionKey, dataType, null,
ColumnDefinition.ColumnType.Partitioning));
}
allColumns.addAll(schemaInfo.getFirst());
// Only read non-partitioning columns from the parquet files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed because the parquet files written by Iceberg had partitioning columns data included in the parquet files. So was leading to duplicate columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure we aren't breaking any existing behavior w/ this change...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are passing and I am able to read all the files in deephaven-examples.
I will wait for Ryan to have a final comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises a question for partitioned writing: should we include partitioning columns redundantly in the parquet files we write?

@malhotrashivam malhotrashivam dismissed stale reviews from chipkent and devinrsmith via b6876cd April 8, 2024 20:02
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to review s3-related files.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing last commit before refactor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing unit tests for the new functionality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chipkent, the new changes that I added since your last review don't add any new functionality in the python code. I just deprecated a number of APIs in the Java code and used the new APIs in the python code. So the existing tests completely cover all the cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the python code in this PR to the state that you last approved and merged this PR.
All the follow up work is now being done in #5358.

@malhotrashivam
Copy link
Contributor Author

malhotrashivam commented Apr 12, 2024

I have reverted the last two comments and will move the refactoring for ParquetTools to a separate PR. All the pending refactoring related comments will be solved as part of that PR (#5358).

For this PR, I have marked the four new methods as deprecated, and these four will be deleted in the next refactoring PR.

@malhotrashivam malhotrashivam merged commit bf6fcdb into deephaven:main Apr 12, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#189

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
breaking DocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read partitioned parquet data from S3
6 participants