Make partitioned parquet reading deterministic #4739

devinrsmith · 2023-10-29T20:01:55Z

This ensures that the firstEntryPath is the lexicographically first entry.

Part of #4738

This ensures that the firstEntryPath is the lexicographically first entry. Fixes deephaven#4738
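A minimal sketch of the idea in the description, not the PR's actual code (the diff shown further down is cut off after its first added line). The class and method names `FirstEntry` and `firstEntry` are hypothetical, and comparing by file-name string is an assumption about what "lexicographically first" means here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.StreamSupport;

// Hypothetical helper; not part of ParquetTools.
final class FirstEntry {

    /**
     * Returns the entry of {@code dir} whose file name is smallest under
     * String.compareTo, or an empty Optional if the directory has no entries.
     */
    static Optional<Path> firstEntry(Path dir) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            // DirectoryStream makes no ordering promise, so take an explicit
            // minimum instead of relying on the first element of the iterator.
            return StreamSupport.stream(stream.spliterator(), false)
                    .min(Comparator.comparing((Path p) -> p.getFileName().toString()));
        }
    }
}
```

Taking the minimum inside the try-with-resources block matters because the stream is lazy and the directory handle is closed when the block exits.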

rcaudy · 2023-10-30T14:05:26Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

@@ -571,11 +572,13 @@ private static Table readTableInternal(
            }
            final Path firstEntryPath;
            try (final DirectoryStream<Path> sourceStream = Files.newDirectoryStream(sourcePath)) {
-                final Iterator<Path> entryIterator = sourceStream.iterator();
-                if (!entryIterator.hasNext()) {
+                // Lexicographical comparison


Two things to raise here:

@malhotrashivam take note, this may be a conflict with your PR

@devinrsmith / @malhotrashivam I think we might prefer to take the max, instead of the min. If we think lexicographic order matters, it's often the case that there is a sortable timestamp (e.g. YYYY-MM-DD-blah) in the name. Taking the most recent gives us a better guess at the intended schema.

devinrsmith · 2023-11-06

So, the actual inference step is separate from this outer layer (readPartitionedTableInferSchema). My goal here was to make the existing behavior more deterministic; in this case, I think taking the min entry more closely matches the intentions and probable behavior of the existing code.

rcaudy · 2023-11-06

This doesn't make anything more deterministic, unless you read the DirectoryStream JavaDocs with no context. Directories have an order on every file system I'm aware of.
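To make the min-versus-max question in this thread concrete, here is a small sketch over plain strings. The class `EntryChoice` and its methods are hypothetical, and the date-prefixed names are invented for illustration; nothing here is taken from the Deephaven code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of picking the lexicographic min or max name.
final class EntryChoice {

    /** Smallest name under String.compareTo (code-unit order). */
    static Optional<String> lexMin(List<String> names) {
        return names.stream().min(Comparator.naturalOrder());
    }

    /** Largest name under String.compareTo (code-unit order). */
    static Optional<String> lexMax(List<String> names) {
        return names.stream().max(Comparator.naturalOrder());
    }
}
```

For the names `2023-10-28-trades`, `2023-10-29-trades`, and `2023-10-30-trades`, `lexMin` picks `2023-10-28-trades` and `lexMax` picks `2023-10-30-trades`, so the max corresponds to the most recent entry. This only holds when the date fields are zero-padded: `2023-9-1` compares greater than `2023-10-1` as a string.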

devinrsmith · 2023-10-30T15:12:26Z

Looking back at it, I actually doubt my change is complete: it only sets us up deterministically for choosing between ParquetKeyValuePartitionedLayout and ParquetFlatPartitionedLayout, not for how those layouts work internally. I'm going to take another pass at this soon.
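One generic way to make a traversal independent of directory iteration order is to sort the full listing before using it. The sketch below is only an illustration of that technique under stated assumptions: `SortedListing` and `sortedEntries` are hypothetical names, and this is not a description of how the partitioned layouts work or of what any follow-up change does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of any Deephaven layout class.
final class SortedListing {

    /** Lists the entries of {@code dir}, ordered by file-name string. */
    static List<Path> sortedEntries(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            // Path.compareTo is provider specific, so compare the file-name
            // strings to get the same order on every platform.
            return entries
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }
}
```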

devinrsmith · 2023-11-06T18:36:36Z

I think there is still value in merging this PR to make things more deterministic, but it doesn't actually fix the inference step. Will leave that for later.

malhotrashivam · 2023-11-06T19:32:34Z

I think the change does what it's supposed to do. Do you think we should add a tiny test for this?

devinrsmith · 2023-11-07T17:03:59Z

Closed in favor of more deterministic inference in #4783

Make partitioned parquet reading deterministic

95fe9a1

This ensures that the firstEntryPath is the lexicographically first entry. Fixes deephaven#4738

devinrsmith added the bug, parquet, NoDocumentationNeeded, and NoReleaseNotesNeeded labels Oct 29, 2023

devinrsmith added this to the October 2023 milestone Oct 29, 2023

devinrsmith requested a review from rcaudy October 29, 2023 20:01

devinrsmith self-assigned this Oct 29, 2023

rcaudy requested a review from malhotrashivam October 30, 2023 14:02

rcaudy reviewed Oct 30, 2023

devinrsmith modified the milestones: October 2023, November 2023 Nov 2, 2023

devinrsmith added 2 commits November 6, 2023 10:17

Merge remote-tracking branch 'upstream/main' into fix-non-deterministic-parquet-read

7196669

Extract filter

81facb6

devinrsmith requested a review from rcaudy November 6, 2023 18:35

devinrsmith closed this Nov 7, 2023

github-actions bot locked and limited conversation to collaborators Nov 7, 2023
