Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes and improved support for Parquet TIMESTAMP #4801

Merged
merged 5 commits into from
Nov 10, 2023

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Nov 9, 2023

As part of #4421, we started throwing an exception on reading Parquet TIMESTAMP fields with isAdjustedToUTC set as false. After this change:

  • Such fields will be read as java.time.LocalDateTime
  • java.time.LocalDateTime columns will be written as Parquet TIMESTAMP fields with isAdjustedToUTC=false. Earlier they were written as binary data with a codec.

Also, this PR fixes the bugs introduced in #4775 and #4755 that can lead to a ClassCastException in some cases on reading Parquet DATE and TIME columns.

Related to #976

@malhotrashivam malhotrashivam added feature request New feature or request parquet Related to the Parquet integration NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. labels Nov 9, 2023
@malhotrashivam malhotrashivam added this to the November 2023 milestone Nov 9, 2023
@malhotrashivam malhotrashivam self-assigned this Nov 9, 2023
jmao-denver
jmao-denver previously approved these changes Nov 9, 2023
Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python changes LGTM.

return LogicalTypeAnnotation.timeType(true, LogicalTypeAnnotation.TimeUnit.MICROS);
case TIMESTAMP_MILLIS:
// TODO(deephaven-core#976) Assuming that time is adjusted to UTC
return LogicalTypeAnnotation.timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
// Converted type doesn't have isAdjustedToUTC parameter, so use the information from logical type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an ugly hack that I had to do because of the code inside ParquetFileReader::buildChildren which looks like this:

if (schemaElement.isSetConverted_type()) {
    LogicalTypeAnnotation originalType =
            getLogicalTypeAnnotation(schemaElement.converted_type, schemaElement);
    LogicalTypeAnnotation newOriginalType = schemaElement.isSetLogicalType()
            && getLogicalTypeAnnotation(schemaElement.logicalType) != null
                    ? getLogicalTypeAnnotation(schemaElement.logicalType)
                    : null;
    if (!originalType.equals(newOriginalType)) {
        ((Types.Builder) childBuilder).as(originalType);
    }
}

Mainly we are trying to deduce the type of the column from schemaElement.converted_type and schemaElement.logicalType and in case of mismatch, we go with type deduced from schemaElement.converted_type.

The problem is that schemaElement.converted_type doesn't have any information about adjustments to UTC. So to deduce it correctly, I had to read the isAdjustedToUTC from the logical type.

Ideally, I would have liked to change the above code to prefer the type deduced from schemaElement.logicalType over schemaElement.converted_type because converted_type is deprecated and superseded by logicalType (based on the Javadoc for converted_type).
But this would be a much bigger change and impact all parquet types. So I added this hack which just impacts the TIMESTAMP type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the side effect of switching to use the logical type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a test commit here and started the nightly jobs. Let me see how this work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nightly jobs run fine.

return LogicalTypeAnnotation.timeType(true, LogicalTypeAnnotation.TimeUnit.MICROS);
case TIMESTAMP_MILLIS:
// TODO(deephaven-core#976) Assuming that time is adjusted to UTC
return LogicalTypeAnnotation.timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
// Converted type doesn't have isAdjustedToUTC parameter, so use the information from logical type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the side effect of switching to use the logical type?

@malhotrashivam malhotrashivam changed the title Added support to read and write Parquet TIMESTAMP fields with isAdjustedToUTC=false Bug fixes and adding support for Parquet TIMESTAMP with isAdjustedToUTC=false Nov 10, 2023
@malhotrashivam malhotrashivam changed the title Bug fixes and adding support for Parquet TIMESTAMP with isAdjustedToUTC=false Bug fixes and improved support for Parquet TIMESTAMP Nov 10, 2023
@malhotrashivam malhotrashivam added the bug Something isn't working label Nov 10, 2023
@malhotrashivam malhotrashivam merged commit d162c89 into deephaven:main Nov 10, 2023
12 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 10, 2023
@@ -981,6 +982,21 @@ public static long epochNanos(@Nullable final ZonedDateTime dateTime) {
return safeComputeNanos(dateTime.toEpochSecond(), dateTime.getNano());
}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case my comments seem strong, this library is very exposed to users, so it needs to be ultra currated. Functions should only get added when there is a compelling reason.

  1. I am not a fan of having hard-coded methods for any specific timezone.
  2. The methods should accept a time zone as an input.
  3. If methods are added for LocalDateTime, LocalDateTime signatures should be added to all relevant methods.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working feature request New feature or request NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants