
Coerce types on read #76

Open: wants to merge 1 commit into base: main
Conversation

@aykut-bozkurt (Collaborator) commented Nov 11, 2024

`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the Parquet file schema.
For example, an `INT32` column in the Parquet schema cannot be read into a Postgres column of type `int64`.
We can avoid this by casting the arrow array to the array type expected by the tupledesc schema, when such
a cast is possible. For that we can use the `arrow-cast` crate, which lives in the same project as `arrow`.
Its public API lets us check whether a cast between two arrow types is possible and then perform the cast.

To make sure the cast is possible, we perform two checks:

  1. `arrow-cast` allows the cast from the arrow type in the Parquet file to the arrow type in the schema
    that is generated for the tupledesc,
  2. the cast is meaningful in Postgres: there must be an explicit cast from the Postgres type that corresponds
    to the arrow type in the Parquet file to the Postgres type in the tupledesc.

With that, we can cast between many castable types, as shown below:

  • INT16 => INT32
  • UINT32 => INT64
  • FLOAT32 => FLOAT64
  • LargeUtf8 => UTF8
  • LargeBinary => Binary
  • Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

NOTE: Struct fields must match by name and position to be cast.

Closes #67.

Part of #49.


codecov bot commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 83.49328% with 172 lines in your changes missing coverage. Please review.

Project coverage is 91.63%. Comparing base (518a5ac) to head (a343a8a).

Files with missing lines Patch % Lines
src/lib.rs 82.39% 116 Missing ⚠️
src/arrow_parquet/schema_parser.rs 73.01% 51 Missing ⚠️
src/arrow_parquet/arrow_to_pg.rs 96.21% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
- Coverage   92.83%   91.63%   -1.20%     
==========================================
  Files          62       62              
  Lines        7645     8478     +833     
==========================================
+ Hits         7097     7769     +672     
- Misses        548      709     +161     


@aykut-bozkurt aykut-bozkurt force-pushed the aykut/coerce-types-on-read branch 13 times, most recently from 0c87ae1 to 984567b Compare November 14, 2024 17:03

Successfully merging this pull request may close these issues.

Cannot copy string column: table expected "Utf8" but file had "LargeUtf8" error