
Cannot copy string column: table expected "Utf8" but file had "LargeUtf8" error #67

Open
oliora opened this issue Nov 3, 2024 · 2 comments · May be fixed by #76
Labels
bug Something isn't working

Comments

@oliora

oliora commented Nov 3, 2024

When I try to copy data from a local Parquet file to the database, I get the following error:

type mismatch for column "bla" between table and parquet file. table expected "Utf8" but file had "LargeUtf8"

Copy command:

copy table_bla from '/path/table_bla.parquet';

Column schema:

bla              | character varying           |           |          |

Parquet file schema:

/path/table_bla.parquet | bla     | BYTE_ARRAY |             | OPTIONAL        |              | UTF8             |       |           |          | STRING
oliora changed the title from 'Cannot copy string column: table expected "Utf8" but file had "LargeUtf8"' to 'Cannot copy string column: table expected "Utf8" but file had "LargeUtf8" error' Nov 3, 2024
@aykut-bozkurt
Collaborator

pg_parquet writes and reads Arrow's Utf8-encoded StringArray, whose 32-bit offsets cap the string data in an array at 2GB. LargeUtf8 uses 64-bit offsets, which lifts that limit.

I think there is an interop issue here. pg_parquet always assumes the Arrow string array is encoded as Utf8 (we picked Utf8 since Postgres already has a 1GB limit for text values). The file looks like it was written by another tool, since its strings are encoded as LargeUtf8.

If you have a chance to rewrite the file with strings encoded as Utf8, that would be a quick workaround. Otherwise, pg_parquet needs to pick the correct encoding, Utf8 or LargeUtf8, before reading the file.
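
As a minimal sketch of the on-the-fly conversion idea (assuming the `arrow` crate from arrow-rs; the example array is made up and this is not pg_parquet's actual read path), a `LargeUtf8` array can be cast to the `Utf8` array pg_parquet expects:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, LargeStringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A LargeUtf8 column, as another writer (e.g. pandas) might have produced it.
    let large: ArrayRef = Arc::new(LargeStringArray::from(vec!["foo", "bar"]));
    assert_eq!(large.data_type(), &DataType::LargeUtf8);

    // Cast to the Utf8 encoding pg_parquet expects. This only succeeds while the
    // array's string data fits within Utf8's 32-bit offsets (about 2GB).
    let small = cast(&large, &DataType::Utf8)?;
    assert_eq!(small.data_type(), &DataType::Utf8);

    Ok(())
}
```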

aykut-bozkurt added the "bug" label Nov 4, 2024
@oliora
Author

oliora commented Nov 4, 2024

I got this file from a third party (it was probably written with pandas or some other analytics library), so I can't simply recreate it, although I could probably load and re-save it with other tools. But I know for sure that the LargeUtf8 fields are quite small and will fit into the corresponding columns in my database, so it would be convenient if pg_parquet could convert them on the fly.

aykut-bozkurt added a commit that referenced this issue Nov 11, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the schema from the parquet file.
e.g. an `INT32` type in the parquet schema cannot be read into a Postgres column of `int64` type.
We can avoid this by adding an `is_coercible(from_type, to_type)` check while matching the expected schema
against the parquet file.

With that, we can coerce from parquet source types to Postgres destination types as shown below:
- INT16 => {int32, int64}
- INT32 => {int64}
- UINT16 => {int16, int32, int64}
- UINT32 => {int32, int64}
- UINT64 => {int64}
- FLOAT32 => {double}

As we use Arrow as the intermediate format, an external writer might have used the `LargeUtf8` or `LargeBinary` types instead of `Utf8` and `Binary`.
That is why we also need to support the following coercions for Arrow source types:
- `Utf8 | LargeUtf8` => {text}
- `Binary | LargeBinary` => {bytea}

Closes #67.
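
Purely to illustrate the coercion matrix above, here is a hedged Rust sketch of what such an `is_coercible(from_type, to_type)` check could look like. Modeling the Postgres destination types with arrow `DataType` (Utf8 standing in for `text`, Binary for `bytea`) is an assumption made for the example, not pg_parquet's actual signature:

```rust
use arrow::datatypes::DataType;

// Sketch only: destination Postgres types are modeled with arrow types here
// (Utf8 ~ text, Binary ~ bytea, Int64 ~ int64, Float64 ~ double, ...).
fn is_coercible(from: &DataType, to: &DataType) -> bool {
    use DataType::*;
    match (from, to) {
        (f, t) if f == t => true,                // exact match
        (Int16, Int32 | Int64) => true,          // widening integers
        (Int32, Int64) => true,
        (UInt16, Int16 | Int32 | Int64) => true, // unsigned into wider signed
        (UInt32, Int32 | Int64) => true,
        (UInt64, Int64) => true,
        (Float32, Float64) => true,              // float widening
        (LargeUtf8, Utf8) => true,               // strings from other writers
        (LargeBinary, Binary) => true,
        _ => false,
    }
}

fn main() {
    assert!(is_coercible(&DataType::LargeUtf8, &DataType::Utf8)); // the case from this issue
    assert!(is_coercible(&DataType::Int16, &DataType::Int64));
    assert!(!is_coercible(&DataType::Int64, &DataType::Int32)); // no narrowing
}
```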
aykut-bozkurt linked a pull request (#76) Nov 11, 2024 that will close this issue
aykut-bozkurt added a commit that referenced this issue Nov 14, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema.
e.g. an `INT32` type in the parquet schema cannot be read into a Postgres column of `int64` type.
We can avoid this by casting the arrow array to the array expected by the tupledesc schema, if the cast
is possible. We can make use of the `arrow-cast` crate, which lives in the same project as `arrow`. Its
public API lets us check whether a cast is possible between two arrow types and then perform the cast.

With that, we can cast between all allowed arrow types. Some examples:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Array, and Map with castable fields, e.g. [UINT16] => [INT64]

**Considerations**
- arrow-cast matches struct fields by position when casting, which differs from how we match table fields
  by name. This is why this PR does not allow casting structs yet.
- Some casts are allowed by arrow but not by Postgres, e.g. INT32 => DATE32 is possible in arrow but not
  in Postgres. This gives users much more flexibility, but some types can unexpectedly cast to different types.

Closes #67.
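
For reference, the check-then-cast pattern described above is available through arrow-cast's public API (re-exported via `arrow::compute`). A minimal sketch with made-up data, not pg_parquet's actual code:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int16Array};
use arrow::compute::{can_cast_types, cast};
use arrow::datatypes::DataType;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An INT16 column as it might arrive from the parquet file.
    let file_array: ArrayRef = Arc::new(Int16Array::from(vec![1i16, 2, 3]));

    // The arrow type derived from the table's tupledesc, e.g. a bigint column.
    let wanted = DataType::Int64;

    // arrow-cast can tell us whether the cast is supported before we attempt it.
    if can_cast_types(file_array.data_type(), &wanted) {
        let casted = cast(&file_array, &wanted)?;
        assert_eq!(casted.data_type(), &DataType::Int64);
    }

    Ok(())
}
```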
aykut-bozkurt added a commit that referenced this issue Nov 14, 2024
`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema.
e.g. an `INT32` type in the parquet schema cannot be read into a Postgres column of `int64` type.
We can avoid this by casting the arrow array to the array expected by the tupledesc schema, if the cast
is possible. We can make use of the `arrow-cast` crate, which lives in the same project as `arrow`. Its
public API lets us check whether a cast is possible between two arrow types and then perform the cast.

To make sure the cast is possible, we perform two checks:
1. arrow-cast allows the cast from the arrow type in the parquet file to the arrow type in the schema
   generated for the tupledesc,
2. the cast is meaningful in Postgres: there is an explicit cast from the Postgres type corresponding to
   the parquet file's arrow type to the Postgres type in the tupledesc.

With that, we can cast between many types, as shown below:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Struct, Array, and Map with castable fields, e.g. [UINT16] => [INT64] or struct {'x': UINT16} => struct {'x': INT64}

**NOTE**: Struct fields must match by name and position to be cast.

Closes #67.
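
A hedged sketch of the two-step check described in this commit message: `pg_allows_cast` below is a hypothetical stand-in for the Postgres-side check (the real check looks for an explicit cast between the corresponding Postgres types), and the destination types are again modeled with arrow types for the example:

```rust
use arrow::compute::can_cast_types;
use arrow::datatypes::DataType;

// Hypothetical stand-in for "there is an explicit cast between the corresponding
// Postgres types"; destinations are modeled with arrow types (Utf8 ~ text, ...).
fn pg_allows_cast(from: &DataType, to: &DataType) -> bool {
    use DataType::*;
    matches!(
        (from, to),
        (Int16, Int32 | Int64)
            | (Int32, Int64)
            | (Float32, Float64)
            | (LargeUtf8, Utf8)
            | (LargeBinary, Binary)
    )
}

// The cast is attempted only if both arrow-cast and Postgres would accept it.
fn cast_is_possible(file_type: &DataType, table_type: &DataType) -> bool {
    file_type == table_type
        || (can_cast_types(file_type, table_type) && pg_allows_cast(file_type, table_type))
}

fn main() {
    assert!(cast_is_possible(&DataType::LargeUtf8, &DataType::Utf8));
    // Allowed by arrow-cast but filtered out by the Postgres-side check.
    assert!(!cast_is_possible(&DataType::Int32, &DataType::Date32));
}
```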