`COPY FROM parquet` is too strict when matching the Postgres tupledesc schema to the parquet file schema, e.g. an `INT32` column in the parquet schema cannot be read into a Postgres column of type `int64`. We can avoid this situation by casting the arrow array to the array type expected by the tupledesc schema, if the cast is possible.

We can make use of the `arrow-cast` crate, which lives in the same project as `arrow`. Its public API lets us check whether a cast is possible between two arrow types and then perform the cast (a sketch follows below). With that we can cast between all allowed arrow types. Some examples:
- INT16 => INT32
- UINT32 => INT64
- FLOAT32 => FLOAT64
- LargeUtf8 => UTF8
- LargeBinary => Binary
- Array and Map with castable fields, e.g. [UINT16] => [INT64]

**Considerations**
- arrow-cast matches struct fields by position when casting, which is different from how we match table fields by name. This is why we do not allow casting structs yet in this PR.
- Some of the casts allowed by arrow are not allowed by Postgres, e.g. INT32 => DATE32 is possible in arrow but not in Postgres. This gives users much more flexibility, but some types can unexpectedly cast to different types.

Closes #67.
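A minimal sketch of the check-then-cast flow, assuming the `arrow` and `arrow-cast` crates as dependencies; the column values and the expected type below are illustrative only, not taken from the actual implementation:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use arrow_cast::{can_cast_types, cast};

fn main() {
    // Hypothetical example: the parquet column arrives as INT32,
    // but the tupledesc expects a Postgres int64 column.
    let parquet_array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let expected_type = DataType::Int64;

    // Only cast when arrow-cast reports the conversion is possible.
    if can_cast_types(parquet_array.data_type(), &expected_type) {
        let casted = cast(&parquet_array, &expected_type).expect("cast failed");
        assert_eq!(casted.data_type(), &DataType::Int64);
    }
}
```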