From e517ac4dbe08d518eb5c2e58576d4c711973db94 Mon Sep 17 00:00:00 2001
From: Antoine Pitrou
Date: Mon, 18 Mar 2024 11:41:22 +0100
Subject: [PATCH] PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data (#229)

---
 CHANGES.md                     | 6 ++++++
 Encodings.md                   | 5 +++--
 src/main/thrift/parquet.thrift | 7 +++++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 40020004..7bbce7c4 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -19,6 +19,12 @@
 
 # Parquet #
 
+### Version 2.11.0 ###
+
+#### New Feature
+
+* [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414) - Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data
+
 ### Version 2.10.0 ###
 
 #### New Feature
diff --git a/Encodings.md b/Encodings.md
index 5040094f..ea7e4e36 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -337,14 +337,15 @@ Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the re
 
 ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
 
-Supported Types: FLOAT, DOUBLE
+Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY
 
 This encoding does not reduce the size of the data but can lead to a significantly better
 compression ratio and speed when a compression algorithm is used afterwards.
 
 This encoding creates K byte-streams of length N where K is the size in bytes of the data
-type and N is the number of elements in the data sequence. Specifically, K is 4 for FLOAT
+type and N is the number of elements in the data sequence. For example, K is 4 for FLOAT
 type and 8 for DOUBLE type.
+
 The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
 0-th stream, the 1-st byte goes to the 1-st stream and so on.
 The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 2084ac63..27d40437 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -526,12 +526,15 @@ enum Encoding {
    */
   RLE_DICTIONARY = 8;
 
-  /** Encoding for floating-point data.
+  /** Encoding for fixed-width data (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY).
       K byte-streams are created where K is the size in bytes of the data type.
-      The individual bytes of an FP value are scattered to the corresponding stream and
+      The individual bytes of a value are scattered to the corresponding stream and
       the streams are concatenated.
       This itself does not reduce the size of the data but can lead to better compression
       afterwards.
+
+      Added in 2.8 for FLOAT and DOUBLE.
+      Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11.
    */
   BYTE_STREAM_SPLIT = 9;
 }
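
Reviewer note: the scatter-and-concatenate layout described in the Encodings.md hunk can be sketched in a few lines of Python. This is an illustration only, not code from parquet-format or any Parquet implementation. The function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and the sketch assumes the input is the plain fixed-width byte representation of N values, each K bytes wide (little-endian for INT32 in the example).

```python
import struct


def byte_stream_split_encode(data: bytes, k: int) -> bytes:
    """Scatter byte j of every K-byte value into stream j, then concatenate streams 0..K-1."""
    if k <= 0 or len(data) % k != 0:
        raise ValueError("data length must be a positive multiple of k")
    # data[j::k] picks byte j of each value, i.e. the j-th stream of length N.
    return b"".join(data[j::k] for j in range(k))


def byte_stream_split_decode(encoded: bytes, k: int) -> bytes:
    """Inverse of the encoder: interleave the K streams back into N values."""
    if k <= 0 or len(encoded) % k != 0:
        raise ValueError("encoded length must be a positive multiple of k")
    n = len(encoded) // k  # number of values, which is also the length of each stream
    out = bytearray(len(encoded))
    for j in range(k):
        # Stream j occupies bytes [j*n, (j+1)*n) and fills every k-th slot starting at j.
        out[j::k] = encoded[j * n:(j + 1) * n]
    return bytes(out)


# Three INT32 values in little-endian byte order, so K = 4 and N = 3.
plain = struct.pack("<3i", 0x11223344, 0x55667788, 0x01020304)
encoded = byte_stream_split_encode(plain, 4)
print(encoded.hex())  # 448804337703226602115501
assert byte_stream_split_decode(encoded, 4) == plain
```

The encoded output has the same length as the input, matching the statement that the encoding does not itself reduce the data size. The same two functions work for any fixed width, for instance K = 8 for INT64 or DOUBLE. For FIXED_LEN_BYTE_ARRAY the sketch assumes K is the declared byte length of the type.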