PARQUET-2310: implementation status #34
Changes from 3 commits
2db6877
b0640f3
b001576
b30536f
6191e82
@@ -0,0 +1,104 @@
---
title: "Implementation status"
linkTitle: "Implementation status"
weight: 8
---
### Physical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| BOOLEAN | | | | |
| INT32 | | | | |
| INT64 | | | | |
| INT96 (1) | | | | |
| FLOAT | | | | |
| DOUBLE | | | | |
| BYTE_ARRAY | | | | |
| FIXED_LEN_BYTE_ARRAY | | | | |

* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files

Review comments on the table header row:

> I'm not sure that simply naming our implementations "C++", "Java", etc. is very forward-looking, because at some point e.g. DuckDB might want to add their own info here, and they're also written in C++. Besides, Parquet C++ is also available in Python using PyArrow, in R using R Arrow, and perhaps even in C and Ruby using the GLib bindings. That said, we can also decide to rename the columns later.

> I agree -- let's rename the columns later (as we fill out the details) with some name that can be mapped to the implementation (

> Let's add a paragraph at the top with a pointer to each implementation (Java, go, cpp, rust, ...) that will make it easy to add more implementations and clarify which one we're talking about.

> Proposed addition (targeting this PR) in alippai#1

### Logical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| STRING | | | | |
| ENUM | | | | |
| UUID | | | | |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | |
| DECIMAL (INT32) | | | | |
| DECIMAL (INT64) | | | | |
| DECIMAL (BYTE_ARRAY) | | | | |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | |
| DATE | | | | |
| TIME (INT32) | | | | |
| TIME (INT64) | | | | |
| TIMESTAMP (INT64) | | | | |
| INTERVAL | | | | |
| JSON | | | | |
| BSON | | | | |
| LIST | | | | |
| MAP | | | | |
| UNKNOWN (always null) | | | | |
| FLOAT16 | | | | |

### Encodings

| Encoding | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| PLAIN | | | | |
| PLAIN_DICTIONARY | | | | |
| RLE_DICTIONARY | | | | |
| RLE | | | | |
| BIT_PACKED (deprecated) | | | | |
| DELTA_BINARY_PACKED | | | | |
| DELTA_LENGTH_BYTE_ARRAY | | | | |
| DELTA_BYTE_ARRAY | | | | |
| BYTE_STREAM_SPLIT | | | | |

Review comments on the `BYTE_STREAM_SPLIT` row:

> Should this be split into float/double and int/fixed_len_byte_array, or just use notes if an implementation doesn't yet support the expanded set of data types?

> If we're going to give dates or version numbers as @emkornfield suggested, then this should be split into separate lines.

### Compressions

| Compression | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| UNCOMPRESSED | | | | |
| BROTLI | | | | |
| GZIP | | | | |
| LZ4 (deprecated) | | | | |
| LZ4_RAW | | | | |
| LZO | | | | |
| SNAPPY | | | | |
| ZSTD | | | | |

### Other format level features

| | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| xxxHash-based bloom filters | | | | |
| Bloom filter length (1) | | | | |
| Statistics min_value, max_value | | | | |
| Page index | | | | |
| Page CRC32 checksum | | | | |
| Modular encryption | | | | |
| Size statistics (2) | | | | |

* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length

* \(2) In parquet.thrift: ColumnMetaData->size_statistics

### High level data APIs for Parquet feature usage

| Format | C++ | Java | Go | Rust |
| -------------------------------------------- | ----- | ------ | ----- | ----- |
| External column data (1) | | | | |
| Row group "Sorting column" metadata (2) | | | | |
| Row group pruning using statistics | | | | |
| Reading select columns only | | | | |
| Page pruning using statistics | | | | |
| Page pruning using bloom filter | | | | |

* \(1) In parquet.thrift: ColumnChunk->file_path

* \(2) In parquet.thrift: RowGroup->sorting_columns
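
For readers unfamiliar with the "Row group pruning using statistics" row, the idea can be sketched in a few lines of Python. This is an illustration only and not part of the proposed page: `ColumnStats`, `may_contain`, and `prune_row_groups` are hypothetical names, integer columns are assumed, and no real Parquet library is used. A row group is skipped only when its min/max statistics prove that no value can match the filter range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column chunk's min_value/max_value statistics."""
    min_value: Optional[int]
    max_value: Optional[int]


def may_contain(stats: ColumnStats, lo: int, hi: int) -> bool:
    """Return False only when the statistics prove no value lies in [lo, hi]."""
    if stats.min_value is None or stats.max_value is None:
        # Missing statistics: the row group cannot be safely skipped.
        return True
    return stats.max_value >= lo and stats.min_value <= hi


def prune_row_groups(row_groups: List[ColumnStats], lo: int, hi: int) -> List[int]:
    """Return the indices of the row groups that still have to be read."""
    return [i for i, stats in enumerate(row_groups) if may_contain(stats, lo, hi)]
```

Pruning is conservative by design: a row group without statistics is always kept, so a reader that ignores statistics entirely still returns correct results.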

Review comments:

> Maybe add a legend:
>
> - ✅ supported
> - ❌ not supported
> - [blank] no data
>
> The main goal is to clarify the difference between missing information and an unsupported feature.

> Proposed addition (targeting this PR) in alippai#1
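
The three-state legend suggested above could be applied mechanically when rendering a table row. The following is a minimal sketch under assumptions: `LEGEND` and `render_row` are hypothetical names, and the status values in the usage note are placeholders, not actual implementation status. `True` maps to supported, `False` to not supported, and a missing entry or `None` to a blank cell meaning "no data".

```python
from typing import Dict, List, Optional

# Hypothetical encoding of the legend: True = supported,
# False = not supported, None = no data (rendered as a blank cell).
LEGEND = {True: "✅", False: "❌", None: ""}


def render_row(feature: str,
               status: Dict[str, Optional[bool]],
               implementations: List[str]) -> str:
    """Render one markdown table row; implementations without an entry stay blank."""
    cells = [LEGEND[status.get(impl)] for impl in implementations]
    return "| " + " | ".join([feature] + cells) + " |"
```

For example, `render_row("EXAMPLE_FEATURE", {"C++": True, "Java": False}, ["C++", "Java", "Go", "Rust"])` yields a row with a check mark, a cross, and two blank cells, which keeps "not supported" visibly distinct from "no information yet".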