PARQUET-2310: implementation status #34

Merged
104 changes: 104 additions & 0 deletions content/en/docs/File Format/implementationstatus.md
@@ -0,0 +1,104 @@
---
title: "Implementation status"
linkTitle: "Implementation status"
weight: 8
---
Member

Maybe add a legend:
✅ supported
❌ not supported
[blank] no data

The main goal is to make clear the difference between missing information and an unsupported feature.

Collaborator

Proposed addition (targeting this PR) in alippai#1
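
A minimal sketch of the legend suggested above, assuming a hypothetical three-state value per table cell (`True` for supported, `False` for not supported, `None` for no data). The function names are illustrative only and are not part of this PR or of any Parquet tooling.

```python
from typing import List, Optional


def status_cell(supported: Optional[bool]) -> str:
    """Map a three-state status to the legend symbol (hypothetical helper)."""
    if supported is None:
        return ""  # blank cell: no data collected yet
    return "✅" if supported else "❌"


def render_row(feature: str, statuses: List[Optional[bool]]) -> str:
    """Render one markdown table row: feature name, then one cell per implementation."""
    cells = [feature] + [status_cell(s) for s in statuses]
    return "| " + " | ".join(cells) + " |"


print(render_row("BOOLEAN", [True, False, None]))
```

Keeping "no data" as a separate `None` state, rather than folding it into `False`, is what lets the rendered table leave a cell blank instead of claiming a feature is unsupported.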

### Physical types

| Data type | C++ | Java | Go | Rust |
Member

@pitrou, Jun 11, 2024

I'm not sure that simply naming our implementations "C++", "Java", etc. is very forward-looking, because at some point e.g. DuckDB might want to add their own info here, and they're also written in C++.

Besides, Parquet C++ is also available in Python using PyArrow, in R using R Arrow, and perhaps even in C and Ruby using the GLib bindings.

That said, we can also decide to rename the columns later.

Collaborator

I agree -- let's rename the columns later (as we fill out the details) with names that can be mapped to the implementation (e.g. `parquet-cpp`, `parquet-java`, etc.)

Member

Let's add a paragraph at the top with a pointer to each implementation (Java, Go, C++, Rust, ...). That will make it easy to add more implementations and clarify which one we're talking about.

Collaborator

Proposed addition (targeting this PR) in alippai#1

| ----------------------------------------- | ----- | ------ | ----- | ----- |
| BOOLEAN | | | | |
| INT32 | | | | |
| INT64 | | | | |
| INT96 (1) | | | | |
| FLOAT | | | | |
| DOUBLE | | | | |
| BYTE_ARRAY | | | | |
| FIXED_LEN_BYTE_ARRAY | | | | |

* \(1) This type is deprecated, but as of 2024 it's common in currently produced Parquet files


### Logical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| STRING | | | | |
| ENUM | | | | |
| UUID | | | | |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | |
| DECIMAL (INT32) | | | | |
| DECIMAL (INT64) | | | | |
| DECIMAL (BYTE_ARRAY) | | | | |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | |
| DATE | | | | |
| TIME (INT32) | | | | |
| TIME (INT64) | | | | |
| TIMESTAMP (INT64) | | | | |
| INTERVAL | | | | |
| JSON | | | | |
| BSON | | | | |
| LIST | | | | |
| MAP | | | | |
| UNKNOWN (always null) | | | | |
| FLOAT16 | | | | |

### Encodings

| Encoding | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| PLAIN | | | | |
| PLAIN_DICTIONARY | | | | |
| RLE_DICTIONARY | | | | |
| RLE | | | | |
| BIT_PACKED (deprecated) | | | | |
| DELTA_BINARY_PACKED | | | | |
| DELTA_LENGTH_BYTE_ARRAY | | | | |
| DELTA_BYTE_ARRAY | | | | |
| BYTE_STREAM_SPLIT | | | | |
Contributor

Should this be split into float/double and int/fixed_len_byte_array, or just use notes if an implementation doesn't yet support the expanded set of data types?

Member

If we're going to give dates or version numbers as @emkornfield suggested, then this should be split into separate lines.
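
For context on the question above, here is a rough sketch of the BYTE_STREAM_SPLIT transform, assuming values arrive as a flat buffer of fixed-width little-endian items. The function names are hypothetical and this is not any implementation's actual code. The transform depends only on the value width, which is why the same encoding can apply to FLOAT and DOUBLE as well as to integer and FIXED_LEN_BYTE_ARRAY columns.

```python
import struct


def byte_stream_split_encode(values: bytes, width: int) -> bytes:
    """Scatter byte k of every fixed-width value into stream k, then concatenate the streams."""
    if len(values) % width != 0:
        raise ValueError("input length must be a multiple of the value width")
    return b"".join(values[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse transform: interleave the streams back into fixed-width values."""
    if len(encoded) % width != 0:
        raise ValueError("input length must be a multiple of the value width")
    count = len(encoded) // width
    streams = [encoded[k * count:(k + 1) * count] for k in range(width)]
    return bytes(b for group in zip(*streams) for b in group)


# Round trip for 4-byte floats; the same calls work for any other width.
floats = struct.pack("<3f", 1.0, 2.0, 3.0)
assert byte_stream_split_decode(byte_stream_split_encode(floats, 4), 4) == floats
```

Whether a given implementation accepts the encoding for a given physical type is a separate question from the transform itself, which is the case for splitting the row or annotating it with notes.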


### Compressions

| Compression | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| UNCOMPRESSED | | | | |
| BROTLI | | | | |
| GZIP | | | | |
| LZ4 (deprecated) | | | | |
| LZ4_RAW | | | | |
| LZO | | | | |
| SNAPPY | | | | |
| ZSTD | | | | |

### Other format level features

| Feature | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| xxxHash-based bloom filters | | | | |
| Bloom filter length (1) | | | | |
| Statistics min_value, max_value | | | | |
| Page index | | | | |
| Page CRC32 checksum | | | | |
| Modular encryption | | | | |
| Size statistics (2) | | | | |


* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length

* \(2) In parquet.thrift: ColumnMetaData->size_statistics

### High level data APIs for Parquet feature usage

| Feature | C++ | Java | Go | Rust |
| -------------------------------------------- | ----- | ------ | ----- | ----- |
| External column data (1) | | | | |
| Row group "Sorting column" metadata (2) | | | | |
| Row group pruning using statistics | | | | |
| Reading select columns only | | | | |
| Page pruning using statistics | | | | |
| Page pruning using bloom filter | | | | |


* \(1) In parquet.thrift: ColumnChunk->file_path

* \(2) In parquet.thrift: RowGroup->sorting_columns
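
To illustrate what "Row group pruning using statistics" refers to, here is a minimal sketch assuming each row group exposes a min and max for the filtered column. `RowGroupStats` and `prune_row_groups` are hypothetical names used for illustration, not the API of any listed implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    # Hypothetical stand-in for the per-row-group min/max statistics of one column.
    index: int
    min_value: int
    max_value: int


def prune_row_groups(groups: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Return indices of row groups whose [min, max] range overlaps the filter range [lo, hi]."""
    return [g.index for g in groups if g.max_value >= lo and g.min_value <= hi]


groups = [RowGroupStats(0, 1, 10), RowGroupStats(1, 11, 20), RowGroupStats(2, 21, 30)]
print(prune_row_groups(groups, 12, 22))  # row group 0 can be skipped without reading it
```

Pruning is conservative: a kept row group may still contain no matching rows, but a skipped one cannot contain any.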