
PARQUET-2310: implementation status #34

Merged

Conversation

@alippai (Contributor) commented Jun 20, 2023

Moved from: apache/arrow#36027

I can only agree with @westonpace:

Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

The original goal was not to copy https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but something quite orthogonal to it: showing which high-level APIs are available to the end user.

What I mean by this is that, based on the document, I should be able to tell whether I can manipulate Parquet files without decoding some values, or whether I have to convert them to actual objects / an Apache Arrow table first. This is the information missing from all the docs, and I'd like to maintain something which reflects the public API functionality.

Many of you were surprised by the inclusion of metadata; what I was referring to was the availability of https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata.
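For illustration, a minimal pyarrow sketch (with `example.parquet` and `some_column` as placeholder names) of how this metadata is surfaced without decoding any column data:

```python
import pyarrow.parquet as pq

# Read only the Parquet footer; no column data is decoded.
schema = pq.read_schema("example.parquet")

# Schema-level key/value metadata from the footer (a dict of bytes, or None).
print(schema.metadata)

# Field-level metadata for an individual column (a dict of bytes, or None).
print(schema.field("some_column").metadata)
```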

Something similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the Parquet reader, that's useful information for Apache Arrow and Parquet developers, but not for library users.

This is a limitation of the pyarrow API; it might be temporary or by design, but my intention was to make it transparent.
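To make the point concrete, here is a rough sketch of the user-facing pyarrow call in question (file name, column and value are placeholders); whether the Column Index is consulted underneath is exactly the part that isn't visible from this API:

```python
import pyarrow.parquet as pq

# Filters are expressed against the public API; whether statistics, the
# Column Index, or plain decoding is used to apply them is an internal detail.
table = pq.read_table(
    "example.parquet",                     # placeholder file name
    filters=[("some_column", ">=", 100)],  # placeholder column and value
)
```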

It's absolutely OK if parquet-mr doesn't have a dataset API using the Parquet features, and it's OK for pyarrow never to expose the raw objects / byte arrays. Different ecosystems have different abstractions, and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing: DataFusion, pyarrow, Acero, Iceberg etc. all have different communities complementing each other.

Let me know if this is too vague or doesn't belong in the standard docs (or help me achieve a concise summary reflecting the above).

@pitrou (Member) commented Jun 21, 2023

@alippai Perhaps you'd like to start filling in the table for Rust?

@alippai (Contributor, author) commented Jun 21, 2023

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge and from online resources.

@alamb (Collaborator) commented May 7, 2024

I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge and from online resources.

I can certainly fill the one in for Rust

@alippai (Contributor, author) commented May 7, 2024

I’ll open a new draft later this week

@alamb (Collaborator) commented May 11, 2024

FYI #53 is a related conversation. Once that PR merges, perhaps there will be a more natural location for this chart.

@alippai force-pushed the PARQUET-2310-implementation-status branch 3 times, most recently from 9ad826b to 6b26e2e on June 2, 2024 01:56
@alippai force-pushed the PARQUET-2310-implementation-status branch from 6b26e2e to 2db6877 on June 2, 2024 01:58

@alamb (Collaborator) left a comment

Thank you @alippai

I have a few suggestions

  1. Start the initial PR with a smaller subset of features (maybe with read/write physical and logical types and compression)
  2. Consider distinguishing read and read/write support for the different features (as I think some implementations can read but not create certain encodings or other features)
  3. Define terms: define the different implementations (e.g. link to what the "Java" implementation means) and define what is required for an implementation to be included in these charts. Perhaps use the list on https://parquet.apache.org/docs/overview/

Smaller Subset

The rationale for a smaller subset is that it will be easier to get consensus. For example, we won't have to try and precisely describe what "modular encryption support" means

"Support"

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I think there is an eventual goal to have an automated integration test (e.g. @pitrou 's suggestion in the mailing list https://lists.apache.org/thread/xjg1zrhll5ztqr4n5lv92f6tfwvfb8rm)

However, before that, perhaps we could define something less formal, like being able to read the values of certain files in https://github.com/apache/parquet-testing (e.g. alltypes_plain.parquet).
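For example, such an informal check could be as small as successfully reading one of the reference files; a pyarrow sketch (assuming parquet-testing has been checked out locally; other implementations would do the equivalent in their own API):

```python
import pyarrow.parquet as pq

# Informal "can read" check against a reference file from apache/parquet-testing.
table = pq.read_table("parquet-testing/data/alltypes_plain.parquet")
print(table.schema)
print(table.num_rows)
```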

@alippai (Contributor, author) commented Jun 4, 2024

Totally agree, thanks for the guidance. What do you think about the non-Apache / other projects? (DuckDB, fastparquet, Impala, cuDF)

@pitrou (Member) commented Jun 4, 2024

IMHO, any currently maintained open-source implementation of Parquet deserves a mention there. But that also requires involvement from their respective maintainers (we Parquet maintainers shouldn't be expected to make sure the information is up to date).

@alamb (Collaborator) commented Jun 4, 2024

Totally agree, thanks for the guidance. What do you think about the non-Apache / other projects? (DuckDB, fastparquet, Impala, cuDF)

Echoing what @pitrou said, I suggest we add a note somewhere that "this table includes any open source currently maintained implementation of parquet whose maintainers have helped fill it out. If you wish to add a new implementation to this table, please open a PR to do so"

It might also be worth adding a column with some sort of placeholder (? for example) for those implementations (as a way of encouraging their help). However, that might be a good thing to do as a follow-on PR as well.

@alippai (Contributor, author) commented Jun 4, 2024

I like both!

@emkornfield (Contributor) commented:

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported, it is a check. If they think reasonable people might be confused, it gets a '-' or a check with a footnote to explain the exception. If it is completely not supported, it gets an 'X'. We can always adjust the criteria as we get feedback.

@alamb (Collaborator) commented Jun 6, 2024

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported, it is a check. If they think reasonable people might be confused, it gets a '-' or a check with a footnote to explain the exception. If it is completely not supported, it gets an 'X'. We can always adjust the criteria as we get feedback.

I agree that it would be best to start with a relatively lax, low barrier to entry for self reporting

Over time we can add more stringency (ideally with automated checks) if/when that would add value

@mhaseeb123 commented:

IMHO, any currently maintained open-source implementation of Parquet deserves a mention there. But that also requires involvement from their respective maintainers (we Parquet maintainers shouldn't be expected to make sure the information is up to date).

Agreed with that. I would be happy to contribute on cuDF's behalf!

@alamb (Collaborator) left a comment

Thank you so much @alippai and everyone else who has contributed to this PR.

It is my opinion that we should try and address any outstanding comments that are straightforward, but then merge this PR even if there are other more substantial ones unresolved (like guidance https://github.com/apache/parquet-site/pull/34/files#r1635005006).

We can then iteratively add content over several more PRs and I think some of the comments will be easier to address as we start trying to fill out the chart for implementations.

I worry that we will run out of ambition if we try to complete this page on the first try (the fact that this PR has been open for more than a year is telling).

I would be happy to help write a preamble and/or try to fill out this table for the Rust implementation, for example, as a follow-on PR.

This is so exciting to see. I rendered this page locally and it looks good.
[screenshot of the locally rendered page, June 14, 2024]

Review thread on content/en/docs/File Format/implementationstatus.md (outdated, resolved)
@emkornfield (Contributor) commented:

It is my opinion that we should try and address any outstanding comments that are straightforward, but then merge this PR even if there are other more substantial ones unresolved (like guidance https://github.com/apache/parquet-site/pull/34/files#r1635005006).

I agree with this sentiment. @alippai do you have bandwidth to do another pass? I can merge after another pass (or do so now if you don't have bandwidth for an update).

@alippai (Contributor, author) commented Jun 18, 2024

Yes, the input is great and I’ll give it a go tomorrow

@xhochy (Member) commented Jun 18, 2024

Please mark PRs that are ready for review as such. I had not given this a review as it showed up as a draft in my notifications. I will do so now, within the next 2h.

@alippai (Contributor, author) commented Jun 20, 2024

I've learned a ton, and I see that many of us ran into a few surprises. Looks like the page is indeed needed :)

Following the advice a few of you raised, I removed any controversial or overly detailed items; we'll handle them in a later PR.

Most of the comments are addressed and if we are happy with the changes, I'll:

  1. Rename the existing and add the additional implementations (DuckDB, Impala, Fastparquet etc)
  2. Fill them out with initial values and reach out to the communities (if not present here already)
  3. Create issues for the follow-up work (like the read/write distinction, query engine or dataset features, and date- or version-based grouping of the features)

@alippai marked this pull request as ready for review on June 20, 2024 03:21
@etseidl (Contributor) commented Jun 20, 2024

@alippai thanks for the updates! I agree with @alamb and @emkornfield that it's the right time to merge this and let the various implementations take a crack at filling in their columns. Questions that arise while doing that can drive the next iteration.

@emkornfield (Contributor) commented:

@alippai this sounds like a good plan. I think if you are going to fill out some columns it might make sense to do so in this PR to avoid merge conflicts. It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

@etseidl (Contributor) commented Jun 21, 2024

It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

Actually @mhaseeb123 will be handling cuDF. I'm just an interested bystander 😅.

@mhaseeb123 commented:

It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

Actually @mhaseeb123 will be handling cuDF. I'm just an interested bystander 😅.

I would be happy to contribute on cuDF's behalf. I am just not fully sure whether we are changing the column names from language names (C++, Python, etc.) to implementation names (Arrow, cuDF, etc.) as @pitrou suggested, or filling in the rows with the implementation names that support a feature in that language.

@alippai (Contributor, author) commented Jun 25, 2024

It seems there are no objections. Let me do one more round tomorrow with the updates mentioned.

@alamb (Collaborator) left a comment

Let's get this PR merged!

I started sketching out what a parquet integration suite might look like here: apache/arrow-rs#5956

Once this PR is merged, I will work on honing the integration suite and use that to fill out this table.

---
### Physical types

| Data type | C++ | Java | Go | Rust |

A Collaborator left a comment:

I agree -- let's rename the columns later (as we fill out the details) with some name that can be mapped to the implementation (e.g. parquet-cpp, parquet-java, etc.)

title: "Implementation status"
linkTitle: "Implementation status"
weight: 8
---

A Member left a comment:

Maybe add a legend:
✅ supported
❌ not supported
[blank] no data

The main goal is to clarify the difference between missing information and an unsupported feature.

A Collaborator left a comment:

Proposed addition (targeting this PR) in alippai#1

@alamb (Collaborator) commented Jun 26, 2024

I tried to address @julienledem's comments in alippai#1

Add introduction paragraph and legend

@mhaseeb123 commented Jun 27, 2024

Hi @alamb, thank you for all the work and for addressing the comments. I saw you added a legend and some description in alippai#1 but I don't see cuDF in there and also am not sure how to exactly fill the table for cuDF. This is confusing especially as cuDF provides C++, Python and (limited? @etseidl) Java Parquet interfaces with GPU-accelerated backends.

Please do note that I am in favor of merging the PR and making things live for now, and thinking about this later in a separate PR as well.

@emkornfield (Contributor) commented:

@alippai were you planning on adding anything else here or should we merge?

@alamb (Collaborator) commented Jun 28, 2024

I saw you added a legend and some description in alippai#1 but I don't see cuDF in there and also am not sure how to exactly fill the table for cuDF. This is confusing especially as cuDF provides C++, Python and (limited? @etseidl) Java Parquet interfaces with GPU-accelerated backends.

My suggestion is to add a column/columns for cuDF in a future PR when you have information to fill out.

I think the intent of these tables is to describe the Parquet implementations (so it lists parquet-cpp, not its Python bindings in pyarrow, for example). Perhaps the same model would apply to the Parquet reader in cuDF.

In terms of how to fill out the table, we are discussing various options for "what does it mean to support a feature" on apache/parquet-format#441.

@emkornfield (Contributor) commented:

@alippai thank you for getting this started, I'm going to merge now to let others try to fill it out.

@emkornfield merged commit 19eb00f into apache:production on Jul 4, 2024