
PARQUET-2310: implementation status #34

Merged

Conversation

@alippai (Contributor) commented Jun 20, 2023

Moved from: apache/arrow#36027

I can only agree with @westonpace:

Although...to play devil's advocate...it might be odd when a feature is available in the parquet reader, but not yet exposed in the query component. For example, there is some row skipping and bloom filters in the C++ parquet reader, but we haven't integrated those into the datasets layer yet.

The original goal was not to copy https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but something quite orthogonal to it: showing which high-level APIs are available to the end user.

What I mean by this is that, based on the document, I should be able to tell whether I can manipulate Parquet files without decoding some values, or whether I have to convert them to actual objects / an Apache Arrow table first. This is the information missing from all the docs, and I'd like to maintain something which reflects the public API functionality.

Many of you were surprised by the inclusion of metadata; what I was referring to was the availability of https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata.
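For illustration, a minimal pyarrow sketch (with `example.parquet` and `some_column` as placeholder names) of how this metadata is surfaced without decoding any column data:

```python
import pyarrow.parquet as pq

# Read only the Parquet footer; no column data is decoded.
schema = pq.read_schema("example.parquet")

# Schema-level key/value metadata from the footer (a dict of bytes, or None).
print(schema.metadata)

# Field-level metadata for an individual column (a dict of bytes, or None).
print(schema.field("some_column").metadata)
```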

Something similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the Parquet reader, that's useful information for Apache Arrow and Parquet developers, but not for library users.

This is a limitation of the pyarrow API; it might be temporary or by design, but my intention was to make it transparent.
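To make the point concrete, here is a rough sketch of the user-facing pyarrow call in question (file name, column and value are placeholders); whether the Column Index is consulted underneath is exactly the part that isn't visible from this API:

```python
import pyarrow.parquet as pq

# Filters are expressed against the public API; whether statistics, the
# Column Index, or plain decoding is used to apply them is an internal detail.
table = pq.read_table(
    "example.parquet",                     # placeholder file name
    filters=[("some_column", ">=", 100)],  # placeholder column and value
)
```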

It's absolutely OK if parquet-mr doesn't have a dataset API using the Parquet features, and it's OK for pyarrow never to expose the raw objects / byte arrays. Different ecosystems have different abstractions, and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing: DataFusion, pyarrow, Acero, Iceberg etc. all have different communities complementing each other.

Let me know if this is too vague or doesn't belong in the standard docs (or help me achieve a concise summary reflecting the above).

@pitrou (Member) commented Jun 21, 2023

@alippai Perhaps you'd like to start filling in the table for Rust?

@alippai (Contributor, author) commented Jun 21, 2023

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge and from online resources.

@alamb (Collaborator) commented May 7, 2024

I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?

Thanks @pitrou for the comments. Yes, I'll address them and fill in the table to the best of my knowledge and from online resources.

I can certainly fill the one in for Rust

@alippai (Contributor, author) commented May 7, 2024

I’ll open a new draft later this week

@alamb (Collaborator) commented May 11, 2024

FYI #53 is a related conversation. Once that PR merges, perhaps there will be a more natural location for this chart.

@alippai force-pushed the PARQUET-2310-implementation-status branch 3 times, most recently from 9ad826b to 6b26e2e on June 2, 2024 01:56
@alippai force-pushed the PARQUET-2310-implementation-status branch from 6b26e2e to 2db6877 on June 2, 2024 01:58

@alamb (Collaborator) left a comment

Thank you @alippai

I have a few suggestions

  1. Start the initial PR with a smaller subset of features (maybe with read/write physical and logical types and compression)
  2. Consider distinguishing read and read/write support for the different features (as I think some implementations can read but not create certain encodings or other features)
  3. Define terms: define the different implementations (e.g. link to what the "Java" implementation means) and define what is required for an implementation to be included in these charts. Perhaps use the list on https://parquet.apache.org/docs/overview/

Smaller Subset

The rationale for a smaller subset is that it will be easier to get consensus. For example, we won't have to try and precisely describe what "modular encryption support" means

"Support"

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I think there is an eventual goal to have an automated integration test (e.g. @pitrou 's suggestion in the mailing list https://lists.apache.org/thread/xjg1zrhll5ztqr4n5lv92f6tfwvfb8rm)

However, before that, perhaps we could define something less formal, like being able to read the values of certain files in https://github.com/apache/parquet-testing (e.g. alltypes_plain.parquet).
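For example, such an informal check could be as small as successfully reading one of the reference files; a pyarrow sketch (assuming parquet-testing has been checked out locally; other implementations would do the equivalent in their own API):

```python
import pyarrow.parquet as pq

# Informal "can read" check against a reference file from apache/parquet-testing.
table = pq.read_table("parquet-testing/data/alltypes_plain.parquet")
print(table.schema)
print(table.num_rows)
```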

@alippai (Contributor, author) commented Jun 4, 2024

Totally agree, thanks for the guidance. What do you think about the non-Apache / other projects? (DuckDB, fastparquet, Impala, cuDF)

@pitrou (Member) commented Jun 4, 2024

IMHO, any currently maintained open-source implementation of Parquet deserves a mention there. But that also requires involvement from their respective maintainers (we Parquet maintainers shouldn't be expected to make sure the information is up to date).

@alamb (Collaborator) commented Jun 4, 2024

Totally agree, thanks for the guidance. What do you think about the non-Apache / other projects? (DuckDB, fastparquet, Impala, cuDF)

Echoing what @pitrou said, I suggest we add a note somewhere that "this table includes any open source currently maintained implementation of parquet whose maintainers have helped fill it out. If you wish to add a new implementation to this table, please open a PR to do so"

It might also be worth adding a column with some sort of placeholder (? for example) for those implementations (as a way of encouraging their help). However, that might be a good thing to do as a follow-on PR as well.

@alippai (Contributor, author) commented Jun 4, 2024

I like both!

@emkornfield (Contributor) commented:

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported, it is a check. If they think reasonable people might be confused, it gets a '-' or a check with a footnote to explain the exception. If it is completely not supported, it gets an 'X'. We can always adjust the criteria as we get feedback.

@alamb (Collaborator) commented Jun 6, 2024

I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.

I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported, it is a check. If they think reasonable people might be confused, it gets a '-' or a check with a footnote to explain the exception. If it is completely not supported, it gets an 'X'. We can always adjust the criteria as we get feedback.

I agree that it would be best to start with a relatively lax, low barrier to entry for self reporting

Over time we can add more stringency (ideally with automated checks) if/when that would add value

@mhaseeb123 commented:

IMHO, any currently maintained open-source implementation of Parquet deserves a mention there. But that also requires involvement from their respective maintainers (we Parquet maintainers shouldn't be expected to make sure the information is up to date).

Agreed with that. I would be happy to contribute on cuDF's behalf!

@alamb (Collaborator) left a comment

Thank you so much @alippai and everyone else who has contributed to this PR.

It is my opinion that we should try and address any outstanding comments that are straightforward, but then merge this PR even if there are other more substantial ones unresolved (like guidance https://github.com/apache/parquet-site/pull/34/files#r1635005006).

We can then iteratively add content over several more PRs and I think some of the comments will be easier to address as we start trying to fill out the chart for implementations.

I worry that we will run out of ambition if we try to complete this page on the first try (the fact that this PR has been open for more than a year is telling).

I would be happy to help write a preamble and/or try to fill out this table for the Rust implementation, for example, as a follow-on PR.

This is so exciting to see. I rendered this page locally and it looks good.
[screenshot of the locally rendered page, June 14, 2024]

Review thread on content/en/docs/File Format/implementationstatus.md (outdated, resolved)
@emkornfield (Contributor) commented:

It is my opinion that we should try and address any outstanding comments that are straightforward, but then merge this PR even if there are other more substantial ones unresolved (like guidance https://github.com/apache/parquet-site/pull/34/files#r1635005006).

I agree with this sentiment. @alippai do you have bandwidth to do another pass? I can merge after another pass (or do so now if you don't have bandwidth for an update).

@alippai (Contributor, author) commented Jun 18, 2024

Yes, the input is great and I’ll give it a go tomorrow

@xhochy (Member) commented Jun 18, 2024

Please mark PRs that are ready for review as such. I had not given this a review as it showed up as a draft in my notifications. I will do so now, within the next 2h.

@alippai (Contributor, author) commented Jun 20, 2024

I've learned a ton, and I see that many of us ran into a few surprises. Looks like the page is indeed needed :)

Following the advice a few of you raised, I removed any controversial or overly detailed items; we'll handle them in a later PR.

Most of the comments are addressed and if we are happy with the changes, I'll:

  1. Rename the existing and add the additional implementations (DuckDB, Impala, Fastparquet etc)
  2. Fill them out with initial values and reach out to the communities (if not present here already)
  3. Create issues for the follow-up work (like the read/write distinction, query engine or dataset features, and date- or version-based grouping of the features)

@alippai marked this pull request as ready for review on June 20, 2024 03:21
@etseidl (Contributor) commented Jun 20, 2024

@alippai thanks for the updates! I agree with @alamb and @emkornfield that it's the right time to merge this and let the various implementations take a crack at filling in their columns. Questions that arise while doing that can drive the next iteration.

@emkornfield (Contributor) commented:

@alippai this sounds like a good plan. I think if you are going to fill out some columns it might make sense to do so in this PR to avoid merge conflicts. It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

@etseidl (Contributor) commented Jun 21, 2024

It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

Actually @mhaseeb123 will be handling cuDF. I'm just an interested bystander 😅.

@mhaseeb123 commented:

It sounds like @alamb and @etseidl want to update for Rust and CUDF respectively but hopefully can coordinate to avoid merge conflicts.

Actually @mhaseeb123 will be handling cuDF. I'm just an interested bystander 😅.

I would be happy to contribute on cuDF's behalf. I am just not fully sure whether we are changing the column names from language names (C++, Python, etc.) to implementation names (Arrow, cuDF, etc.) as @pitrou suggested, or filling in the rows with the implementation names that support a feature in that language.

@alippai (Contributor, author) commented Jun 25, 2024

It seems there are no objections. Let me do one more round tomorrow with the updates mentioned.

@alamb (Collaborator) left a comment

Let's get this PR merged!

I started sketching out what a parquet integration suite might look like here: apache/arrow-rs#5956

Once this PR is merged, I will work on honing the integration suite and use that to fill out this table.

---
### Physical types

| Data type | C++ | Java | Go | Rust |

A Collaborator left a comment:

I agree -- let's rename the columns later (as we fill out the details) with some name that can be mapped to the implementation (e.g. parquet-cpp, parquet-java, etc.)

title: "Implementation status"
linkTitle: "Implementation status"
weight: 8
---

A Member left a comment:

Maybe add a legend:
✅ supported
❌ not supported
[blank] no data

The main goal is to clarify the difference between missing information and an unsupported feature.

A Collaborator left a comment:

Proposed addition (targeting this PR) in alippai#1

@alamb (Collaborator) commented Jun 26, 2024

I tried to address @julienledem's comments in alippai#1

Add introduction paragraph and legend

@mhaseeb123 commented Jun 27, 2024

Hi @alamb, thank you for all the work and for addressing the comments. I saw you added a legend and some description in alippai#1 but I don't see cuDF in there and also am not sure how to exactly fill the table for cuDF. This is confusing especially as cuDF provides C++, Python and (limited? @etseidl) Java Parquet interfaces with GPU-accelerated backends.

Please do note that I am in favor of merging the PR and making things live for now, and thinking about this later in a separate PR as well.

@emkornfield (Contributor) commented:

@alippai were you planning on adding anything else here or should we merge?

@alamb (Collaborator) commented Jun 28, 2024

I saw you added a legend and some description in alippai#1 but I don't see cuDF in there and also am not sure how to exactly fill the table for cuDF. This is confusing especially as cuDF provides C++, Python and (limited? @etseidl) Java Parquet interfaces with GPU-accelerated backends.

My suggestion is to add a column/columns for cuDF in a future PR when you have information to fill out.

I think the intent of these tables is to describe the Parquet implementations (so it lists parquet-cpp, not its Python bindings in pyarrow, for example). Perhaps the same model would apply to the Parquet reader in cuDF.

In terms of how to fill out the table, we are discussing various options for "what does it mean to support a feature" on apache/parquet-format#441.

@emkornfield (Contributor) commented:

@alippai thank you for getting this started, I'm going to merge now to let others try to fill it out.

@emkornfield merged commit 19eb00f into apache:production on Jul 4, 2024