Use Spark DataSourceV2 to handle Parquet files #455

Open
Aryex opened this issue Jul 6, 2022 · 1 comment
Labels
enhancement New feature or request Low Priority

Comments

@Aryex
Collaborator

Aryex commented Jul 6, 2022

Description

The connector supports Parquet files by reusing some of Spark's lower-level internal systems. As a result, the connector has to copy private Spark code, handle data partitioning itself, and carry more code to maintain overall.

Spark 3.0.0 added DataSourceV2 support for Parquet, which could be reused to handle Parquet files, similar to how JSON was supported in #370. Note that we would still need to look into how writing would be handled.

This could potentially be looked into as part of #403.

Reason: This change will help reduce the effort of supporting future Spark versions.

@Aryex Aryex added the enhancement New feature or request label Jul 6, 2022
@Aryex Aryex changed the title Use Spark DataSourceV2 Parquet for reading/writing Parquets Use Spark DataSourceV2 Parquet to handle Parquets Jul 6, 2022
@Aryex Aryex changed the title Use Spark DataSourceV2 Parquet to handle Parquets Use Spark DataSourceV2 to handle Parquet files Jul 6, 2022
@alexey-temnikov
Collaborator

Lowering priority, as this is relevant only during a Spark upgrade (when Spark APIs change).
