[FEAT] parquet reader refactor, add parquet_stats_reader and parquet_schema_reader (1/2) #1191
Codecov Report
@@            Coverage Diff             @@
##             main    #1191      +/-   ##
==========================================
- Coverage   88.44%   88.37%   -0.07%
==========================================
  Files          54       54
  Lines        5564     5576      +12
==========================================
+ Hits         4921     4928       +7
- Misses        643      648       +5
Looks good!
max_request_size: 16 * 1024 * 1024,
split_threshold: 24 * 1024 * 1024,
}));
pub fn read_parquet_statistics(uris: &Series, io_client: Arc<IOClient>) -> DaftResult<Table> {
It's not a requirement for us to get the result of this operation as a Table (a Vec<ParquetFileStatstics> or Vec<(int, int, str)> could suffice), but if this avoids more Python glue code then it should be fine.
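To make the trade-off in this comment concrete, here is a minimal sketch of the two result shapes being compared: one record per file versus a columnar layout like a Table. Everything in it is an assumption for illustration: the struct name `ParquetFileStatistics`, the field names `row_count`, `row_group_count`, and `version` (one guess at what `(int, int, str)` could stand for), and the `StatisticsColumns` stand-in. None of it is taken from the PR's actual code.

```rust
/// Hypothetical per-file record; the field names are assumptions,
/// chosen only to match the (int, int, str) shape from the comment.
#[derive(Debug, Clone, PartialEq)]
struct ParquetFileStatistics {
    row_count: u64,
    row_group_count: u64,
    version: String,
}

/// Hypothetical columnar stand-in for a Table: one vector per field,
/// with index i across all vectors describing file i.
#[derive(Debug, Default, PartialEq)]
struct StatisticsColumns {
    row_counts: Vec<u64>,
    row_group_counts: Vec<u64>,
    versions: Vec<String>,
}

/// Transpose a list of per-file records into the columnar layout.
fn to_columns(stats: &[ParquetFileStatistics]) -> StatisticsColumns {
    let mut cols = StatisticsColumns::default();
    for s in stats {
        cols.row_counts.push(s.row_count);
        cols.row_group_counts.push(s.row_group_count);
        cols.versions.push(s.version.clone());
    }
    cols
}

fn main() {
    // Made-up values, purely for demonstration.
    let stats = vec![
        ParquetFileStatistics { row_count: 100, row_group_count: 2, version: "1".to_string() },
        ParquetFileStatistics { row_count: 250, row_group_count: 5, version: "2".to_string() },
    ];
    let cols = to_columns(&stats);
    println!("{:?}", cols);
}
```

With the record form, a caller on the Rust side gets typed fields directly; with the columnar form, the result can be handed to Python as a table without a per-record conversion step, which is the "glue code" concern raised above.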
io_client,
)?
.into())
})
}
Python-facing functions look good, should work!
row_groups: list[int] | None = None,
file_size: None | int = None,
start_offset: int | None = None,
num_rows: int | None = None,
Looks good
Table.read_parquet_statistics to grab basic stats from a series of URLs
Schema.from_parquet to generate a Schema from a parquet file