Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Enable feature-flagged native downloader in daft.read_parquet #1190

Merged
merged 13 commits into from
Jul 31, 2023

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Jul 25, 2023

Enables changes to daft.read_parquet such that when we specify use_native_downloader=True, this uses our new Rust-based native Parquet downloading and parsing for:

  1. Schema inference
  2. Retrieving per-file metadata (currently only the number of rows)
  3. Reading the actual Table

@github-actions github-actions bot added the enhancement New feature or request label Jul 25, 2023
@codecov
Copy link

codecov bot commented Jul 25, 2023

Codecov Report

Merging #1190 (34423d5) into main (bacd70e) will increase coverage by 0.02%.
The diff coverage is 81.57%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1190      +/-   ##
==========================================
+ Coverage   88.37%   88.40%   +0.02%     
==========================================
  Files          54       55       +1     
  Lines        5576     5606      +30     
==========================================
+ Hits         4928     4956      +28     
- Misses        648      650       +2     
Files Changed Coverage Δ
daft/execution/execution_step.py 94.07% <ø> (ø)
daft/runners/runner_io.py 91.66% <ø> (ø)
daft/io/config.py 62.50% <62.50%> (ø)
daft/io/parquet.py 93.33% <66.66%> (-6.67%) ⬇️
daft/table/schema_inference.py 94.59% <75.00%> (-2.38%) ⬇️
daft/datasources.py 93.93% <80.00%> (-2.49%) ⬇️
daft/table/table_io.py 94.05% <85.71%> (-0.68%) ⬇️
daft/filesystem.py 86.07% <100.00%> (+0.45%) ⬆️
daft/io/__init__.py 100.00% <100.00%> (ø)
daft/table/table.py 89.21% <100.00%> (+1.71%) ⬆️

... and 1 file with indirect coverage changes

@jaychia jaychia force-pushed the jay/parquet-pyarrow-switch-to-table branch from e118673 to 1e5309e Compare July 27, 2023 03:07
@jaychia jaychia changed the title [FEAT] Enable stub code for native downloader in daft.read_parquet [FEAT] Enable feature-flagged native downloader in daft.read_parquet Jul 27, 2023
parquet_statistics = Table.read_parquet_statistics(
list(filepaths_to_infos.keys()), source_info.io_config
).to_pydict()
for path, num_rows in zip(parquet_statistics["uris"], parquet_statistics["row_count"]):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This code relies on hardcoded strings which are the names of the table's series. A little error prone perhaps?

@@ -364,6 +364,6 @@ def read_parquet_statistics(
io_config: IOConfig | None = None,
) -> Table:
if not isinstance(paths, Series):
paths = Series.from_pylist(paths)

paths = Series.from_pylist(paths, name="uris")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Our Series objects default to the name "list_col". Modified here to have a better name, but needs to be consistent across calls and so I added an assert after.

@@ -34,6 +34,20 @@ impl PyIOConfig {
config: self.config.s3.clone(),
})
}

pub fn __setstate__(&mut self, py: Python, state: PyObject) -> PyResult<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to make this class pickleable for our optimizer's deepcopy

@jaychia jaychia requested a review from samster25 July 27, 2023 04:05
@jaychia jaychia enabled auto-merge (squash) July 31, 2023 17:24
@jaychia jaychia merged commit f43174d into main Jul 31, 2023
17 of 18 checks passed
@jaychia jaychia deleted the jay/parquet-pyarrow-switch-to-table branch July 31, 2023 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant