[FEAT] Include file paths as column from read_parquet/csv/json #2953

Open: colin-ho wants to merge 5 commits into main from colin/include-path-in-read

colin-ho (Contributor) commented Sep 26, 2024

Addresses: #2808

This PR adds a file_path_column: str | None parameter to file reads so that the source file path can be included as a column. It works by appending a column containing the file path to the Table after the read and after pushdowns have been applied.

Taking the column name as a string makes it easy to guarantee unique field names: if the user specifies a column name that already exists, an error is thrown.
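
To illustrate the behaviour described above, here is a minimal sketch in plain Python. It is not the code in this PR: a dict of lists stands in for the Table, and the helper name append_file_path_column and the sample path are hypothetical. It only shows the two rules stated here, namely that the path column is appended after the read and that a clashing column name is rejected.

```python
from typing import Optional

# Stand-in for a Table: column name -> list of values (assumption, not Daft's type).
Table = dict


def append_file_path_column(
    table: Table, path: str, file_path_column: Optional[str]
) -> Table:
    """Return a copy of `table` with the source file path appended as a column."""
    if file_path_column is None:
        # Parameter not set: leave the table unchanged.
        return table
    if file_path_column in table:
        # Unique field name guarantee: refuse to shadow an existing column.
        raise ValueError(f"column {file_path_column!r} already exists")
    # Every row in this table came from the same file, so repeat the path.
    num_rows = len(next(iter(table.values()))) if table else 0
    return {**table, file_path_column: [path] * num_rows}


# Rows as they might look after the read and any pushdowns have been applied.
rows = {"id": [1, 2, 3]}
with_path = append_file_path_column(rows, "data/part-0.parquet", "path")
print(with_path)
```

Because the column is appended after pushdowns, only the rows and columns that survive filtering and projection pay for the extra values.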

github-actions bot added the enhancement (New feature or request) label Sep 26, 2024

codspeed-hq bot commented Sep 26, 2024

CodSpeed Performance Report

Merging #2953 will not alter performance

Comparing colin/include-path-in-read (945cfc7) with main (c5a6d88)

Summary

✅ 17 untouched benchmarks

codecov bot commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 84.68900% with 32 lines in your changes missing coverage. Please review.

Project coverage is 78.43%. Comparing base (c5a6d88) to head (945cfc7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-json/src/read.rs | 68.57% | 11 Missing ⚠️ |
| src/daft-scan/src/lib.rs | 70.58% | 10 Missing ⚠️ |
| src/daft-plan/src/builder.rs | 80.00% | 6 Missing ⚠️ |
| src/daft-csv/src/read.rs | 93.93% | 2 Missing ⚠️ |
| src/daft-micropartition/src/python.rs | 50.00% | 1 Missing ⚠️ |
| src/daft-parquet/src/python.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-scan/src/python.rs | 87.50% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2953      +/-   ##
==========================================
+ Coverage   78.39%   78.43%   +0.03%     
==========================================
  Files         597      597              
  Lines       69706    69882     +176     
==========================================
+ Hits        54648    54811     +163     
- Misses      15058    15071      +13     
Flag Coverage Δ
78.43% <84.68%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| daft/io/_csv.py | 95.65% <ø> (ø) |
| daft/io/_json.py | 91.30% <ø> (ø) |
| daft/io/_parquet.py | 86.20% <ø> (ø) |
| daft/io/common.py | 85.00% <ø> (ø) |
| src/daft-json/src/local.rs | 87.17% <100.00%> (+0.70%) ⬆️ |
| src/daft-local-execution/src/sources/scan_task.rs | 91.83% <100.00%> (+0.53%) ⬆️ |
| src/daft-micropartition/src/micropartition.rs | 90.86% <100.00%> (+0.11%) ⬆️ |
| src/daft-micropartition/src/ops/cast_to_schema.rs | 100.00% <100.00%> (ø) |
| src/daft-parquet/src/read.rs | 75.02% <100.00%> (+0.35%) ⬆️ |
| src/daft-scan/src/anonymous.rs | 76.92% <100.00%> (+1.24%) ⬆️ |

... and 9 more

... and 5 files with indirect coverage changes
