[FEAT] Add `DataFrame.to_torch_map_dataset` and `.to_torch_iter_dataset`. #1086

xcharleslin · 2023-06-23T01:11:29Z

Adds two new top-level APIs to DataFrame, `to_torch_map_dataset` and `to_torch_iter_dataset`, which return the corresponding PyTorch datasets (TODO: make these DataPipes).

`to_torch_map_dataset` executes the whole dataframe before returning, since it needs random access.
`to_torch_iter_dataset` returns immediately; results are produced via streaming execution.

Both are only meant for use in a single-node setting, with Ray Datasets being the recommended data loading abstraction for distributed training.
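
To make the eager vs. streaming distinction concrete, here is a minimal torch-free sketch of the two protocols involved. It is not the code in this PR: `produce_partitions`, `MapStyleRows`, and `IterStyleRows` are hypothetical names used only for illustration. The map-style class mirrors what `torch.utils.data.Dataset` expects (`__len__` and `__getitem__`), and the iterable-style class mirrors `torch.utils.data.IterableDataset` (`__iter__` only).

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def produce_partitions() -> Iterator[List[Row]]:
    """Hypothetical stand-in for dataframe execution: yields rows one partition at a time."""
    for start in (0, 3):
        yield [{"x": i, "y": i * i} for i in range(start, start + 3)]


class MapStyleRows:
    """Map-style protocol: __len__ plus __getitem__ (random access)."""

    def __init__(self) -> None:
        # Random access needs every row, so all partitions are materialized up front.
        self._rows: List[Row] = [row for part in produce_partitions() for row in part]

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Row:
        return self._rows[index]


class IterStyleRows:
    """Iterable-style protocol: __iter__ only (sequential access)."""

    def __iter__(self) -> Iterator[Row]:
        # Nothing runs at construction time; partitions are pulled lazily
        # as the consumer iterates.
        for part in produce_partitions():
            yield from part


map_rows = MapStyleRows()      # all rows are computed here
iter_rows = IterStyleRows()    # returns immediately, no rows computed yet
first_streamed = next(iter(iter_rows))  # only now is the first partition produced
```

Both objects yield the same rows; the difference is when the work happens and whether indexing by position is possible.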

codecov · 2023-06-23T01:18:28Z

Codecov Report

Merging #1086 (2aeb22e) into main (cd59693) will decrease coverage by 0.55%.
The diff coverage is 10.52%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1086      +/-   ##
==========================================
- Coverage   88.87%   88.33%   -0.55%     
==========================================
  Files          53       54       +1     
  Lines        5439     5477      +38     
==========================================
+ Hits         4834     4838       +4     
- Misses        605      639      +34

Impacted Files	Coverage Δ
daft/dataframe/to_torch.py	`0.00% <0.00%> (ø)`
daft/dataframe/dataframe.py	`88.84% <40.00%> (-1.02%)`	⬇️

clarkzinzow

@xcharleslin Meta comment: we might want to subclass the new canonical Torch data abstractions, IterDataPipe and MapDataPipe.

xcharleslin · 2023-06-23T02:54:03Z

@clarkzinzow Hahaha, was just gonna ask for your input on this one :) I'll do that, thanks!

Do the API naming and docs look good to you as well?

Add DataFrame.to_torch_dataset and to_torch_iter_dataset.

721e688

github-actions bot added the enhancement New feature or request label Jun 23, 2023

clarkzinzow reviewed Jun 23, 2023


Docs updates

b0d94c8

github-actions bot added the documentation Improvements or additions to documentation label Jun 23, 2023

xcharleslin changed the title ~~[FEAT] Add DataFrame.to_torch_dataset and to_torch_iter_dataset.~~ [FEAT] Add DataFrame.to_torch_map_dataset and .to_torch_iter_dataset. Jun 23, 2023

Prefer torchdata DataPipes over torch Datasets

2aeb22e

xcharleslin marked this pull request as ready for review June 23, 2023 20:44

xcharleslin enabled auto-merge (squash) June 23, 2023 20:44

xcharleslin merged commit e101ed6 into main Jun 23, 2023

xcharleslin deleted the charles/torch-dataset branch June 23, 2023 21:00

jaychia added the highlight label Jun 26, 2023
