Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] DataFrame.__iter__() and .iter_partitions() #1062

Merged
merged 6 commits into from
Jun 22, 2023
Merged

Conversation

xcharleslin
Copy link
Contributor

@xcharleslin xcharleslin commented Jun 17, 2023

Implements iteration, via streaming execution, over DataFrames:

  • DataFrame.iter() returns an iterator of rows. Each row is a pydict of the form {"colname": value }.
  • DataFrame.iter_partitions() returns an iterator of partitions. Each partition is a daft.Table object.

Execution semantics:

  • Results are returned as soon as they become available.
  • Current behaviours (not technical restrictions, we can change these if we want):
    • PyRunner: Execution pauses between calls to iterator.next().
    • RayRunner: Execution continues in the background.

Implementation details:

  • Adds new interfaces to Runner:
    • run_iter() -> Iterator[PartitionT] and
    • run_iter_tables() -> Iterator[Table]
  • in addition to the existing Runner.run() -> PartitionCacheEntry. This isn't super clean - ideally we go through a single point of abstraction (PartitionCache) for translating between PartitionT and Table. But we may rewrite a lot of this soon anyway, and for now it is a bit dangerous to shoehorn single-partition behaviour into a PartitionSet.
  • run_iter() is now the new narrow waist. All execution, even df.collect(), now happens through streaming execution.

@codecov
Copy link

codecov bot commented Jun 17, 2023

Codecov Report

Merging #1062 (39f183b) into main (1c49647) will increase coverage by 0.24%.
The diff coverage is 92.77%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
+ Coverage   88.64%   88.89%   +0.24%     
==========================================
  Files          54       54              
  Lines        5382     5447      +65     
==========================================
+ Hits         4771     4842      +71     
+ Misses        611      605       -6     
Impacted Files Coverage Δ
daft/runners/runner.py 79.31% <80.00%> (-0.69%) ⬇️
daft/runners/ray_runner.py 92.09% <92.30%> (+0.08%) ⬆️
daft/dataframe/dataframe.py 89.85% <95.45%> (+0.26%) ⬆️
daft/runners/pyrunner.py 94.76% <100.00%> (+0.15%) ⬆️

... and 1 file with indirect coverage changes

@xcharleslin xcharleslin changed the title Charles/df iter 2 [FEAT] DataFrame.__iter__() and .iter_partitions() Jun 22, 2023
@github-actions github-actions bot added the enhancement New feature or request label Jun 22, 2023
@xcharleslin xcharleslin marked this pull request as ready for review June 22, 2023 04:08
@xcharleslin xcharleslin merged commit eb7e2c8 into main Jun 22, 2023
@xcharleslin xcharleslin deleted the charles/df-iter-2 branch June 22, 2023 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant