Task: groupby (i.e. "split") #109

Closed
cisaacstern opened this issue Jul 29, 2024 · 1 comment

cisaacstern commented Jul 29, 2024

This task uses `pandas.DataFrame.groupby` internally to apply the groupers defined by #72 and split an input dataframe into groups for processing.
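A rough sketch of the in-memory split, assuming for illustration that the groupers from #72 reduce to a list of column names (the actual grouper interface is defined there, not here). The name `split_groups` is hypothetical.

```python
import pandas as pd


def split_groups(df: pd.DataFrame, groupers: list[str]) -> dict[tuple, pd.DataFrame]:
    """Split `df` into one sub-dataframe per unique combination of grouper values.

    `groupers` is assumed to be a list of column names; this is a stand-in
    for whatever grouper objects #72 ends up defining.
    """
    groups = {}
    for key, group in df.groupby(groupers):
        # Normalize keys to tuples so single and multi-column groupers look the same.
        key = key if isinstance(key, tuple) else (key,)
        groups[key] = group
    return groups


df = pd.DataFrame({"animal": ["a", "b", "a"], "value": [1, 2, 3]})
groups = split_groups(df, ["animal"])
```

Each value in `groups` keeps the original index and columns of its rows, so downstream tasks can process the groups independently.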

It may be easiest to do this together with #72 since these features are closely linked.

Depending on the setting (local, cloud) and/or workflow configuration, this task may emit either actual dataframe groups or references (URLs) to serialized groups (i.e. hive-partitioned parquet). The latter is demonstrated in #45.

We should also leave room for the eventual scenario in which the data lives in a backend that supports SQL queries (BigQuery, DuckDB, etc.), in which case the groupby may not need to happen in memory. This is not relevant to our immediate concerns, but if there's a way to do so that doesn't take much extra time, it may be useful to nod to this idea somewhere in the implementation (a NotImplemented code path, etc.) so we keep it in mind.
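For instance, the placeholder could be as small as the following sketch. The `Backend` enum, the `split_dataframe` name, and the `backend` parameter are all hypothetical, chosen only to show where such a code path might sit.

```python
from enum import Enum


class Backend(Enum):
    """Where the groupby runs (hypothetical names)."""

    IN_MEMORY = "in_memory"
    SQL = "sql"  # e.g. BigQuery or DuckDB; not supported yet


def split_dataframe(df, groupers, backend=Backend.IN_MEMORY):
    """Split `df` into a list of groups, leaving a stub for SQL backends."""
    if backend is Backend.SQL:
        # Placeholder so the push-down option stays visible in the code.
        raise NotImplementedError(
            "groupby push-down to a SQL backend is not implemented yet"
        )
    # In-memory path: `df` is expected to be a pandas DataFrame.
    return [group for _, group in df.groupby(groupers)]
```

Raising early keeps the in-memory path untouched while making the unsupported option explicit to callers.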

@cisaacstern
Collaborator Author

closed by #162
