Task: groupby (i.e. "split") #109

cisaacstern · 2024-07-29T22:57:13Z

This task uses `pandas.DataFrame.groupby` internally to apply the groupers defined by #72 and split an input dataframe into groups for processing.
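As a rough illustration of the split step, here is a minimal sketch. The function name `split_groups` and the idea that groupers are plain column names are assumptions made for this example; the actual grouper objects are defined by #72 and may look different.

```python
import pandas as pd


def split_groups(df: pd.DataFrame, groupers: list[str]) -> dict[tuple, pd.DataFrame]:
    """Split `df` into sub-dataframes, one per unique combination of grouper values.

    `groupers` is assumed (hypothetically) to be a list of column names.
    """
    groups = {}
    for key, group in df.groupby(list(groupers)):
        # Depending on the pandas version, a single grouper may yield a scalar key;
        # normalize to a tuple so callers always see the same key shape.
        key = key if isinstance(key, tuple) else (key,)
        groups[key] = group
    return groups


df = pd.DataFrame({"animal": ["cat", "dog", "cat"], "value": [1, 2, 3]})
groups = split_groups(df, ["animal"])
```

Keying the result by tuples keeps single-grouper and multi-grouper cases uniform for downstream tasks.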

It may be easiest to do this together with #72 since these features are closely linked.

Depending on the setting (local, cloud) and/or workflow configuration, this task may emit either actual dataframe groups or references (URLs) to serialized groups (e.g. hive-partitioned Parquet). The latter is demonstrated in #45.

We should also leave room for the eventual scenario in which the data may live in a backend that supports SQL queries (BigQuery, DuckDB, etc.), in which case the groupby may not need to happen in-memory. This is not relevant to our immediate concerns, but if there's a way to do so that doesn't take much extra time, it may be useful to make a nod to this idea somewhere in this implementation (a NotImplemented code path, etc.) so we keep it in mind.
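A placeholder for the SQL scenario could be as small as the sketch below. The `backend` parameter and its values are hypothetical names chosen for illustration, not an agreed interface.

```python
import pandas as pd


def split(data: pd.DataFrame, groupers: list[str], backend: str = "pandas") -> dict:
    """Dispatch the split on a (hypothetical) `backend` setting."""
    if backend == "pandas":
        # Current behavior: in-memory groupby.
        return {key: group for key, group in data.groupby(list(groupers))}
    if backend == "sql":
        # Placeholder: a SQL-capable backend could run the GROUP BY itself,
        # so nothing would need to be loaded into memory here.
        raise NotImplementedError("SQL-backed groupby is not implemented yet")
    raise ValueError(f"Unknown backend: {backend!r}")
```

The explicit `NotImplementedError` branch costs almost nothing and documents the intended extension point.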

cisaacstern · 2024-08-08T19:17:12Z

closed by #162

cisaacstern self-assigned this Jul 29, 2024

This was referenced Jul 31, 2024

Yaml spec expressiveness #90

Merged

Set groupers, split groups (and mode="mapvalues") #120

Closed

cisaacstern mentioned this issue Aug 7, 2024

Tasks: set groupers + split groups #160

Merged

cisaacstern closed this as completed Aug 8, 2024
