Split apply combine #45

Draft · wants to merge 53 commits into base: legacy
Conversation

@cisaacstern cisaacstern commented Jun 24, 2024

Closes #28

@cisaacstern cisaacstern changed the base branch from main to data-connections June 24, 2024 03:05
@cisaacstern (Collaborator, Author) commented Jun 24, 2024

@walljcg here's a start on how we can leverage lithops map reduce (as demonstrated in #28 (comment)) in the context of ecoscope workflows/tasks. The core task is here:

@distributed
def map_reduce(
    groups: Annotated[list[tuple[str, ...]], Field()],
    # let's assume that when we call map_reduce, we have already set the
    # arg_prevalidators and return_postvalidator for the mappers and reducer,
    # so by the time lithops sees them, they are ready to call
    mappers: Annotated[list[tuple[DistributedTask, dict]], Field()],
    reducer: Annotated[DistributedTask, Field()],
    reducer_kwargs: Annotated[dict, Field(default_factory=dict)],
):
    import lithops

    # Configure the compute backend (local, cloud, etc.) with a configuration file:
    # https://lithops-cloud.github.io/docs/source/configuration.html#configuration-file
    fexec = lithops.FunctionExecutor()

    def fused_mapper(element):
        for i, (mapper, kwargs) in enumerate(mappers):
            if i == 0:
                result = mapper(element, **kwargs)
            else:
                result = mapper(result, **kwargs)
        return result

    fexec.map_reduce(
        map_function=fused_mapper,
        map_iterdata=groups,
        reduce_function=reducer,
        extra_args_reduce=reducer_kwargs,
    )
    return fexec.get_result()
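The `fused_mapper` closure above chains the `(mapper, kwargs)` pairs so that each mapper's output feeds the next one's input. A standalone sketch of that fusion idea, using plain functions in place of `DistributedTask` instances (the example mappers here are hypothetical, not part of ecoscope-workflows):

```python
def fuse(mappers):
    """Compose (mapper, kwargs) pairs into a single callable, first-to-last."""
    def fused(element):
        result = element
        for mapper, kwargs in mappers:
            result = mapper(result, **kwargs)
        return result
    return fused

# hypothetical mappers standing in for tasks like draw_ecomap / map_to_widget
def double(x, **kwargs):
    return x * 2

def add(x, amount=0, **kwargs):
    return x + amount

fused = fuse([(double, {}), (add, {"amount": 3})])
fused(5)  # 5 * 2 + 3 = 13
```

This is equivalent to the inline `if i == 0` loop in the task, just seeded with `element` directly so the branch disappears.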

And here is an example of what it might look like compiled into a runnable workflow:

# this can parallelize on local threads, gcp cloud run, or other cloud serverless
# compute backends, depending on the lithops configuration set at runtime.
map_reduce_return = map_reduce(
    groups=split_groups_return,
    mappers=[
        (
            draw_ecomap.replace(
                # the dataframe is given here as a list of URIs to parquet files,
                # so we need to deserialize it back into a geodataframe
                arg_prevalidators={"geodataframe": gpd_from_parquet_uri},
                return_postvalidator=functools.partial(
                    html_text_to_uri,
                    uri=os.environ["ECOSCOPE_WORKFLOWS_TMP"],
                ),
                validate=True,
            ),
            set_map_styles_return,
        ),
        (
            map_to_widget.replace(validate=True),
            {},
        ),
    ],
    reducer=gather_dashboard,
    reducer_kwargs=set_groupers_return,
)

💡 Note that this does not yet run; it is one step above pseudocode. Among other things, we don't actually need to split-apply-combine for time density maps in this way; I'm just using that as a toy example here.

In terms of where a script like this could run, if it runs locally, it can parallelize across python processes. In the cloud, we could package this script itself as a "launcher" serverless function on GCP Cloud Run, which once it gets to the lithops map-reduce section, would spawn additional Cloud Run functions. The "launcher" function (running this script) would then wait for all of the parallelized tasks to complete, gather the result, and send it back to wherever we need it (database api call to the server maybe).

Note also that this pattern (launch lithops from another cloud function) is conceptually almost identical to the Lithops Airflow Operator. (Which we may also want to use in the future for more complex DAGs, but "deploy lithops from a cloud function" is definitely lower latency for "easy/small" DAGs, as we've discussed.)
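The fan-out/gather control flow described above can be emulated locally with only the standard library, which may help make the "launcher" pattern concrete: map over the groups in parallel, then hand all map results to a single reducer. This is only an illustration of the flow; real lithops dispatches each mapper invocation to a serverless function (or process) per its configuration, rather than a thread.

```python
from concurrent.futures import ThreadPoolExecutor

def local_map_reduce(map_function, map_iterdata, reduce_function, **reduce_kwargs):
    # fan out: one mapper call per element of map_iterdata
    with ThreadPoolExecutor() as pool:
        map_results = list(pool.map(map_function, map_iterdata))
    # gather: the reducer sees the full list of map results
    return reduce_function(map_results, **reduce_kwargs)

result = local_map_reduce(
    map_function=lambda group: len(group),
    map_iterdata=[("a", "b"), ("c",)],
    reduce_function=lambda results, offset=0: sum(results) + offset,
    offset=10,
)
# result == 13  (2 + 1, plus the offset)
```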

I've also iterated a bit on the workflow YAML spec design, so we can represent the map reduce operation there, something like this:

- name: Map reduce
  id: map-reduce
  uses: ecoscope_workflows.tasks.parallelism.map_reduce
  with:
    groups: split-groups.return
    mappers:
      - uses: ecoscope_workflows.tasks.results.draw_ecomap
        with:
          # kws is a reserved name that indicates the value
          # will be a dict and should be unpacked as kwargs
          kws: set-map-styles.return
      - uses: ecoscope_workflows.tasks.results.map_to_widget
        with: {}
    reducer:
      uses: ecoscope_workflows.tasks.results.gather_dashboard
      with:
        groupers: set-groupers.return
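One way the compiler might resolve a `uses:` string from the spec into a callable is to split the dotted path into a module and an attribute, then import it. The module/attribute split shown here is my assumption about the spec's semantics, demonstrated with a stdlib function rather than an actual ecoscope_workflows task:

```python
import importlib

def resolve_uses(dotted_path: str):
    """Import the object named by a dotted path like 'pkg.module.task'."""
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)

join = resolve_uses("os.path.join")
join("a", "b")  # same as os.path.join("a", "b")
```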

To make this more legible, I'm experimenting with borrowing the GitHub Actions data structure of:

- name: "a human readable name"
  id: |
    # a unique id. we were using the task names for this before,
    # but to support reuse of the same task within a workflow, we'll need an `id`
  uses:  # the action name in github, or for us, the task importable reference
  from: |
    # i'm not actually using this here, and i don't think it's part of GitHub Actions,
    # but i thought this might be a nice way to include the github or pypi path for
    # extension task packages, which could be dynamically installed in a venv at compile
    # time (like pre-commit). https://github.com/moradology/venvception is a nice little
    # package that does this (ephemeral venvs), that a former collaborator wrote for our
    # last project.
  with:
    # kwargs to pass to the task
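The step shape above could be modeled as a small dataclass, which is a minimal sketch rather than the real spec's implementation. Field names mirror the GitHub-Actions-style keys; `from` and `with` are Python keywords, so they are stored as `from_` and `with_` here (a naming detail the real spec model would need to address):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    name: str                                  # human readable name
    id: str                                    # unique id within the workflow
    uses: str                                  # importable reference to the task
    from_: Optional[str] = None                # optional package source for extension tasks
    with_: dict = field(default_factory=dict)  # kwargs to pass to the task

step = Step(
    name="Map reduce",
    id="map-reduce",
    uses="ecoscope_workflows.tasks.parallelism.map_reduce",
)
```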

@cisaacstern cisaacstern changed the base branch from data-connections to main June 25, 2024 16:29
@atmorling atmorling mentioned this pull request Jul 10, 2024
@cisaacstern

TODO:

e.g.

"filters": [
    {
      "title": "Animal Name",
      "key": "animal_name",
      "oneOfEnum": {
          "type": "string",
          "oneOf": [
             {
               "const": "Ao",
               "title": "Ao"
             },
             {
               "const": "Bo",
               "title": "Bo"
             }
          ]
      }
    }
]
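Each `oneOf` entry pairs a value (`const`) with a display label (`title`); when the two are identical, as in this example, the whole `oneOfEnum` object can be generated from a list of values. A small sketch (the helper name is mine, not from the PR):

```python
def one_of_enum(values: list[str]) -> dict:
    """Build a oneOf-style enum schema where each value is its own title."""
    return {
        "type": "string",
        "oneOf": [{"const": v, "title": v} for v in values],
    }

one_of_enum(["Ao", "Bo"])
# {'type': 'string', 'oneOf': [{'const': 'Ao', 'title': 'Ao'}, {'const': 'Bo', 'title': 'Bo'}]}
```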

@cisaacstern

From discussion with @juanlescano-ng, for compatibility with react-jsonschema-form, we actually want to adjust the filters section so that it looks like this:

{
  "schema": {
    "type": "object",
    "properties": {
      "month": {
        "type": "string",
        "enum": [
          "January",
          "February"
        ],
        "enumNames": [
          "January",
          "February"
        ],
        "default": "January"
      },
      "animal_name": {
        "type": "string",
        "enum": [
          "Ao",
          "Bo"
        ],
        "enumNames": [
          "Ao",
          "Bo"
        ],
        "default": "Ao"
      }
    }
  },
  "uiSchema": {
    "animal_name": {
      "ui:title": "Animal Name",
      "ui:emptyValue": "Ao",
      "ui:help": "The name of the elephant"
    },
    "month": {
      "ui:title": "Month",
      "ui:help": "The month of the year."
    }
  }
}

This can be demonstrated by copy-and-pasting it into the live playground here:

https://rjsf-team.github.io/react-jsonschema-form/
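Generating the `schema`/`uiSchema` pair above from simple field definitions could look like the following sketch. The helper name and input shape are my assumptions, not part of the PR:

```python
def build_filters(fields: dict) -> dict:
    """Build an rjsf-style {"schema": ..., "uiSchema": ...} payload.

    `fields` maps a property key to {"title": ..., "choices": [...], "help": ...?}.
    """
    schema = {"type": "object", "properties": {}}
    ui_schema = {}
    for key, spec in fields.items():
        schema["properties"][key] = {
            "type": "string",
            "enum": spec["choices"],
            "enumNames": spec["choices"],
            "default": spec["choices"][0],
        }
        ui_schema[key] = {"ui:title": spec["title"]}
        if "help" in spec:
            ui_schema[key]["ui:help"] = spec["help"]
    return {"schema": schema, "uiSchema": ui_schema}

payload = build_filters({
    "month": {
        "title": "Month",
        "choices": ["January", "February"],
        "help": "The month of the year.",
    },
})
```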

@cisaacstern

Here's a demo of this in action:

Screen.Recording.2024-07-16.at.10.52.00.AM.mov

Successfully merging this pull request may close these issues.

POC: Split-Apply-Combine using lithops