Split apply combine #45
base: legacy
Conversation
@walljcg here's a start on how we can leverage Lithops map-reduce (as demonstrated in #28 (comment)) in the context of ecoscope workflows/tasks. The core task is here:

And here is an example of what it might look like compiled into a runnable workflow: ecoscope-workflows/examples/dags/time_density_map_reduce.script_lithops.py, Lines 79 to 104 in 6e416b6
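To give a feel for the pattern in that script, here's a minimal runnable sketch of Lithops map-reduce, with toy functions standing in for the real tasks:

```python
# Minimal sketch of the Lithops split-apply-combine pattern, with toy
# functions in place of the real ecoscope-workflows tasks.
import lithops

def apply_fn(chunk):
    # "apply": process one split independently (e.g. per-animal time density)
    return sum(chunk)

def combine_fn(results):
    # "combine": reduce the mapped results into a single output
    return sum(results)

# "split": the pre-partitioned input
chunks = [[1, 2], [3, 4], [5, 6]]

# Runs on local processes by default; with a cloud backend configured
# (e.g. GCP Cloud Run), each apply_fn invocation becomes a serverless call.
fexec = lithops.FunctionExecutor()
fexec.map_reduce(map_function=apply_fn, map_iterdata=chunks, reduce_function=combine_fn)
print(fexec.get_result())  # -> 21
```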
💡 Note that this does not yet run; it's more like one step above pseudocode. Among other things, we don't of course actually need to split-apply-combine for time density maps in this way; I'm just using that as a toy example here.

In terms of where a script like this could run: locally, it can parallelize across Python processes. In the cloud, we could package the script itself as a "launcher" serverless function on GCP Cloud Run which, once it reaches the lithops map-reduce section, would spawn additional Cloud Run functions. The "launcher" function (running this script) would then wait for all of the parallelized tasks to complete, gather the result, and send it back to wherever we need it (a database API call to the server, maybe). Note also that this pattern (launch lithops from another cloud function) is conceptually almost identical to the Lithops Airflow Operator. (Which we may also want to use in the future for more complex DAGs, but "deploy lithops from a cloud function" is definitely lower latency for "easy/small" DAGs, as we've discussed.)

I've also iterated a bit on the workflow YAML spec design, so we can represent the map-reduce operation there something like this: ecoscope-workflows/examples/compilation-specs/time-density.yaml, Lines 39 to 55 in 6e416b6
To make this more legible, I'm experimenting with borrowing the GitHub Actions data structure of:

```yaml
- name: "a human readable name"
  id: |
    # a unique id. we were using the task names for this before,
    # but to support reuse of the same task within a workflow, we'll need an `id`
  uses: # the action name in github, or for us, the task importable reference
  from: |
    # i'm not actually using this here, and i don't think it's part of GitHub Actions,
    # but i thought this might be a nice way to include the github or pypi path for
    # extension task packages, which could be dynamically installed in a venv at compile
    # time (like pre-commit). https://github.com/moradology/venvception is a nice little
    # package that does this (ephemeral venvs), that a former collaborator wrote for our
    # last project.
  with:
    # kwargs to pass to the task
```
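As a sketch of what consuming an entry like this at compile time could look like (hypothetical model and field names, pydantic chosen purely for illustration, not the actual ecoscope-workflows compiler):

```python
# Hypothetical sketch only: not the actual ecoscope-workflows compiler models.
from pydantic import BaseModel, Field

class TaskInstance(BaseModel):
    name: str                  # human readable name
    id: str                    # unique within the workflow, so a task can be reused
    uses: str                  # importable reference to the task
    from_: str | None = Field(default=None, alias="from")    # optional package source
    with_: dict = Field(default_factory=dict, alias="with")  # kwargs to pass to the task

# Example entry (values made up for illustration)
entry = {
    "name": "a human readable name",
    "id": "time_density_0",
    "uses": "ecoscope_workflows.tasks.analysis.calculate_time_density",
    "with": {"pixel_size": 250.0},
}
task = TaskInstance.model_validate(entry)
```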
…to split-apply-combine
TODO:
e.g. "filters": [
{
"title": "Animal Name",
"key": "animal_name",
"oneOfEnum": {
"type": "string",
"oneOf": [
{
"const": "Ao",
"title": "Ao"
},
{
"const": "Bo",
"title": "Bo"
}
]
} |
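Under standard JSON Schema semantics, that `oneOf`-of-`const`s behaves like a labeled enum; a quick check with the `jsonschema` package:

```python
# The "oneOf of consts" pattern acts as a labeled enum under plain
# JSON Schema validation (demonstrated with the `jsonschema` package).
from jsonschema import ValidationError, validate

one_of_enum = {
    "type": "string",
    "oneOf": [
        {"const": "Ao", "title": "Ao"},
        {"const": "Bo", "title": "Bo"},
    ],
}

validate("Ao", one_of_enum)  # passes silently

try:
    validate("Co", one_of_enum)  # not one of the allowed consts
except ValidationError as err:
    print(err.message)
```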
From discussion with @juanlescano-ng, for compatibility with:

```json
{
  "schema": {
    "type": "object",
    "properties": {
      "month": {
        "type": "string",
        "enum": ["January", "February"],
        "enumNames": ["January", "February"],
        "default": "January"
      },
      "animal_name": {
        "type": "string",
        "enum": ["Ao", "Bo"],
        "enumNames": ["Ao", "Bo"],
        "default": "Ao"
      }
    }
  },
  "uiSchema": {
    "animal_name": {
      "ui:title": "Animal Name",
      "ui:emptyValue": "Ao",
      "ui:help": "The name of the elephant"
    },
    "month": {
      "ui:title": "Month",
      "ui:help": "The month of the year."
    }
  }
}
```

This can be demonstrated by copy-and-paste here:
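For reference, translating the earlier `oneOf`/`const`/`title` shape into this flat `enum`/`enumNames` shape is mechanical; a hypothetical helper (not part of the codebase):

```python
# Hypothetical helper: convert a `oneOf`-of-consts subschema (with titles)
# into the flat `enum` / `enumNames` shape shown above.
def one_of_to_enum(one_of_enum: dict) -> dict:
    options = one_of_enum["oneOf"]
    return {
        "type": one_of_enum.get("type", "string"),
        "enum": [opt["const"] for opt in options],
        "enumNames": [opt.get("title", str(opt["const"])) for opt in options],
        "default": options[0]["const"],  # assumes the first option is the default
    }

print(one_of_to_enum({
    "type": "string",
    "oneOf": [
        {"const": "Ao", "title": "Ao"},
        {"const": "Bo", "title": "Bo"},
    ],
}))
# {'type': 'string', 'enum': ['Ao', 'Bo'], 'enumNames': ['Ao', 'Bo'], 'default': 'Ao'}
```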
Here's a demo of this in action: Screen.Recording.2024-07-16.at.10.52.00.AM.mov
Closes #28