[Pipelines] Design - Data orchestration input #1005

Open
Jason94 opened this issue Feb 27, 2024 · 2 comments
Labels
Pipelines Project: This issue is for the Pipelines project

Comments

@Jason94
Collaborator

Jason94 commented Feb 27, 2024

Overview

The Pipelines project targets two user groups. One of these is advanced users who are already fluent in Python. One of the main value-adds of pipelines for advanced users is easy data orchestration integration. Data orchestration gives many benefits, such as error logging and data visibility. It is a key goal of the pipelines system that you get drop-in data orchestration of your pipelines "for free."

Currently (2/27/2024), the pipelines branch has a hard-coded Prefect integration. This is a good proof of concept, since the Prefect integration is entirely behind the scenes. However, because Prefect is closed source and cloud based, it's not acceptable to lock pipelines into that tool.

Discussion

The initial goal of this discussion is to gather input from the community about:

  1. What data orchestration tools you use
  2. How you use those data orchestration tools
  3. How your code interacts with those data orchestration tools

Once we have collected data about a wide variety of tools, we will design an abstraction that allows the pipelines system to work with as many data orchestration tools as possible. Then, data orchestration "plugins" that target the abstraction can be added either inside or outside of Parsons, allowing pipelines to be used with any data orchestration platform.

Without a thorough discussion of different data orchestration use cases, we risk designing an abstraction that cannot accommodate many of the tools that pipelines users will want to target in their code.
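
To make the target concrete, here is a rough, hypothetical sketch of the kind of abstraction I have in mind. All names here (`OrchestrationPlugin`, `PrefectPlugin`, `run_pipeline`) are illustrative only, not an actual Parsons API:

```python
from typing import Callable, Optional, Protocol


class OrchestrationPlugin(Protocol):
    """Hypothetical interface an orchestration backend would implement."""

    def wrap_pipeline(self, pipeline: Callable) -> Callable:
        """Return the pipeline wrapped with the backend's logging/observability."""
        ...


class PrefectPlugin:
    """Illustrative backend that defers to Prefect's flow decorator."""

    def wrap_pipeline(self, pipeline: Callable) -> Callable:
        from prefect import flow  # imported lazily so Prefect stays optional

        return flow(pipeline)


def run_pipeline(pipeline: Callable, plugin: Optional[OrchestrationPlugin] = None):
    """Run a pipeline, wrapped by an orchestration plugin if one is supplied."""
    wrapped = plugin.wrap_pipeline(pipeline) if plugin else pipeline
    return wrapped()
```

Plugins targeting other platforms would implement the same interface, so the pipeline code itself never has to know which orchestration tool (if any) is in use.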

Jason94 added the enhancement and Pipelines Project labels and removed the enhancement label on Feb 27, 2024
@austinweisgrau
Collaborator

austinweisgrau commented Feb 28, 2024

TBH I'm not sure I understand the concept of how Parsons could implement orchestration. As I think of it, orchestration really requires cloud infrastructure to be provisioned and configured, including code storage in the cloud (dockerizing and pushing to a docker store or copying code to s3), cloud compute, cloud secret storage for access in production, a healthy layer of IAM roles for development access and appropriately scoped execution privileges, billing information / a credit card on file, etc. etc.

For the Prefect example, wrapping a Python script in @prefect.flow doesn't actually implement "orchestration"; it just means that if that script is run, it will be logged in Prefect Cloud (if a Prefect Cloud account exists and appropriate API keys are set in the environment). Orchestration would also involve bundling the script as a Prefect deployment with a schedule and setting up a cloud execution layer (Prefect doesn't run an execution layer the way some other orchestration platforms like Airflow do; it leaves that up to the user to set up).
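
To illustrate, the wrapping is roughly this (a minimal sketch; the flow name is made up, and running it only reports to Prefect Cloud if the PREFECT_API_URL / PREFECT_API_KEY settings point at an account):

```python
from prefect import flow


@flow
def sync_voter_file():
    # ... extract/load logic lives here ...
    pass


if __name__ == "__main__":
    # Running this locally logs the run to Prefect Cloud if the API settings
    # are configured, but it does not schedule or deploy anything.
    sync_voter_file()
```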

Most of this feels outside the scope of what a Python package (Parsons) can really implement.

@Jason94
Collaborator Author

Jason94 commented Feb 29, 2024

Austin, those are some good points. Here is what the current Prefect implementation provides and what it doesn't.

What it doesn't provide:

  • Provide a cloud platform to execute your code (Civis, Airflow, like you said)
  • Automatically schedule your jobs to run, either locally or in the cloud
  • Containerize anything or manage the environment (personally I'd consider this "dev ops", not "data orchestration", but it doesn't really matter)

What it does provide:

  • Provide visibility, via a cloud interface, into what scripts are running and what those scripts are doing
  • Provide better visibility into where, when, and why errors occurred

I'm not convinced that it couldn't help with scheduling and some of that other stuff, although that would depend heavily on whatever plugins we built. An option I looked into was Apache Airflow, which I think could be integrated in a similar way to how Prefect is currently handled.
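
For example, a thin Airflow integration could expose a pipeline as a DAG roughly like this. This is only a sketch assuming Airflow 2.x's TaskFlow API (the `schedule` argument is the newer spelling), and `parsons_pipeline` is a made-up name, not an existing integration:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def parsons_pipeline():
    @task
    def extract():
        # ... pull data, e.g. with a Parsons connector ...
        return []

    @task
    def load(rows):
        # ... push data to its destination ...
        pass

    load(extract())


# Calling the decorated function registers the DAG with Airflow.
parsons_pipeline()
```

Unlike the current Prefect wrapping, this would get us scheduling and an execution layer, since Airflow runs the jobs itself.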

I think you raise three really good questions for this part of the design:

  1. Maybe there is a better name for what we're talking about here than "data orchestration".
  2. What set of "data orchestration" features do our users want? Austin, are some of the things you mentioned, like scheduling or interacting with cloud infrastructure, features you would ideally like to see?
  3. Based on the answer to 2, what are some tools we could build off of that could provide more of those features than the current Prefect integration does? As I mentioned, Apache Airflow is a high-power library that could be of use. There's also probably more to Prefect that I haven't explored; what's currently in there is more of a proof of concept than anything. If someone's used Prefect more than I have, it'd be great if they could weigh in on which of these kinds of tasks Prefect is and isn't capable of handling.
