
Update the packaging and deployment process for Dagster and bootstrap dbt project #240

Merged
merged 23 commits into main from elt_refactor
Jun 30, 2022

Conversation

blarghmatey
Member

As part of our effort to bootstrap our data platform we need to have a solid foundation to build from. This restructures the repository to be aligned with the purpose of being the central location for all data platform related code.

  • Install and configure Pants as the build tool (the same tool we use for our other mono-repos)
  • Restructure the Dagster pipelines around the core abstractions of ops, graphs, and jobs, rather than segmenting logic by pipeline. This lets us more easily scale the definition and reuse of processing logic and encapsulate business needs along the repository groupings.
  • Bootstrap a dbt project with some initial configuration
  • Add build logic to allow easily packaging Dagster repositories (collections of pipelines, sensors, and schedules) into discrete Docker images so that they can be deployed independently
  • Restructure the directory/module naming to better indicate each module's purpose

@blarghmatey
Member Author

This addresses the work needed for #233

'pants.backend.python.typecheck.mypy',
'pants.backend.shell',
'pants.backend.shell.lint.shellcheck',
'pants.backend.shell.lint.shfmt',
Contributor


Might be a good idea to add bandit to the list
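For reference, enabling bandit would mean adding its lint backend to the `backend_packages` list in pants.toml (assuming a Pants 2.x release that ships the `pants.backend.python.lint.bandit` backend):

```toml
[GLOBAL]
backend_packages = [
  # ...existing backends...
  'pants.backend.python.lint.bandit',
]
```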

@blarghmatey blarghmatey force-pushed the elt_refactor branch 2 times, most recently from 66462c6 to 7a4f010 Compare June 15, 2022 19:42
@blarghmatey
Member Author

This addresses the changes needed for #233

We are working on building out a data platform with the core building block being a data
lakehouse that is populated and managed with an ELT workflow. To that end, this adds
additional directory structures, and renames others, to make it clearer which parts of
the code are responsible for which tasks.

- Rename ol_data_pipelines -> ol_orchestrate to make it clear that this code is relevant
to data orchestration workflows
- Create an ol_dbt directory for holding dbt model definitions and associated
configurations and workflows
- Add Pants to the repository for handling build and packaging of Dagster orchestration
  pipelines to simplify build and deployment workflows

In order to reduce the overhead of building and deploying the different pipelines, this
updates the structure of the repository and the build flow.

The goal is to have BUILD targets defined that specify the Dagster repository to
package, build that into a Python distribution, and install that distribution into a
Docker image that will get published for deployment.
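The target layout described here might look roughly like the following in a Pants BUILD file (BUILD files use Python syntax; the target names, version, and dependency paths below are illustrative, not the actual repo contents):

```python
# BUILD file sketch -- names and version are hypothetical

python_distribution(
    name="ol_orchestrate-dist",
    dependencies=["src/ol_orchestrate"],
    provides=python_artifact(
        name="ol-orchestrate",
        version="0.1.0",
    ),
)

docker_image(
    name="ol_orchestrate-image",
    # Tagging lets us select all image targets at once, e.g.
    # `./pants --tag=docker_image package ::`
    tags=["docker_image"],
)
```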

This also moves to a layout with ops/jobs/graphs as the primary concern so that we can
abstract the actual tasks across the different business concerns to improve logic reuse.
As part of the updated build/deployment we want to have separate images for the Dagit
and dagster-daemon processes, which are also separate from the user pipeline code so
that they can all be built, deployed, and scaled independently. For the user pipelines
we also want to ensure that the dbt project is available in the runtime
environment. This does the following:
- Copy all files related to the dbt project into user pipeline images by default
- Create a multi-stage build for Dagit/dagster-daemon to avoid duplicate logic
- Moves the Dagster-specific workspace and Dagster yaml files into the `ol_orchestrate` directory
- Moves the dbt project files to the proper directory level in the repo
- Adds the initial work to package up collections of Dagster pipelines based on the 'repository' as the entry-point for the Python distribution

Start iterating on how to run image builds in Concourse pipelines so that build and
push workflows are automated, streamlining deployment.
- Tag images with `docker_image` to allow for filtering in the pants command
- Package all targets tagged with `docker_image`

We are no longer relying on Invoke in this repository, and the package script that it
managed is not how we will build and deploy pipelines going forward. This removes the
`tasks.py` file that was used for those build steps.
@blarghmatey blarghmatey marked this pull request as ready for review June 29, 2022 20:01
pants.toml — review thread resolved

The PEX binaries that are being created for running the dagit and dagster-daemon
processes were being defined separately from the locations where they were being
used. This consolidates them into the same BUILD file for more clarity.
@blarghmatey blarghmatey requested a review from shaidar June 30, 2022 12:53
@shaidar
Contributor

shaidar commented Jun 30, 2022

👍

@blarghmatey blarghmatey merged commit 837284a into main Jun 30, 2022
@blarghmatey blarghmatey deleted the elt_refactor branch June 30, 2022 12:57
Labels: Data DevOps, enhancement