
Update the packaging and deployment process for Dagster and bootstrap dbt project #240

Merged
merged 23 commits into main from elt_refactor
Jun 30, 2022

Conversation

blarghmatey
Member

As part of our effort to bootstrap our data platform we need to have a solid foundation to build from. This restructures the repository to be aligned with the purpose of being the central location for all data platform related code.

  • Install and configure Pants as the build tool (the same tool we use for our other mono-repos)
  • Restructure the Dagster pipelines around the core abstractions of ops, graphs, and jobs, rather than segmenting logic by pipeline. This lets us more easily scale the definition and reuse of processing logic and encapsulate business needs along the repository groupings.
  • Bootstrap a dbt project with some initial configuration
  • Add build logic to allow easily packaging Dagster repositories (collections of pipelines, sensors, and schedules) into discrete Docker images so that they can be deployed independently
  • Restructure the directory/module naming to better indicate each module's purpose

@blarghmatey
Member Author

This addresses the work needed for #233

'pants.backend.python.typecheck.mypy',
'pants.backend.shell',
'pants.backend.shell.lint.shellcheck',
'pants.backend.shell.lint.shfmt',
Contributor


Might be a good idea to add bandit to the list
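For reference, enabling bandit would mean adding its lint backend to the `backend_packages` list in pants.toml (assuming a Pants 2.x release that ships the `pants.backend.python.lint.bandit` backend):

```toml
[GLOBAL]
backend_packages = [
  # ...existing backends...
  'pants.backend.python.lint.bandit',
]
```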

@blarghmatey blarghmatey force-pushed the elt_refactor branch 2 times, most recently from 66462c6 to 7a4f010 Compare June 15, 2022 19:42
@blarghmatey
Member Author

This addresses the changes needed for #233

We are working on building out a data platform with the core building block being a data
lakehouse that is populated and managed with an ELT workflow. To that end, this adds
additional directory structures, and renames others, to make it clearer which parts of
the code are responsible for which tasks.

- Rename ol_data_pipelines -> ol_orchestrate to make it clear that this code is relevant
to data orchestration workflows
- Create an ol_dbt directory for holding dbt model definitions and associated
configurations and workflows
- Add Pants to the repository for handling build and packaging of Dagster orchestration
  pipelines to simplify build and deployment workflows

In order to reduce the overhead of building and deploying the different pipelines, this
updates the structure of the repository and the build flow.

The goal is to have BUILD targets defined that specify the Dagster repository to
package, build that into a Python distribution, and install that distribution into a
Docker image that will get published for deployment.
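The target layout described here might look roughly like the following in a Pants BUILD file (BUILD files use Python syntax; the target names, version, and dependency paths below are illustrative, not the actual repo contents):

```python
# BUILD file sketch -- names and version are hypothetical

python_distribution(
    name="ol_orchestrate-dist",
    dependencies=["src/ol_orchestrate"],
    provides=python_artifact(
        name="ol-orchestrate",
        version="0.1.0",
    ),
)

docker_image(
    name="ol_orchestrate-image",
    # Tagging lets us select all image targets at once, e.g.
    # `./pants --tag=docker_image package ::`
    tags=["docker_image"],
)
```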

This also moves to a layout with ops/jobs/graphs as the primary concern so that we can
abstract the actual tasks across the different business concerns to improve logic reuse.
As part of the updated build/deployment we want to have separate images for the Dagit
and dagster-daemon processes, which are also separate from the user pipeline code so
that they can all be built, deployed, and scaled independently. For the user pipelines
we also want to ensure that the dbt project is available in the runtime
environment. This does the following:
- Copy all files related to the dbt project into user pipeline images by default
- Create a multi-stage build for Dagit/dagster-daemon to avoid duplicate logic
- Moves the Dagster-specific workspace and Dagster yaml files into the `ol_orchestrate` directory
- Moves the dbt project files to the proper directory level in the repo
- Adds the initial work to package up collections of Dagster pipelines based on the 'repository' as the entry-point for the Python distribution

Start iterating on how to run image builds in Concourse pipelines so that build and
push workflows are automated, streamlining deployment.
- Tag images with `docker_image` to allow for filtering in the pants command
- Package all targets tagged with `docker_image`

We are no longer relying on Invoke in this repository, and the package script that it
managed is not how we will build and deploy pipelines going forward. This removes the
`tasks.py` file that was used for those build steps.
@blarghmatey blarghmatey marked this pull request as ready for review June 29, 2022 20:01
pants.toml — review thread resolved

The PEX binaries that are being created for running the dagit and dagster-daemon
processes were being defined separately from the locations where they were being
used. This consolidates them into the same BUILD file for more clarity.
@blarghmatey blarghmatey requested a review from shaidar June 30, 2022 12:53
@shaidar
Contributor

shaidar commented Jun 30, 2022

👍

@blarghmatey blarghmatey merged commit 837284a into main Jun 30, 2022
@blarghmatey blarghmatey deleted the elt_refactor branch June 30, 2022 12:57
Labels: Data DevOps, enhancement