This repository has been archived by the owner on Jul 13, 2023. It is now read-only.

🆕 DBT for Pluto #413

Draft
wants to merge 5 commits into
base: main


@SPTKL SPTKL commented Nov 29, 2022

Hi NYCPlanning! I miss you all! I've been using DBT at my new job and it's amazing! I'm creating this PR just to give you a taste of what things could look like if Pluto adopted DBT.

I tried to do this when I was still at NYCPlanning, but didn't have enough hands-on knowledge of how to structure things.

Things DBT solves in the Pluto context

  1. Table lineage: dbt works out the dependencies needed to build each table and knows which tables can be built in parallel, so we no longer need to maintain a super long 02_build.sh that manually specifies execution order.
  2. Documentation: dbt fully supports table-level and column-level documentation (see pluto_build/models/sources.yml and pluto_build/models/staging/schema.yml).
  3. The best thing about the docs is that you can view them in a browser, so you can have a public-facing, searchable documentation website! You can use GitHub Actions to build the dbt models and serve the documentation website with GitHub Pages! All for free!

image

for each table you also get to see the SQL code that generates it!

image

you can also use the lineage graph to see table lineage

image

NOTE: I tried to mimic a lot of this behavior in devdb by adding comment headers to the sql files, but this is way, way better.
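
To make the lineage point concrete, here is a toy Python sketch of the idea: read the ref() calls in each model, build a dependency graph, and derive both a build order and the batches that could run in parallel. This is only an illustration of the concept, not dbt's actual implementation. stg_geocodes is the model from this PR; the other model names, the raw tables, and the SQL bodies are made up for the example.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> SQL body. Only stg_geocodes is a real model
# name from this PR; everything else here is illustrative.
MODELS = {
    "stg_geocodes": "SELECT * FROM raw_geocodes",
    "stg_pts": "SELECT * FROM raw_pts",
    "int_bbl": (
        "SELECT * FROM {{ ref('stg_geocodes') }} "
        "JOIN {{ ref('stg_pts') }} USING (bbl)"
    ),
    "pluto": "SELECT * FROM {{ ref('int_bbl') }}",
}

# Matches {{ ref('model_name') }} and captures model_name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def dependency_graph(models):
    """Map each model to the set of models it references via ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def build_order(models):
    """Return one valid build order: dependencies always come first."""
    return list(TopologicalSorter(dependency_graph(models)).static_order())


def parallel_batches(models):
    """Group models into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(dependency_graph(models))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()      # everything whose dependencies are done
        batches.append(sorted(ready))   # sorted only to make output stable
        sorter.done(*ready)
    return batches


order = build_order(MODELS)
batches = parallel_batches(MODELS)
print(order)
print(batches)
```

With these toy models the two staging tables land in the first batch, then int_bbl, then pluto, which is the ordering 02_build.sh currently has to spell out by hand.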


From a development perspective

  1. You can easily maintain multiple environments with a single ~/.dbt/profiles.yml, instead of maintaining multiple complicated .env files! e.g.
nycplanning:
  target: pluto-dev  # default target, used when --target is not passed
  outputs:
    pluto-dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      threads: 4
    pluto-prod:
      type: postgres
      host: digitalocean.hosted.database.url.com
      user: digitalocean
      password: XXXXXX
      port: 5432
      dbname: postgres
      schema: public
      threads: 4
    devdb-dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5433
      dbname: postgres
      schema: public
      threads: 4

This way, you can easily switch between projects and between prod and dev environments in different repos by naming the target -> dbt run xxxxxx --target devdb-dev . This is helpful because you can keep the environment setup consistent across team members by maintaining only 1 file instead of 1 file per repo/project.
  2. DBT makes it easy to create a schema, table, or view, or to replace them when you want to rerun some code. In the pluto repo, there is a lot of DROP TABLE IF EXISTS, CREATE TABLE, and CREATE SCHEMA code, which adds a lot of bulk. DBT abstracts all that away so you can focus on the business logic -> usually stated in a SELECT statement.
  3. We started trying to do this in DevDB: using SELECT for business logic is recommended because it's more declarative and more transparent than INSERT or UPDATE.
  4. Testing is also made easy. In the Pluto context we always want to make sure that, e.g., there's no duplicated BBL in certain tables. Checks like these work out of the box with dbt's built-in generic tests, and the dbt_utils package adds more. e.g.

version: 2

models:
 - name: stg_geocodes
   description: cleaned table for geocodes
   columns:
     - name: geo_bbl
       tests:
         - unique
         - not_null
     - name: borough
       tests:
         - not_null
         - accepted_values:
             values: [1, 2, 3, 4, 5]

This would run the following tests:

  • For column geo_bbl of table stg_geocodes, check that the field is unique and not null.
  • For column borough, check that the field is not null and contains only values in [1, 2, 3, 4, 5].

The dbt test command makes it really easy to implement some of the QAQC checks that gave us a lot of headaches.
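
For intuition on what those tests do: conceptually, each one is a query that selects the rows violating the rule, and the test passes when the query returns nothing. Here is a small Python sketch of that idea using an in-memory SQLite table. The SQL is hand-written for the example and is not the SQL dbt generates, and the sample rows (including the BBL values) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_geocodes (geo_bbl TEXT, borough INTEGER)")
conn.executemany(
    "INSERT INTO stg_geocodes VALUES (?, ?)",
    [
        ("1000010001", 1),
        ("2000010001", 2),
        ("2000010001", 2),  # duplicate geo_bbl
        (None, 6),          # null geo_bbl and an out-of-range borough
    ],
)

# Each check selects the offending rows; zero rows means the check passes.
# Check names are illustrative labels, not names produced by dbt.
CHECKS = {
    "unique_geo_bbl": """
        SELECT geo_bbl FROM stg_geocodes
        WHERE geo_bbl IS NOT NULL
        GROUP BY geo_bbl HAVING COUNT(*) > 1
    """,
    "not_null_geo_bbl": "SELECT * FROM stg_geocodes WHERE geo_bbl IS NULL",
    "not_null_borough": "SELECT * FROM stg_geocodes WHERE borough IS NULL",
    "accepted_values_borough": """
        SELECT borough FROM stg_geocodes
        WHERE borough NOT IN (1, 2, 3, 4, 5)
    """,
}


def run_checks(connection, checks):
    """Return {check name: number of failing rows}."""
    return {
        name: len(connection.execute(sql).fetchall())
        for name, sql in checks.items()
    }


results = run_checks(conn, CHECKS)
for name, failures in results.items():
    print(f"{name}: {'PASS' if failures == 0 else f'FAIL ({failures})'}")
```

With the sample rows above, only not_null_borough passes; the other three checks each catch one problem.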


Not implemented, but might be useful

  • dbt seed -> importing static CSV files from the repo into the database
  • dbt analyses -> running analyses on your tables; in the pluto context, these can be statistics, metrics, etc.
  • jinja templating -> using jinja, you can write for loops, making it easy to generate a pivoted table, e.g. new housing units by year
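
To illustrate the pivot idea from the last bullet: in dbt the loop would be a Jinja for loop inside the model's SQL, but the same pattern can be sketched in plain Python, which stands in for Jinja here. A loop over years generates one aggregated column per year. The table new_housing, its columns, the years, and the rows are all hypothetical.

```python
import sqlite3

YEARS = [2020, 2021, 2022]  # hypothetical years to pivot on


def pivot_sql(years):
    """Build a SELECT with one SUM(CASE ...) column per year."""
    columns = ",\n".join(
        # int() guards against anything but plain integers ending up in the SQL
        f"    SUM(CASE WHEN year = {int(y)} THEN units ELSE 0 END) AS units_{int(y)}"
        for y in years
    )
    return (
        "SELECT\n"
        "    borough,\n"
        f"{columns}\n"
        "FROM new_housing\n"
        "GROUP BY borough\n"
        "ORDER BY borough"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_housing (borough INTEGER, year INTEGER, units INTEGER)")
conn.executemany(
    "INSERT INTO new_housing VALUES (?, ?, ?)",
    [(1, 2020, 10), (1, 2021, 5), (2, 2020, 7), (2, 2022, 3), (1, 2020, 2)],
)

sql = pivot_sql(YEARS)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Adding a year to the list adds a column to the output without touching the rest of the query, which is the main appeal of templating the SQL.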

Good luck! lemme know if you have questions! I'm always on github! say hi to everyone for me thanks!

@damonmcc damonmcc marked this pull request as draft February 10, 2023 15:50