🆕 DBT for Pluto #413

SPTKL · 2022-11-29T14:13:39Z

Hi NYCPlanning! I miss you all! I've been using DBT at my new job and it's amazing! I'm creating this PR just to give you a taste of how things could look if Pluto adopted DBT.

I tried to do this when I was still at NYCPlanning, but didn't have enough hands-on knowledge of how to structure things.

Things DBT solves in the Pluto context

table lineage: dbt figures out the dependencies needed to build each table and knows how to build independent tables in parallel, so we no longer need to maintain a super long 02_build.sh that manually specifies execution order
documentation: dbt fully supports table-level and column-level documentation (see pluto_build/models/sources.yml and pluto_build/models/staging/schema.yml)
the best thing about the docs is that you can view them in a browser, so you can have a public-facing documentation website that's searchable! You can use GitHub Actions to build the dbt models and serve the documentation website with GitHub Pages! All for free!

for each table you also get to see the SQL code that generates it!

you can also use the lineage graph to see table lineage

NOTE: I tried to mimic a lot of this behavior in devdb by adding comment headers in SQL files, but this is way, way better
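
To make the table lineage point above concrete, here is a minimal Python sketch of the idea: collect each model's `ref()` calls, then group models into batches that can be built in parallel. This is not dbt's actual implementation, and the model names other than stg_geocodes (stg_lots, pluto) and their SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model SQL keyed by model name; only stg_geocodes appears in this PR.
models = {
    "stg_geocodes": "select * from raw_geocodes",
    "stg_lots": "select * from raw_lots",
    "pluto": (
        "select * from {{ ref('stg_lots') }} "
        "join {{ ref('stg_geocodes') }} using (bbl)"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def parallel_batches(models):
    """Return lists of models; each list can be built once the previous lists are done."""
    # Map every model to the set of models it depends on.
    graph = {name: set(REF.findall(sql)) for name, sql in models.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # models whose dependencies are all built
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(models)
print(batches)
```

The two staging models have no dependencies, so they land in the first batch and could run at the same time; pluto waits for both. That ordering is what a hand-maintained 02_build.sh has to spell out manually.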

From a development perspective

1. You can easily maintain multiple environments from a single ~/.dbt/profiles.yml, without having to maintain multiple complicated .env files! e.g.

nycplanning:
  # default target, used when --target is not given
  target: pluto-dev
  outputs:
    pluto-dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      threads: 4
    pluto-prod:
      type: postgres
      host: digitalocean.hosted.database.url.com
      user: digitalocean
      password: XXXXXX
      port: 5432
      dbname: postgres
      schema: public
      threads: 4
    devdb-dev:
      type: postgres
      host: localhost
      user: postgres
      password: postgres
      port: 5433
      dbname: postgres
      schema: public
      threads: 4

This way, you can easily switch between projects and between prod and dev environments, in any repo, by specifying the target -> dbt run xxxxxx --target devdb-dev . This is helpful because it keeps the environment setup consistent across team members with only 1 file to maintain, instead of 1 file for each repo/project.
2. DBT makes it easy to create or replace schemas, tables and views when you want to rerun some code. The pluto repo has a lot of DROP TABLE IF EXISTS, CREATE TABLE and CREATE SCHEMA statements, which add a lot of bulk to the code. DBT abstracts all that away so you can focus on the business logic -> usually stated in a SELECT statement.
3. We started doing this in DevDB: using SELECT for business logic is recommended because it's more declarative and more transparent than INSERT or UPDATE.
4. Testing is also made easy. In the context of Pluto, we always want to make sure that e.g. there are no duplicated BBLs in certain tables. You can easily do so with dbt's built-in generic tests out of the box, and the dbt_utils package adds more. e.g.

version: 2

models:
 - name: stg_geocodes
   description: cleaned table for geocodes
   columns:
     - name: geo_bbl
       tests:
         - unique
         - not_null
     - name: borough
       tests:
         - not_null
         - accepted_values:
             values: [1, 2, 3, 4, 5]

This would run the following tests:
for column geo_bbl in table stg_geocodes, check that the field is unique and not null
for column borough, check that the field is not null and contains only values in [1, 2, 3, 4, 5]
The dbt test command makes it really easy to implement some of the QAQC checks that gave us a lot of headaches.
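
To show what these tests amount to, here is a rough Python sketch that runs equivalent checks against an in-memory SQLite table. Each check is a query that selects the failing rows, and a test passes when its query returns nothing. The SQL is an approximation of the idea, not the exact SQL dbt compiles, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_geocodes (geo_bbl text, borough integer)")
conn.executemany(
    "insert into stg_geocodes values (?, ?)",
    [
        ("1000010001", 1),
        ("1000010001", 1),  # duplicated BBL, should fail the unique check
        ("3000020002", 3),
        (None, 6),          # null BBL and an out-of-range borough
    ],
)

# Each query returns the offending rows; zero rows means the check passes.
checks = {
    "unique_geo_bbl": (
        "select geo_bbl from stg_geocodes "
        "where geo_bbl is not null group by geo_bbl having count(*) > 1"
    ),
    "not_null_geo_bbl": "select * from stg_geocodes where geo_bbl is null",
    "not_null_borough": "select * from stg_geocodes where borough is null",
    "accepted_values_borough": (
        "select borough from stg_geocodes where borough not in (1, 2, 3, 4, 5)"
    ),
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in checks.items()}
for name, count in failures.items():
    print(f"{name}: {'PASS' if count == 0 else f'FAIL ({count})'}")
```

With the YAML above, none of these queries has to be written or maintained by hand.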

Not implemented, but might be useful

dbt seed -> loading static csv files kept in the repo into the database
dbt analyses -> keeping analysis queries on your tables in the project; in the pluto context, these could be statistics, metrics, etc.
jinja templating -> with jinja you can write for loops, which makes it easy to generate a pivoted table, e.g. new housing units by year
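
As a sketch of the pivot idea in the last item, the snippet below uses a plain Python loop as a stand-in for a Jinja for loop, generating one column per year and running the result on SQLite. The housing table, its columns and its rows are hypothetical.

```python
import sqlite3

years = [2020, 2021, 2022]  # hypothetical list of years to pivot on


def pivot_sql(years):
    """Build a pivot query with one summed column per year.

    In a dbt model the same loop would be written in Jinja, roughly:
    {% for y in years %} sum(case when year = {{ y }} ...) as units_{{ y }} {% endfor %}
    """
    cols = ",\n".join(
        f"    sum(case when year = {y} then units else 0 end) as units_{y}"
        for y in years
    )
    return (
        "select\n    borough,\n"
        f"{cols}\n"
        "from housing\ngroup by borough\norder by borough"
    )


conn = sqlite3.connect(":memory:")
conn.execute("create table housing (borough integer, year integer, units integer)")
conn.executemany(
    "insert into housing values (?, ?, ?)",
    [(1, 2020, 10), (1, 2021, 5), (1, 2021, 7), (3, 2022, 4)],  # made-up rows
)

print(pivot_sql(years))
rows = conn.execute(pivot_sql(years)).fetchall()
print(rows)
```

Adding a year means adding one value to the list instead of copying another column expression.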

Good luck! lemme know if you have questions! I'm always on github! say hi to everyone for me thanks!

SPTKL added 5 commits November 28, 2022 23:48

testing out dbt

556099f

adding documentation

c10fa56

update gitignore

6a382f9

revert back

4ba2b4a

calculate areas

b44612f

damonmcc marked this pull request as draft February 10, 2023 15:50
