
Code Conventions

fvankrieken edited this page Jun 10, 2024 · 12 revisions

Warning

This page is a bit of a stub at the moment

Code formatting

We use black and sqlfluff to lint and format our Python and SQL code, respectively.

black --diff --color --check directory/file.py
black directory/file.py

sqlfluff lint products/.../file.sql
sqlfluff fix products/.../file.sql

For code comments, we recommend using consistent tags (inspired by Better Comments):

# TODO refactor this function to make it faster
# ! Deprecated function, do not use
# * This is an important note
# ? Should this variable be renamed for consistency?
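
As a rough illustration of why consistent tags are useful, the sketch below scans Python source for the four tags listed above so they can be reviewed in one pass. The function name, the labels in `TAG_LABELS`, and the sample source are hypothetical and not part of any tool we use; it only matches comments that start a line.

```python
import re

# Tag markers come from the list above; the labels on the right are
# hypothetical names chosen for this sketch.
TAG_LABELS = {
    "TODO": "todo",
    "!": "warning",
    "*": "important",
    "?": "question",
}

# Matches a comment that starts a line: "#", one space, a tag, then the text.
TAG_PATTERN = re.compile(r"^\s*#\s(TODO|!|\*|\?)\s+(.*)$")


def find_tagged_comments(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, label, text) for each tagged comment line."""
    found = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TAG_PATTERN.match(line)
        if match:
            tag, text = match.groups()
            found.append((line_number, TAG_LABELS[tag], text.strip()))
    return found


example = """\
# TODO refactor this function to make it faster
def slow():
    # ? Should this variable be renamed for consistency?
    x = 1  # plain trailing comment, not matched
    return x
"""

for item in find_tagged_comments(example):
    print(item)
```

Untagged and trailing comments are deliberately ignored, so only comments that follow the convention show up in the results.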

dbt

See dbt's own style guide for reference.

We use dbt-checkpoint to validate our dbt project conventions.

# Environment variables required by the product's profiles.yml file must be set
dbt deps --profiles-dir products/product_directory --project-dir products/product_directory
dbt seed --profiles-dir products/product_directory --project-dir products/product_directory
pre-commit run --all-files
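
If you run these checks for several products, a small helper can build the same three commands for a given product directory. This is only a sketch: `dbt_check_commands` is a hypothetical name, and the commands and flags are exactly those shown above.

```python
import shlex


def dbt_check_commands(product_dir: str) -> list[list[str]]:
    """Build the deps, seed, and pre-commit commands for one product directory."""
    dir_flags = ["--profiles-dir", product_dir, "--project-dir", product_dir]
    return [
        ["dbt", "deps", *dir_flags],
        ["dbt", "seed", *dir_flags],
        ["pre-commit", "run", "--all-files"],
    ]


# "products/product_directory" is the placeholder path used above.
for command in dbt_check_commands("products/product_directory"):
    print(shlex.join(command))
```

To actually run them, each list could be passed to `subprocess.run(command, check=True)`, with the environment variables required by the product's profiles.yml already set.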

Model Folders/File Structure

We largely follow dbt's conventions but don't love the term "marts" for product/output tables, so our model folders are:

  • staging
  • intermediate
  • product

product

product models are output tables. Their columns are often renamed for the benefit of business users. Every table that is exported and packaged as part of a build should be defined here.

Product tables do not need a prefix.

staging

staging models largely follow dbt's idea: any preprocessing step that does not join other tables and performs relatively simple operations on the data, without fundamentally changing its structure, can be a staging table.

All staging tables should have the stg__ prefix.

We're still deciding whether every data source needs a staging table. For now, we're tentatively saying intermediate tables can reference source tables directly. But if you find yourself renaming columns, padding strings, etc. on a source table in an intermediate script, you should probably create a staging table. There is one exception, where what could be a staging table is better placed in the intermediate folder. But before getting into that...

intermediate

intermediate models are everything else, i.e. the actual "transformation" logic of the pipeline.

All intermediate tables should have int__ prefixes. It's perfectly fine to have a bunch of intermediate files at the root level of this folder, as long as they're logically named. However, we encourage grouping files into subfolders within the intermediate folder by the entities they represent.

In the case of green_fast_track, many different data sources had transformations applied and were then buffered, creating many tables of buffered geometries used in the logic for flagging pluto lots. These can all go in intermediate/buffers (with the filename prefix int_buffers__). This is where an exception to the staging logic above might apply. Some buffers were inherently complex and required joining tables to calculate. They therefore do not belong in staging, which is why the buffers folder is in intermediate and not staging. However, there are also buffers that were created by performing relatively basic operations on a single data source. These could just as well live in staging. But since the other buffers are already grouped in an intermediate subfolder, any would-be staging buffers can be put there instead.
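
The prefix rules above can be summarized in a short check. This is a minimal sketch, not something enforced in our tooling: the function names, the `models/` root, and the example file names are hypothetical, and the rule that an intermediate subfolder adds its name to the prefix (int_buffers__) is inferred from the single green_fast_track example.

```python
from pathlib import PurePosixPath


def expected_prefix(model_path: str) -> str:
    """Return the filename prefix implied by a model's folder, or "" if none."""
    parts = PurePosixPath(model_path).parts
    if "staging" in parts:
        return "stg__"
    if "intermediate" in parts:
        # Folders between "intermediate" and the file name, e.g. ("buffers",).
        subfolders = parts[parts.index("intermediate") + 1 : -1]
        # Assumption: only the first subfolder is reflected in the prefix.
        return f"int_{subfolders[0]}__" if subfolders else "int__"
    # Product models (and anything else) need no prefix.
    return ""


def follows_convention(model_path: str) -> bool:
    """Check that the file name starts with the prefix its folder implies."""
    return PurePosixPath(model_path).name.startswith(expected_prefix(model_path))


# Hypothetical model paths, for illustration only.
for path in [
    "models/staging/stg__pluto.sql",
    "models/intermediate/int__flags.sql",
    "models/intermediate/buffers/int_buffers__parks.sql",
    "models/intermediate/buffers/int__parks.sql",
    "models/product/lots.sql",
]:
    print(path, follows_convention(path))
```

Under these assumptions, a simple single-source buffer placed in intermediate/buffers still takes the int_buffers__ prefix, matching the exception described above.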
