Code Conventions
Warning: This page is a bit of a stub at the moment.
We use black and sqlfluff to lint and format our Python and SQL code.
black --diff --color --check directory/file.py
black directory/file.py
sqlfluff lint products/.../file.sql
sqlfluff fix products/.../file.sql
For code comments, we recommend using consistent tags (inspired by Better Comments):
# TODO refactor this function to make it faster
# ! Deprecated function, do not use
# * This is an important note
# ? Should this variable be renamed for consistency?
See dbt's own style guide for reference.
We use dbt-checkpoint to validate our dbt project conventions.
# Environment variables required by the product's profiles.yml file must be set
dbt deps --profiles-dir products/product_directory --project-dir products/product_directory
dbt seed --profiles-dir products/product_directory --project-dir products/product_directory
pre-commit run --all-files
We largely follow dbt's conventions but don't love the term "marts" for product/output tables, so we have:
- staging
- intermediate
- product
product models are output tables. Their columns are often renamed to be meaningful to business users. Every table that is exported and packaged as part of a build should be defined here.
Product tables do not need a prefix.
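As a minimal sketch (with hypothetical model and column names), a product model might simply select from an intermediate model and rename columns for business users:

```sql
-- models/product/projects.sql (hypothetical model and column names)
-- Product model: an output table, no prefix, columns renamed for business users
select
    bbl,
    proj_id as project_id,
    status_cd as project_status,
    cert_date as certification_date
from {{ ref('int__projects') }}
```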
staging models largely follow dbt's idea - any preprocessing step that does not join other tables and performs relatively simple operations on the data, without fundamentally changing its structure, can be a staging table.
All staging tables should have the stg__ prefix.
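As a rough sketch (with hypothetical source and column names), a staging model might do simple cleanup of a single source with no joins:

```sql
-- models/staging/stg__dob_permits.sql (hypothetical source and column names)
-- Staging model: simple cleanup of a single source, no joins
select
    lpad(bbl, 10, '0') as bbl,
    upper(borough) as borough,
    issued_date::date as issued_date
from {{ source('dob', 'permits') }}
```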
We're still deciding whether every data source needs a staging table; for now, we're tentatively saying intermediate tables can reference source tables directly. But if you find yourself renaming columns, padding strings, etc. on a source table in an intermediate script, you should probably create a staging table. There is one exception, where what could be a staging table is better suited to an intermediate table. But before getting into that...
intermediate models are everything else, i.e. the actual "transformation" logic of the pipeline.
All intermediate tables should have the int__ prefix. It's perfectly fine to have a bunch of intermediate files all at the root level of this folder, if they're logically named. However, we encourage grouping them into subfolders within the intermediate folder by the entities they represent.
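For illustration (with hypothetical model and column names), an intermediate model might join staging models and aggregate:

```sql
-- models/intermediate/int__permits_per_lot.sql (hypothetical)
-- Intermediate model: transformation logic that joins staging models
select
    lots.bbl,
    count(permits.permit_number) as permit_count
from {{ ref('stg__pluto_lots') }} as lots
left join {{ ref('stg__dob_permits') }} as permits
    on lots.bbl = permits.bbl
group by lots.bbl
```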
In the case of green_fast_track, many different data sources had transformations applied and were then buffered, creating many tables of buffered geometries to be used in the logic for flagging PLUTO lots. These can all go in intermediate/buffers (with the filename prefix int_buffers__). This is where an exception to the staging logic above might apply. Some buffers were inherently complex and required joining tables to calculate, so they do not belong in staging; this is why the buffers folder is in intermediate and not staging. However, there are also buffers that were created by performing relatively basic operations on a single data source. These could very much live in staging, but since the other buffers are already grouped in an intermediate subfolder, any would-be staging buffers can be put there instead.
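As a sketch (with a hypothetical model name, source, and buffer distance), a simple buffer model in intermediate/buffers might look like:

```sql
-- models/intermediate/buffers/int_buffers__wetlands.sql (hypothetical)
-- A simple buffer built from a single source, grouped with the other buffer models
select
    id,
    st_buffer(geom, 100) as geom_buffered  -- buffer distance is illustrative
from {{ ref('stg__wetlands') }}
```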
Topics still to be documented:
- python
- sql database connections (Gitlab orchestration_utils example)
- sql
- bash
  - raising errors with `set -e`
  - raising errors with
- github Emoji-Cheat-Sheet