Public dbt instance to aid in data transformation for analytics purposes. If you're interested in setting up your own dbt project, you can find detailed instructions in the dbt documentation.
Pay attention, it is very important to know if your modification to this repository is a release (breaking changes), a feature (functionalities) or a patch(to fix bugs). With that information, create your branch name like this:
major/<branch-name>
minor/<branch-name>
patch/<branch-name>
If branch is already made, just rename it before passing the pull request.
dbt is composed of different moving parts working harmoniously. All of them are important to what dbt does — transforming data. When you execute dbt run
or dbt build
, you are running a model that will transform your data without that data ever leaving your warehouse.
dbt run
will execute the compiled sql models without running tests, snapshots, nor seeds (if there are any). More information about dbt run
can be found here
dbt build
will do everything dbt run
does plus run tests, snapshots, and seeds (if there are any). More information about dbt build
can be found here
The top level of a dbt workflow is the project. A project generally consists of a project configuration .yml
(in this case the dbt_project.yml) and a directory of the models. The dbt_project.yml
tells dbt the project context, and the models let dbt know how to build a specific data set. These models simplify analytics by generating data mart tables.
A model is a single .sql
file containing a final select statement for the table or view. The table/view name is the same as the file name. A project can have multiple models where a model can be classified as a staging, intermediate, or mart table/view.
In dbt, you can configure the materialization of your models. Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt. They are:
- View: your model is rebuilt as a view on each run
- Table: your model is rebuilt as a table on each run
- Incremental: allows dbt to insert or update records into a table since the last time that dbt was run
- Ephemeral: these are not directly built into the database. instead, dbt will interpolate the code from this model into dependent models as a common table expression
- Materialized View: allows the creation and maintenance of materialized views in the target database
Note: More information about materializations can be found here
Note: More information about dbt projects can be found here.
The stellar-dbt-public project follows a staging
, intermediate
, and marts
approach to modeling.
- Staging The purpose of the staging layer is to receive the raw data from the source and prepare it for further transformations and analysis.
What to do in staging:
- Column selection
- Renaming columns
- Definition of data types (casting columns to String, Numeric...)
- Flattening of structured objects
- Initial filters
- Basic cleanup, like replacing empty strings with NULL, for example
What not to do in staging:
- Joins
- Aggregations
Note: More information about
staging
layer can be found here.
- Intermediate In the intermediate layer, the preparation for directing the data to the marts takes place. Not every model in the staging layer will become an intermediate model. This is where it is possible to combine different models to start assembling business rules.
What to do in intermediate:
- Joins between different staging or intermediate models
- Aggregations or re-graining to fan out or collapse columns to the desired granularity
- Complex logic such as the integration of business logic, rules, or metrics
What not to do in intermediate:
- Ingestion of raw data
- Dimensional modeling (creation of fact, dimension, or aggregate tables)
- The actions listed in staging
Note: More information about
intermediate
layer can be found here.
- Marts
The marts layer is where users can access final dimensional modeling tables. Each model will be accompanied by a
.yml
file with the same name. The.yml
file contains the descriptions and tests for the model and its columns.
What to do in marts:
- Organization of data into dimension, fact, or aggregate tables
- Mart-specific tweaks
- Final joins for staging and intermediate models
- End user documentation
- Final testing
What not to do in marts:
- Data cleaning
- The actions listed in staging or intermediate
Note: More information about
marts
layer can be found here.
Name | Description |
---|---|
Source | Stores the raw data that will be used as inputs for the dbt transformations. |
Staging | Stages the pre-processed or cleaned data before performing the transformations. |
Intermediate | Contains the transformed and processed data. |
Marts | Houses the final data models or data marts, which are the end results of the dbt project. |
Docs | Stores documentation related to your dbt project. |
Macros | Contains reusable SQL code snippets known as macros. |
Tests | Contains defining tests to validate the accuracy and correctness of the data transformations. |
Tests are assertions made about models and other resources in the dbt project. When you run dbt test
or dbt build
, dbt will tell you if each test in your project passes or fails.
There are three different test types:
- Singular data tests: a SQL query that returns failing rows
- Generic data tests: is a parameterized query that accepts arguments. The test query is defined in a special test block (like a macro)
- Unit tests: validates SQL modeling logic on a small set of static inputs before materializing the full model
Note: More information about
data tests
can be found here
Note: More information about
unit tests
can be found here
Clone the git repository locally
git clone https://github.com/stellar/stellar-dbt-public.git
- Create a new
.env
file by copyingexample.env
cp example.env .env
- Set the various
DBT_*
variables (e.g.DBT_DATASET
,DBT_PROJECT
) - Execute
setup.sh
to create a virtual environment and install required dbt dependencies
source setup.sh
- Set up OAuth by running
gloud init
and following the account information prompts - To verify if all libraries have been correctly installed, use the command
pip list
- Execute the command
dbt debug
to ensure that all configurations are working correctly
Note: You can find instructions to install the
GCP CLI
here
Note: More information about
dbt core
setup can be found here
Note: You may need to modify
profiles.yml
anddbt_project.yml
if any of the above commands fail
profiles.yml docs >
dbt_project.yml docs
Executing a dbt model is the same as executing a SQL script. There are many ways to run dbt models:
- Run the entire project with
dbt <run or build>
- Run tagged models, paths, or configs with
dbt <run or build> --select tag:tag_1 tag_2 tag_3
- Run specific models with
dbt <run or build> --select model_1 model_2 model_3
Note: this runs only the specified models and none of its dependencies
- Run the model with its downstream and/or upstream models by adding a
+
to the beginning (downstream) and/or the end (upstream) of the model name
dbt <run or build> --select +model_1+
Note: dbt supports
--excluding
,--defer
,--target
, and many other selection types. Please refer to the node selection syntax documentation.
Execute tests with the commands dbt test
or dbt build
with the desired node selection syntax. This will run schema, data, and unit tests unless they are explicitly excluded. More information about tests can be found here.
In order to enable other repositories to work with and build on top of this dbt repo, it was configured as a package. Due to this, commiting to the repository requires a few extra steps to ensure git tagging is consistent, so that changes will not break any code downstream. When commiting changes, there are 3 main version changes that can be applied to the repository, as follows:
Change | Description |
---|---|
Major | Version when making incompatible changes |
Minor | Version change when adding functionalities in a backward compatible manner |
Patch | Version change when making backward compatible bug fixes |
In order to apply git tagging properly, the last commit message from a repo must contain a hashtag(#) followed by the change type present in that commit. This will allow github to detect the type of change being pushed and increment the version accordingly. For example:
'Fix trade-agg bug #patch'
TODO