Demo for running an analytics pipeline locally using Apache Airflow, dbt, and BigQuery (BQ).
We're using Google Cloud Platform (GCP) and BQ for the purposes of this demo, but if you're familiar with another public cloud or warehouse, you can substitute where applicable (e.g. `dbt-bigquery` -> `dbt-snowflake`).
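For example, swapping the dbt adapter with Poetry might look like the sketch below; this assumes Snowflake is your target and that you update your dbt profile to match:

```bash
# Replace the BigQuery adapter with the Snowflake one
poetry remove dbt-bigquery
poetry add dbt-snowflake
```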
Clone the repo and open a terminal in the cloned directory:

```bash
git clone https://github.com/chelseybeck/airflow-dbt-demo.git
cd airflow-dbt-demo
```
This demo is still under development and updated regularly, so pull often (at least daily) to get the latest changes:

```bash
git pull
```
- GCP Service Account (+ key) with the following permissions (allows Airflow to connect to BQ):
  - BigQuery Data Editor
  - BigQuery Data Viewer
  - BigQuery Job User

  The service account can be created manually in the GCP UI or locally using the Terraform module in the `terraform` directory (see details); a gcloud alternative is sketched below.
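If you'd rather use the CLI, here is a minimal gcloud sketch. The service-account name `airflow-demo` and the key path are placeholders, and `$GCP_PROJECT` is assumed to hold your project ID:

```bash
# Create the service account (name is a placeholder)
gcloud iam service-accounts create airflow-demo --project "$GCP_PROJECT"

# Grant the three BigQuery roles listed above
for role in roles/bigquery.dataEditor roles/bigquery.dataViewer roles/bigquery.jobUser; do
  gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
    --member "serviceAccount:airflow-demo@${GCP_PROJECT}.iam.gserviceaccount.com" \
    --role "$role"
done

# Download a key for Airflow to use
gcloud iam service-accounts keys create ~/keys/airflow-demo.json \
  --iam-account "airflow-demo@${GCP_PROJECT}.iam.gserviceaccount.com"
```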
We're using Poetry for dependency management (installation is a prerequisite).
- Install dependencies:

  ```bash
  poetry install
  ```
- Open a Poetry shell. Note: every new terminal should run inside this virtual environment (unless you're using a custom venv). Learn more

  ```bash
  poetry shell
  ```
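  If you'd rather not spawn a shell, `poetry run` executes a single command inside the same virtual environment:

  ```bash
  # One-off alternative to entering the Poetry shell
  poetry run airflow info
  ```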
- Copy the example `.env` file:

  ```bash
  cp .env.example .env
  ```
- Replace the file paths in `.env` with your system paths.
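  A filled-in `.env` might look like the sketch below. `GCP_PROJECT` is referenced by the connection command later; the credentials variable and all values shown are assumptions, so keep the keys from `.env.example`:

  ```bash
  # Hypothetical values - replace with your own project ID and paths
  GCP_PROJECT=my-gcp-project-id
  GOOGLE_APPLICATION_CREDENTIALS=/Users/username/keys/airflow-demo.json
  ```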
- Update Airflow's configuration:

  - Find Airflow's home directory:

    ```bash
    airflow info
    ```
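    The home directory appears in the command's output; a quick filter (assuming the field is named `airflow_home`, as in recent Airflow releases):

    ```bash
    airflow info | grep -i airflow_home
    ```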
  - Update the DAG directory. Navigate to Airflow's home directory and open the `airflow.cfg` file, e.g.:

    ```bash
    nano ~/airflow/airflow.cfg
    ```
    Change the `dags_folder` path to the `airflow-dbt-demo` code repository and save. For example:

    ```ini
    dags_folder = /Users/username/airflow-dbt-demo/dags
    ```
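    To confirm the change took effect, you can ask Airflow to echo the value back:

    ```bash
    # Prints the dags_folder currently in effect
    airflow config get-value core dags_folder
    ```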
  - Optional: disable the DAG examples. When `load_examples` is set to `True`, many example DAGs are shown on the UI home page when Airflow is started:

    ```ini
    load_examples = False
    ```
Initialize the database
airflow db migrate
- Create a user:

  ```bash
  # create an admin user
  airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email [email protected]
  ```
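  You can verify the account was created:

  ```bash
  # Lists existing Airflow users
  airflow users list
  ```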
- Add a Google Cloud connection. This allows Airflow to connect to BQ; Airflow picks up credentials from your `.env` file:

  ```bash
  airflow connections add 'google_cloud_default' \
    --conn-type 'google_cloud_platform' \
    --conn-extra "{\"extra__google_cloud_platform__project\": \"$GCP_PROJECT\"}"
  ```
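  To check that the connection registered as expected:

  ```bash
  # Shows the stored connection details
  airflow connections get google_cloud_default
  ```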
- Start the Airflow webserver:

  ```bash
  airflow webserver --port 8080
  ```

  Access the Airflow UI at `localhost:8080/home` and log in.
Start the scheduler
airflow scheduler
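  With the scheduler running, you can confirm that the demo DAGs were picked up from your `dags_folder`:

  ```bash
  # Lists the DAGs Airflow has parsed, including the two used below
  airflow dags list
  ```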
- Run the DAGs from Airflow's UI:

  - Click on the `spotify_ingestion_dag` DAG, which loads Spotify data from a CSV file into BigQuery.
  - Click the 'play' button to trigger the DAG (upper right corner).
  - Click on the `daily_dbt_dag` DAG, which runs the dbt jobs (see the `models` directory).
  - Click the 'play' button to trigger the DAG (upper right corner).
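The UI is the intended path, but the same runs can also be triggered from the CLI if you prefer:

```bash
# Trigger each DAG once from the command line
airflow dags trigger spotify_ingestion_dag
airflow dags trigger daily_dbt_dag
```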