Analytics Pipeline | dbt, Airflow, and BigQuery

A demo for the Ventura Analytics meetup: running an analytics pipeline locally using Apache Airflow, dbt, and BigQuery (BQ).

We're using Google Cloud Platform (GCP) and BQ for the purposes of this demo, but if you're familiar with another public cloud or warehouse, you can substitute the corresponding dbt adapter where applicable (e.g. dbt-bigquery -> dbt-snowflake), as sketched below.
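For example, a rough sketch of swapping the adapter with Poetry (this assumes dbt-bigquery is declared as a dependency in this repo's pyproject.toml; you would also need to point your dbt profiles.yml at the new warehouse):

    # swap the BigQuery adapter for the Snowflake one
    poetry remove dbt-bigquery
    poetry add dbt-snowflake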

Getting Started

Clone this repo

Clone the repo and open a terminal from the cloned directory

git clone https://github.com/chelseybeck/airflow-dbt-demo.git

This demo is still under development and updated regularly, so pull often (at least daily) to get the latest changes:

git pull

Prerequisites

  • Python 3.11+

  • Poetry

  • Git

  • GCP account

  • GCP Service Account (+ key) with the following permissions (allows Airflow to connect to BQ):

    • BigQuery Data Editor
    • BigQuery Data Viewer
    • BigQuery Job User

    The service account can be created manually in the GCP UI, locally using the Terraform module in the terraform directory (see details), or with the gcloud sketch below.
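A rough gcloud equivalent (assumes the gcloud CLI is installed and authenticated; the service-account name, key path, and $GCP_PROJECT are placeholders):

    # create the service account (name is illustrative)
    gcloud iam service-accounts create airflow-dbt-demo --project="$GCP_PROJECT"

    # grant the three BigQuery roles listed above
    for role in roles/bigquery.dataEditor roles/bigquery.dataViewer roles/bigquery.jobUser; do
      gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
        --member="serviceAccount:airflow-dbt-demo@${GCP_PROJECT}.iam.gserviceaccount.com" \
        --role="$role"
    done

    # download a JSON key for Airflow to use
    gcloud iam service-accounts keys create ~/keys/airflow-dbt-demo.json \
      --iam-account="airflow-dbt-demo@${GCP_PROJECT}.iam.gserviceaccount.com"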

Python Environment Setup

We're using Poetry for dependency management (installation is a prerequisite)

  1. Install dependencies

    poetry install
  2. Open a Poetry shell. Note: every new terminal should run inside this virtual environment (unless you're using a custom venv). Learn more

    poetry shell
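To confirm the environment is active and Airflow is installed (a quick sanity check; assumes apache-airflow is among the project's dependencies):

    poetry env info    # show the active virtual environment
    airflow version    # should print the installed Airflow version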

Set up Environment Variables

  1. Copy the example .env file

    cp .env.example .env
  2. Replace the file paths in .env with your system paths
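The actual variable names come from .env.example; a hypothetical example of the filled-in file (the key path and project ID are placeholders):

    # path to the GCP service account key
    GOOGLE_APPLICATION_CREDENTIALS=/Users/username/keys/airflow-dbt-demo.json
    # GCP project used by Airflow's BigQuery connection
    GCP_PROJECT=my-gcp-project-id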

Initialize Airflow and Run DAG

  1. Update Airflow configuration:

    • Find Airflow's Home directory

      airflow info
    • Update DAG directory

      Navigate to Airflow's home directory and open the airflow.cfg file, e.g.

      nano ~/airflow/airflow.cfg
    • Change the dags_folder path to the airflow-dbt-demo code repository and save. For example:

      dags_folder = /Users/username/airflow-dbt-demo/dags
      
    • Optional - disable the example DAGs. When load_examples is True (the default), Airflow populates the UI home page with many example DAGs on startup; set it to False to hide them:

      load_examples = False
      
  2. Initialize the database

    airflow db migrate
  3. Create a user

    # create an admin user
    airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email [email protected]
  4. Add a Google Cloud connection - allows Airflow to connect to BQ. Credentials are picked up from your .env file; make sure its variables are exported in your shell so $GCP_PROJECT expands below

    airflow connections add 'google_cloud_default' \
    --conn-type 'google_cloud_platform' \
    --conn-extra "{\"extra__google_cloud_platform__project\": \"$GCP_PROJECT\"}"
  5. Start the Airflow webserver:

    airflow webserver --port 8080

    Access the Airflow UI at localhost:8080/home and log in with the user you created

  6. Start the scheduler (in a separate terminal, inside the Poetry shell)

    airflow scheduler
  7. Run the DAGs from Airflow's UI (or from the CLI - see the sketch after this list):

  • Click on the DAG spotify_ingestion_dag

    • Loads Spotify data from a CSV file into BigQuery
  • Click the 'play' button to trigger the DAG (upper right corner)

  • Click on the DAG daily_dbt_dag

  • Click the 'play' button to trigger the DAG (upper right corner)
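If you prefer the command line, a rough equivalent of the UI steps (DAG IDs taken from this README; assumes the scheduler is running and the Poetry shell is active):

    # confirm Airflow is reading DAGs from this repo
    airflow config get-value core dags_folder
    airflow dags list

    # unpause and trigger the ingestion DAG, then the dbt DAG
    airflow dags unpause spotify_ingestion_dag
    airflow dags trigger spotify_ingestion_dag
    airflow dags unpause daily_dbt_dag
    airflow dags trigger daily_dbt_dag

    # check run status
    airflow dags list-runs -d daily_dbt_dag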
