duckdb-dbt

This repo is an example of using duckdb and dbt to build data.

The code is organized into three stages to show a progression of from simple queries to a modern data pipeline.

A toy dataset was created for the purposes of this repo called Farm To Market: a dataset about local agriculture.

Farm To Market

Dataset description

Farm To Market captures where local food is sold and might be grown in New York City.

This dataset combines the locations of farmers markets and potential farms (community gardens) to highlight availability and potential local suppliers of healthy produce in NYC.

Warning

This dataset's design is in-progress and the data dictionary below may not represent the final columns of a build.

Data Dictionary

farm_to_market table

Each row is a market and farm pair

market_name
farm_name
suitability_score
distance_ft
line_geometry_wgs84

markets table

Each row is a farmers market

market_name
accepts_ebt
distributes_health_bucks
open_year_round
address
borough
latitude
longitude
geometry_is_in_nyc
point_geometry_wgs84

farms table

Each row is a potential farm

type
farm_name
area_sqft
whole_lot
address
borough
bbl
latitude
longitude
point_geometry_wgs84
polygon_geometry_wgs84

Source data

Source data:

NYC Borough Boundaries [source]
NYC Farmers Markets [source]
GreenThumb Garden Info [source]
GreenThumb Block-Lot [source]

Development

Installations

VS Code [link] for an integrated development environment
git [link] for cloning this repo and installing bash terminal
Python 3.12 [link] for running python code

Setup

Setup a python virtual environment named .venv either using python -m venv .venv or using the VS Code command Python: create environment
Activate the virtual environment
```
source .venv/Scripts/activate
```

Install packages and confirm setup

python -m pip install --requirement setup/requirements.txt
pip list

Important

For NYC government employees, most work computers access the internet through a firewall. This causes issues when writing and running code (e.g. pip install fails).

See instructions below to allow code to access the internet on a city-issued PC.

Search for an select Edit environment variables for your account in the Start menu
Add two environment variables with the same value of http://bcpxy.nycnet:8080:
- http_proxy
- https_proxy

Stage 0: Sanity Check

Run a python script to confirm everything is setup

python -m stage_0.sanity_check

Stage 1: Load and explore data

Load and explore data from various sources

Source data

These datasets were chosen to show some of the ways source data can be imported with DuckDB.

PLUTO from the NYC Department of City Planning (source)
NYC Airbnb data (source)
Trip record data from the NYC Taxi and Limousine Commission (TLC) (source)

Steps

Download PLUTO from NYC Open Data here by navigating to Actions -> API -> Download file -> Export format: CSV
Rename the downloaded csv file to pluto.csv and move it to data/source_data/.
Run a python script to download the other 2 datasets programmatically and load all source data into a database:
```
python -m stage_1.load
```
Use the Jupyter notebook stage_1/explore.ipynb to explore the data

Stage 2: Pipeline

Use a data pipeline to build Farm To Market

Download all source data from their Open Data pages by navigating to Actions -> API -> Download file -> Export format: CSV

[!TIP] NYC Borough Boundaries must be downloaded as a geojson file by navigating to Export -> Download Geospatial Data -> GeoJSON.
Move the downloaded csv file to data/source_data/
. Run a python script to load all source data into a database:
```
python -m stage_2.load
```
(Optional) Use the Jupyter notebook stage_2/explore.ipynb to explore the source data

Run python scripts to transform and export data:

python -m stage_2.transform
python -m stage_2.export

Use the Jupyter notebook stage_2/analyze.ipynb to review and analyze the dataset

Stage 3: dbt pipeline

Use dbt to build to build Farm To Market

Install dbt packages and confirm setup:
```
dbt deps
dbt debug
```
Test source data:
```
dbt test --select "source:*"
```
Build the dataset:
```
dbt build
```

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
data		data
setup		setup
stage_0		stage_0
stage_1		stage_1
stage_2		stage_2
stage_3		stage_3
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

duckdb-dbt

Farm To Market

Dataset description

Data Dictionary

Source data

Development

Installations

Setup

Stage 0: Sanity Check

Stage 1: Load and explore data

Source data

Steps

Stage 2: Pipeline

Stage 3: dbt pipeline

About

Releases

Packages

Contributors 2

Languages

License

damonmcc/duckdb-dbt

Folders and files

Latest commit

History

Repository files navigation

duckdb-dbt

Farm To Market

Dataset description

Data Dictionary

Source data

Development

Installations

Setup

Stage 0: Sanity Check

Stage 1: Load and explore data

Source data

Steps

Stage 2: Pipeline

Stage 3: dbt pipeline

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages