Docker Compose stack to run an open-source lakehouse on a single machine using Prefect, Iceberg, Trino and Superset.
To run the stack you need a Linux, Windows or macOS computer with the following dependencies installed:
- docker
- docker-compose
- bash
- wget
Notes:
- Minimum Docker resources: 4 cores and 8 GB of RAM
- The following ports should be available: 80, 4200, 5006, 8088
- On Windows, WSL is required
- Clone this repository and `cd` into it.
- Use the `lakehouse.sh` bash script to start/stop the stack. Options:
  - `lakehouse.sh start`: initialize the environment (if needed) and start all services.
  - `lakehouse.sh stop`: stop all services.
  - `lakehouse.sh restart`: restart all services.
  - `lakehouse.sh status`: display the services' status.
  - `lakehouse.sh reset`: reset the environment. All ingested data will be deleted.
  - `lakehouse.sh build`: rebuild the docker images.
The environment is configured by default to run on Docker Desktop. When running on Linux, edit the `.env` file and set the `HOST_OR_IP` variable to the machine's IP address or DNS name.
Example:

```
# HOST_OR_IP=host.docker.internal
HOST_OR_IP=10.10.10.10
```
No other configuration changes are needed. `lakehouse.sh start` initializes all services, including databases, Prefect deployment registration, etc.
Initialization runs automatically the first time the environment is started and the first time it is started after an environment reset.
The LakeHouse stack contains the following services:

- `traefik`: reverse proxy.
- `iceberg`: Iceberg metadata catalog.
- `minio`: table storage for Iceberg.
- `trino`: query engine.
- `superset`: data exploration and visualization.
- `jupyter`: notebook configured to access Iceberg and Trino.
- `prefect-server`: workflow engine server used for data ingestion.
- `prefect-worker`: workflow engine worker.
- `postgres`: SQL database used by Prefect, Superset and the Iceberg metadata catalog.
Some of the services provide user interfaces. These are their URLs:
- Prefect: http://localhost:4200
- Superset: http://localhost:8088 (admin/admin)
- Jupyter: http://localhost:5006
Alternatively, the following URLs can be used after updating your `/etc/hosts` file:
- Jupyter: http://jupyter.lakehouse.localhost
- Prefect: http://prefect.lakehouse.localhost
- Superset: http://superset.lakehouse.localhost
```
# /etc/hosts
127.0.0.1 jupyter.lakehouse.localhost prefect.lakehouse.localhost superset.lakehouse.localhost trino.lakehouse.localhost minio.lakehouse.localhost
```
When the environment is initialized, the Prefect flow `data-to-dashboard` is automatically registered. This flow ingests `csv` or `parquet` files using DuckDB and creates a simple dashboard for the ingested data.
A sample dataset is provided and can be ingested with the following steps:

- Navigate to Prefect's UI -> Deployments.
- Click on the three dots to the right of the `data-to-dashboard:dev` deployment.
- Select `Quick run`.
- After the flow run has started, you can navigate to the `Flow Runs` section and see the flow's logs.
- Once the flow run has successfully completed, navigate to the Superset UI. A dashboard with a sample table chart should be available.
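Alternatively, the same run can be triggered from Python instead of the UI. Below is a minimal sketch, assuming the deployment resolves to `data-to-dashboard/dev` in Prefect's `flow/deployment` naming and that the Prefect API is reachable at the default local URL; adjust both if your setup differs.

```python
import os

from prefect.deployments import run_deployment

# Point the Prefect client at the local server started by this stack.
os.environ.setdefault("PREFECT_API_URL", "http://localhost:4200/api")

# Deployment name format is "<flow name>/<deployment name>"; the UI shows it
# as data-to-dashboard:dev, which is assumed to map to the name below.
flow_run = run_deployment(name="data-to-dashboard/dev", timeout=0)  # timeout=0: don't wait
print(f"Started flow run {flow_run.name} ({flow_run.id})")
```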
A sample Jupyter notebook is provided with code to ingest and query a parquet file:

- Navigate to Jupyter's UI.
- Open the notebook `notebooks/lakehouse.ipynb`.
- Run the notebook.
- Once the notebook has run successfully, navigate to the Superset UI. A dashboard with a sample table chart should be available.
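Outside the notebook, the ingested table can also be queried through Trino with the `trino` Python client. The sketch below is an assumption-heavy example: the host relies on the `/etc/hosts` entries above, and the user, catalog, schema and table names are placeholders to replace with your own.

```python
from trino.dbapi import connect  # pip install trino

# Host/port assume the traefik route from the /etc/hosts section above;
# catalog, schema and table are placeholders for whatever you ingested.
conn = connect(
    host="trino.lakehouse.localhost",
    port=80,
    user="lakehouse",
    catalog="iceberg",
    schema="sample_db",
)

cur = conn.cursor()
cur.execute("SELECT * FROM sample_table LIMIT 10")
for row in cur.fetchall():
    print(row)
```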
Custom csv or parquet files can be ingested with Prefect:
- Copy the data file into the `data/datasets` folder.
- Navigate to Prefect's UI -> Deployments.
- Click on the three dots to the right of the `data-to-dashboard:dev` deployment.
- Select `Custom run`.
- Set the url for your file. It should start with `/lakehouse-poc/datasets/`.
- Set the database name (Iceberg namespace) and table name.
- All other parameters are optional and apply to csv files only.
- Click on `Submit`.
- After the flow run has started, you can navigate to the `Flow Runs` section and see the flow's logs.
- Once the flow run has successfully completed, navigate to the Superset UI. A dashboard with a sample table chart should be available. A `datetime` column is required for the dashboard to work.
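A custom run can also be submitted from Python with Prefect's `run_deployment`. The sketch below is hedged: the parameter names (`url`, `database`, `table`) mirror the fields described above but may not match the deployment's actual parameter schema (check it in the Prefect UI), and the file path is only an example.

```python
import os

from prefect.deployments import run_deployment

os.environ.setdefault("PREFECT_API_URL", "http://localhost:4200/api")

# Parameter names are assumptions; verify them against the deployment's
# parameter schema in the Prefect UI before relying on this.
flow_run = run_deployment(
    name="data-to-dashboard/dev",
    parameters={
        "url": "/lakehouse-poc/datasets/my_data.parquet",  # hypothetical file
        "database": "my_db",
        "table": "my_table",
    },
    timeout=0,  # return immediately instead of waiting for the run to finish
)
print(f"Submitted flow run {flow_run.id}")
```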
You can adapt the ingestion code to your own use case by editing `./data/pipelines/ingestion.py`. The code is mounted into the docker container, so the next time the flow runs it will pick up your changes. If new dependencies are needed, add them to `./data/pipelines/requirements.txt`, run `./lakehouse.sh build` and restart the environment with `./lakehouse.sh restart`.
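As a rough illustration of the kind of code that lives in such a pipeline, the sketch below shows a generic Prefect-plus-DuckDB pattern for reading a csv/parquet file into an Arrow table and appending it to an Iceberg table with pyiceberg. It is not the repository's actual `ingestion.py`; the catalog name and the assumption that the target namespace already exists are placeholders for this stack's real configuration.

```python
import duckdb
from prefect import flow, task
from pyiceberg.catalog import load_catalog


@task
def read_file(path: str):
    """Read a csv or parquet file into a pyarrow Table with DuckDB."""
    if path.endswith(".parquet"):
        rel = duckdb.sql(f"SELECT * FROM read_parquet('{path}')")
    else:
        rel = duckdb.sql(f"SELECT * FROM read_csv_auto('{path}')")
    return rel.arrow()


@task
def append_to_iceberg(arrow_table, database: str, table: str):
    """Append the Arrow table to an Iceberg table via pyiceberg.

    Assumes a catalog named "default" is configured for this stack and that
    the target namespace (`database`) already exists.
    """
    catalog = load_catalog("default")
    iceberg_table = catalog.create_table_if_not_exists(
        f"{database}.{table}", schema=arrow_table.schema
    )
    iceberg_table.append(arrow_table)


@flow
def ingest(path: str, database: str, table: str):
    arrow_table = read_file(path)
    append_to_iceberg(arrow_table, database, table)
```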
Superset provides many different chart types; you can create new charts and add them to dashboards. The ingestion pipeline sets up a Superset dataset for your csv/parquet file, on top of which new charts can be created.
- Add Spark for ingestion and query.
- Add an option to use Nessie as the Iceberg catalog. Currently waiting for pyiceberg to add support for it.