This is a collection (growing hopefully as time goes on) providing tips and tricks to ensure your experience building and maintaining Streaming Delta Lake applications (and the tables that power them) is absolutely joyful, even when the shit hits the fan :). Remember, when all else fails to take a deep breath, count to 5, and dig into the following content.
Take a look at the outline provided in /hitchhikers_guide/README.md to learn how to use the Guide in your Delta Lake streaming adventures.
Note: For x86_64 (linux/amd64) or (linux/arm64) use the common docker-compose.yaml Docker Image: This is using the
newfrontdocker/delta-docker:3.0.0
image. This will be replaced with the officialdelta-docker
image after delta-io/delta-docs#60 is merged and the image is pushed.
cd hitchhikers_guide && docker compose up
Note: The
docker-compose.yaml
includes settings to clamp the amount of local resources available for data processing (upper limit: 4 cpu cores, 16GB ram). If you want to increase the number of cores to reflect your laptop,desktop,server then remember to save at least 1 cpu core and at least 1GB ram for the root OS.
deploy:
resources:
limits:
cpus: '4'
memory: 16G
reservations:
cpus: '1'
memory: 4G
Then head into the Jupyter Lab environment and start exploring.
The main content will be located in http://127.0.0.1:8888/lab/tree/hitchhikers_guide
- ECommerce Data - the dataset used in the hitchhikers guide is a subset of the data from the original 5gb dataset.
Note: The full datasets are not included in this repo. If you'd like to explore large datasets like the ecommerce behavior dataset, then please download it from Kaggle.