
Big Data Workshop

A hands-on workshop on processing big data in Python: multiple processing strategies, compression, caching, and pipeline orchestration.

Getting Started: (Mac Users)

  1. Prerequisites:
    1. poetry is installed
      which poetry
      If poetry is not found, install it using pipx:
      brew install pipx
      pipx install poetry
      pipx ensurepath
      source ~/.zshrc
      
    2. pyenv is installed
      pyenv --version
      If not, install it with Homebrew:
      brew install pyenv

Getting Started: (Windows Users)

  1. Install scoop
  2. Use scoop to install pipx and make:
    scoop install pipx
    scoop install make
  3. Use pipx to install poetry:
    pipx install poetry
    pipx ensurepath
  4. Install pyenv-win (follow the installation instructions in the pyenv-win documentation)
  5. Reload your terminal

Getting Started: (All Users)

  1. Use pyenv to install Python 3.11.6:
    pyenv install 3.11.6
    
  2. Install everything with make:
    make setup
  3. Install pre-commit hooks:
    pipx install pre-commit
    pipx ensurepath
    pre-commit install
  4. Run the pre-commit hooks against all files: pre-commit run -a

Make Data

We're going to create "a lot" of data and store it in S3.

make create-data

View data here: https://yb-big-data-workshop-1.s3-us-west-2.amazonaws.com/index.html

Compressing data

Compressing the data decreases its size by roughly 10x. We can compress while writing directly from polars by using pgzip.open(...). (Note: pgzip is a parallel implementation of gzip.)
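
A minimal sketch of that write path, assuming the pgzip package is installed (pip install pgzip); the file name, columns, and thread count below are illustrative, not the workshop's actual schema:

    import polars as pl
    import pgzip

    # Hypothetical frame standing in for the generated weather data.
    df = pl.DataFrame({
        "station_name": ["Alexandria", "Alexandria"],
        "temperature": [20.5, 18.3],
    })

    # Write the CSV through a parallel-gzip stream; pgzip spreads the
    # compression work across threads, unlike the stdlib gzip module.
    with pgzip.open("measurements.csv.gz", "wb", thread=8) as f:
        df.write_csv(f)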

Resources:

This article was the inspiration: Python One Billion Row Challenge — From 10 Minutes to 4 Seconds


Actual Course:

Why is this course important?

  1. If your data is too big, don't throw in the towel; you can process it using the techniques shown here
  2. If you already process big data, maybe there are more efficient or cost-effective ways to do it
  3. Build the right solution for the right problem.

If you take this course:
You will know how to process big data in multiple ways and which approach is the best choice for your problem.

Objective:

We are receiving weather station data. We want to determine the average temperature for any given range of dates. (Can be a single date, or many dates.)

It would be wasteful to recalculate the answer each time, or to precompute every possible combination of dates. Instead, we create a data mart which looks like this:

Date        Station_name  Min  Mean  Max  Count
2024-05-25  Alexandria    8    20    26   20
2024-05-26  Alexandria    6    21    26   10
2024-05-27  Alexandria    9    19    27   15

This also allows us to handle late-arriving data. (We assume data for a previous date can arrive at any point.)

We'll materialize these partial calculations and store them somewhere. Then, whenever someone queries a certain date range, we can serve their result from the stored aggregates.
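
Note that the mean for a range of dates must be a count-weighted average of the daily means, not a plain average of them. A minimal sketch in polars, assuming the mart layout above (the lowercased column names are illustrative):

    import polars as pl

    # Hypothetical data mart matching the table above.
    mart = pl.DataFrame({
        "date": ["2024-05-25", "2024-05-26", "2024-05-27"],
        "station_name": ["Alexandria", "Alexandria", "Alexandria"],
        "min": [8, 6, 9],
        "mean": [20.0, 21.0, 19.0],
        "max": [26, 26, 27],
        "count": [20, 10, 15],
    })

    # Average temperature for a date range: weight each daily mean by
    # its row count, then divide by the total count.
    result = (
        mart
        .filter(pl.col("date").is_between("2024-05-25", "2024-05-26"))
        .group_by("station_name")
        .agg(
            ((pl.col("mean") * pl.col("count")).sum() / pl.col("count").sum())
            .alias("avg_temperature")
        )
    )
    print(result)  # (20*20 + 21*10) / 30 = 20.33...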

Themes:

  1. Multiple ways to process a file (see the chunked-processing sketch after this list)
    1. in memory
    2. in chunks
    3. streaming
    4. map reduce
    5. massively parallel processing (MPP) [out of scope]
  2. Big data is IO bound (when downloading/uploading big files)
    1. Compress when possible
    2. Move compute closer to the data (private network / VPC / access point / or, in the actual data center)
  3. Don't do things twice
    1. Caching (via disk) - don't download a file twice
    2. Incrementalism: use your data to determine offsets - don't process data twice
  4. Orchestrate pipelines instead of executing straight code
    1. Simplifies complex systems
    2. Allows delegation to other machines
  5. Big powerful tools can be expensive - but sometimes they are worth it
    1. Perhaps demonstrate how to process this all in Snowflake or BigQuery
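
For instance, here is a minimal sketch of option 1.2 (processing in chunks) using pandas; the file name, column name, and chunk size are illustrative:

    import pandas as pd

    # Stream the file in fixed-size chunks so only one chunk is ever
    # in memory, accumulating what's needed for a global average.
    total = 0.0
    count = 0
    for chunk in pd.read_csv("measurements.csv", chunksize=1_000_000):
        total += chunk["temperature"].sum()
        count += len(chunk)

    print(f"mean temperature: {total / count:.2f}")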

Coding

  1. Start with installation
    1. Prove things work with src/start_here/main.py
  2. Download a small file manually and process in pandas
    1. in memory
    2. in chunks
    3. in stream
  3. Process in polars
  4. Mention dask
  5. Now, we automate the download of the file (we can read directly from a URL)
    1. but we have a new problem: each run begins downloading the file all over again
    2. So we use a framework to "look before you leap": download the file only if needed (see the sketch below)
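
A minimal sketch of that look-before-you-leap caching, using only the standard library; the URL's file name and the local path are illustrative assumptions:

    from pathlib import Path
    import urllib.request

    def download_if_missing(url: str, dest: Path) -> Path:
        """Download url to dest only if it isn't already cached on disk."""
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(url, dest)
        return dest

    # Repeat runs find the file on disk and skip the download entirely.
    path = download_if_missing(
        "https://yb-big-data-workshop-1.s3-us-west-2.amazonaws.com/data.csv.gz",  # illustrative file name
        Path("data/data.csv.gz"),
    )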
