The goal of this project is to perform an extract, transform, and load (ETL) process that migrates data into a local Apache Spark cluster.
- Language: Python
- Technologies: Spark, GPG Encryption
- Decrypt the local GPG-encrypted CSV files (see the decryption sketch below)
- Load the CSV tabular data into Spark DataFrames (see the loading and Parquet sketch below)
- Save the DataFrame data to Parquet files (see the loading and Parquet sketch below)
- Write a query to determine the average age (see the query sketch below)
- Write a query to determine the age at the 75th percentile (see the query sketch below)
- Ask questions to get clarification
- Install Apache Spark (note: if setting up Spark locally proves too troublesome, you may use DuckDB instead; see the DuckDB sketch below)
- Write code in Python using the data files and GPG keys stored in this repo
- Commit your code to your repo and share the link
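
A minimal sketch of the decryption step using the python-gnupg package. The key file path `keys/private.key`, the `data/*.csv.gpg` layout, and the passphrase are all assumptions; adjust them to the actual repo contents.

```python
# Decrypt GPG-encrypted CSV files with python-gnupg.
# Paths, key file name, and passphrase below are illustrative assumptions.
from pathlib import Path

import gnupg

GNUPG_HOME = Path(".gnupg_home")          # isolated keyring for this project (assumption)
GNUPG_HOME.mkdir(exist_ok=True)
gpg = gnupg.GPG(gnupghome=str(GNUPG_HOME))

# Import the private key stored in the repo (file name is an assumption).
key_data = Path("keys/private.key").read_text()
gpg.import_keys(key_data)

# Decrypt every *.csv.gpg file into a plain CSV next to it.
for encrypted in Path("data").glob("*.csv.gpg"):
    output = encrypted.with_suffix("")    # strips the trailing .gpg
    with encrypted.open("rb") as fh:
        result = gpg.decrypt_file(
            fh,
            passphrase="CHANGE_ME",       # assumption: key is passphrase-protected
            output=str(output),
        )
    if not result.ok:
        raise RuntimeError(f"Failed to decrypt {encrypted}: {result.status}")
```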
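
A possible shape for the loading and Parquet step with PySpark. The file name `data/people.csv` and the `output/` directory are assumptions.

```python
# Load the decrypted CSV into a Spark DataFrame and persist it as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# header/inferSchema assume the CSV has a header row (assumption).
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/people.csv")          # assumed file name
)
df.printSchema()

# Write the DataFrame out as Parquet; overwrite keeps reruns idempotent.
df.write.mode("overwrite").parquet("output/people.parquet")
```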
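
One way to express the two required queries, assuming the dataset exposes a numeric `age` column (an assumption). `percentile_approx` is used here because exact percentiles are expensive in Spark.

```python
# Average age and age at the 75th percentile over the Parquet output.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-queries").getOrCreate()
df = spark.read.parquet("output/people.parquet")   # path from the previous sketch

# Average age.
avg_age = df.agg(F.avg("age").alias("avg_age")).collect()[0]["avg_age"]

# Age at the 75th percentile (approximate).
p75_age = df.agg(
    F.percentile_approx("age", 0.75).alias("p75_age")
).collect()[0]["p75_age"]

print(f"average age: {avg_age:.1f}, 75th percentile age: {p75_age}")
```

The same queries could also be written as Spark SQL against a temporary view; the DataFrame API is shown only because it keeps the example self-contained.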
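
If Spark proves too troublesome to set up locally, the note above allows DuckDB instead. A sketch of the same pipeline in DuckDB follows; file and column names are again assumptions.

```python
# DuckDB fallback: CSV -> Parquet, then the two queries.
import duckdb

con = duckdb.connect()

# DuckDB can read the decrypted CSV directly and write Parquet.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('data/people.csv'))
    TO 'output/people.parquet' (FORMAT PARQUET)
""")

# Average age and age at the 75th percentile against the Parquet file.
avg_age, p75_age = con.execute("""
    SELECT avg(age), quantile_cont(age, 0.75)
    FROM read_parquet('output/people.parquet')
""").fetchone()
print(f"average age: {avg_age:.1f}, 75th percentile age: {p75_age}")
```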