sparkfish/python-spark-sample

Spark ETL Sample Project

The goal of this project is to perform an extract, transform and load (ETL) process to migrate data into a local Apache Spark cluster.

  • Language: Python
  • Technologies: Spark, GPG Encryption

ETL Process

  1. Decrypt local GPG-encrypted CSV files
  2. Load CSV tabular data into Spark DataFrames
  3. Save DataFrame data to Parquet files
  4. Write query to determine average age
  5. Write query to determine age at the 75th percentile
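
The five steps above could be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the project's reference solution: the file paths, the `age` column name, and the helper names `gpg_decrypt_command`, `decrypt_file`, and `run_etl` are all hypothetical, and the GPG private key is assumed to have been imported into the local keyring already.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def gpg_decrypt_command(encrypted: Path, output: Path) -> List[str]:
    """Build the gpg command line that decrypts one file (step 1)."""
    # Options must come before the --decrypt command and its file argument.
    return ["gpg", "--batch", "--yes",
            "--output", str(output),
            "--decrypt", str(encrypted)]


def decrypt_file(encrypted: Path, output: Path) -> None:
    """Run gpg; assumes the private key is already in the local keyring."""
    subprocess.run(gpg_decrypt_command(encrypted, output), check=True)


def run_etl(csv_path: str, parquet_path: str,
            age_column: str = "age") -> Tuple[float, float]:
    """Steps 2-5: load the CSV, save Parquet, then query the saved data."""
    # Imported lazily so the gpg helpers above work without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")            # local Spark, no external cluster
             .appName("spark-etl-sample")
             .getOrCreate())

    # Step 2: load CSV tabular data into a DataFrame.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Step 3: save the DataFrame to Parquet files.
    df.write.mode("overwrite").parquet(parquet_path)

    # Steps 4 and 5: average age and (approximate) 75th-percentile age.
    # `age_column` is an assumed column name; adjust it to the real schema.
    row = (spark.read.parquet(parquet_path)
           .agg(F.avg(age_column).alias("avg_age"),
                F.percentile_approx(age_column, 0.75).alias("p75_age"))
           .first())
    spark.stop()
    return row["avg_age"], row["p75_age"]
```

A call such as `decrypt_file(Path("data.csv.gpg"), Path("data.csv"))` followed by `run_etl("data.csv", "out/data.parquet")` would run the whole flow, with both file names standing in for the real ones in the repo. Note that `percentile_approx` (available in PySpark 3.1 and later) returns an approximate percentile, which is usually adequate for a summary query like this one.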

Approach

  1. Ask questions to get clarification
  2. Install Apache Spark (note: if setting up Spark locally proves too troublesome, you may use DuckDB instead)
  3. Write code in Python using data files and GPG keys stored in this repo
  4. Commit code to your repo and share link
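
If the DuckDB fallback mentioned in step 2 is used, the Parquet export and both queries might look like the sketch below. It assumes the CSV has already been decrypted and that the `duckdb` Python package is installed; the `age` column name, the paths, and the helper names `age_stats_sql` and `run_with_duckdb` are hypothetical.

```python
from typing import Tuple


def age_stats_sql(csv_path: str, age_column: str = "age") -> str:
    """Build one query returning the average and 75th-percentile age."""
    # `csv_path` and `age_column` are illustrative; adjust to the real data.
    return (
        f"SELECT avg({age_column}) AS avg_age, "
        f"quantile_cont({age_column}, 0.75) AS p75_age "
        f"FROM read_csv_auto('{csv_path}')"
    )


def run_with_duckdb(csv_path: str, parquet_path: str) -> Tuple[float, float]:
    """Export the CSV to Parquet, then compute the two age statistics."""
    import duckdb  # imported lazily; requires `pip install duckdb`

    con = duckdb.connect()  # in-memory database, nothing persisted
    # Write the CSV contents out as a Parquet file.
    con.execute(
        f"COPY (SELECT * FROM read_csv_auto('{csv_path}')) "
        f"TO '{parquet_path}' (FORMAT PARQUET)"
    )
    avg_age, p75_age = con.execute(age_stats_sql(csv_path)).fetchone()
    con.close()
    return avg_age, p75_age
```

Here `quantile_cont` interpolates between neighbouring values, so its result can differ slightly from an approximate percentile computed in Spark on the same data.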
