This is a learner project to understand how to ingest semi-structured data and query it with Spark. The overall flow:
- Run the init-spark script to deploy a Docker container running Hadoop and Spark
- Build the Java-based Spark application into a JAR file
- Use docker cp to copy the JAR file and the CSV dataset into the Spark container
- Run the JAR to process the CSV dataset
- Read the results
Built with:

- Docker
- Spark
- Java
- Gradle
The data folder contains a CSV dataset describing the total attendance grouped by medical institution and year.

The spark folder contains a Spark application that processes the CSV dataset and returns the total attendance grouped by medical institution.
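For reference, the application presumably looks something like the sketch below. The class name SparkApplication matches the run command later in this README, but the column names (medical_institution, attendance) and the local[*] master are assumptions, not the actual implementation.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkApplication {
    public static void main(String[] args) {
        // Run Spark in local mode inside the container (assumed setup)
        SparkSession spark = SparkSession.builder()
                .appName("AttendanceByInstitution")
                .master("local[*]")
                .getOrCreate();

        // Load the CSV passed as the first argument, inferring the schema
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(args[0]);

        // Sum attendance per medical institution
        // (column names are assumptions about the dataset's schema)
        df.groupBy("medical_institution")
          .sum("attendance")
          .show();

        spark.stop();
    }
}
```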
This is a script that clones a Spark Docker GitHub project and deploys a Docker container running Spark.
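In rough terms, the script presumably does something like the sketch below; the repo URL and start command here are placeholders, not the actual values in the script.

```sh
#!/bin/sh
# Placeholder repo URL for whichever Spark Docker project the script targets
git clone https://github.com/<user>/<spark-docker-project>.git
cd <spark-docker-project>
# Bring the Spark container up (the exact command depends on the cloned project)
docker-compose up -d
```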
Install Docker by following https://docs.docker.com/install.

Starting Docker depends on your OS. In my case, it is just a matter of launching the Docker app.
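Before proceeding, you can confirm the Docker daemon is up; docker info errors out if it is not running.

```sh
docker info
```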
This deploys the Docker container running Spark.

```sh
./init-spark.sh
```
Use your favorite IDE to build the JAR from the spark folder.
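Since the project is Gradle-based, the build should also work from the command line, assuming the repo ships the Gradle wrapper (the exact task name may differ, e.g. a shadow/fat-JAR task).

```sh
cd spark
./gradlew build
# The JAR is typically written to build/libs/
```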
```sh
# In the output jar folder, strip signature files from the fat JAR;
# otherwise signed dependencies can trigger a SecurityException
# ("Invalid signature file digest for Manifest main attributes")
zip -d spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF

# From the data folder, copy the dataset into the Spark container
docker cp hospital-and-outpatient-attendances.csv \
<spark_server_container_id>:hospital-and-outpatient-attendances.csv

# From the spark folder, copy the JAR into the container
docker cp spark.jar <spark_server_container_id>:spark.jar

# Open a shell inside the Spark container
docker exec -it <spark_server_container_id> bash

# Process the dataset
java -cp spark.jar SparkApplication hospital-and-outpatient-attendances.csv
```
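If the container image puts spark-submit on the PATH, the application can likely also be launched through it; this assumes SparkApplication is the main class, as above.

```sh
spark-submit --class SparkApplication spark.jar hospital-and-outpatient-attendances.csv
```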
Here are some housekeeping tips if you are on a low-memory machine like mine.
```sh
# This gives you a clean slate: stop all containers,
# then prune unused images, networks, and build cache
docker stop $(docker ps -a -q) && \
docker system prune -a
```
- Create and integrate a REST API
- Serve the processed results through the REST API