This is a learner project to understand how to write a Spark dataset to ElasticSearch and read it.
- git clone https://github.com/panggd/docker-hadoop.git and run the init-hadoop script
- Run the init-elasticsearch script to deploy Docker containers for Spark and ElasticSearch
- Build the Java-based Spark-ElasticSearch application into a JAR file
- docker cp the JAR file and the CSV dataset into the containers
- Run the JAR to process the CSV dataset
- Read the results
- Docker
- ElasticSearch
- Spark
- Java
- Gradle
This folder contains a CSV dataset describing total attendance grouped by medical institution and year.
This folder contains the Spark-ElasticSearch application that processes the CSV dataset to return total attendance grouped by medical institution.
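The heart of that job is a group-by-sum. As a rough sketch of the aggregation it performs — written here in plain Java rather than the Spark API, and with hypothetical column positions, since the actual schema lives in the CSV header in the data folder:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AttendanceTotals {

    // Each row is {year, institution, attendance}; the column positions are
    // an assumption -- check the CSV header in the data folder for the real schema.
    static Map<String, Long> totalByInstitution(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(
                        row -> row[1],                                   // key: institution
                        Collectors.summingLong(row -> Long.parseLong(row[2])))); // sum attendance
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"2016", "Hospital A", "1000"},
                new String[]{"2017", "Hospital A", "1200"},
                new String[]{"2016", "Clinic B", "500"});
        System.out.println(totalByInstitution(rows)); // e.g. {Clinic B=500, Hospital A=2200} (order may vary)
    }
}
```

The real application expresses the same idea with Spark (a `groupBy(...).agg(sum(...))` over the dataset) and then writes the result to ElasticSearch, typically through the elasticsearch-hadoop connector.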
This script deploys the Spark and ElasticSearch Docker images as containers.
https://docs.docker.com/install
You need to allocate more CPU and memory to the containers for this project to work. I set mine to 4 CPU cores and 4 GB RAM.
This really depends on your OS. In my case, it is just a matter of starting the Docker app.
git clone https://github.com/panggd/docker-hadoop.git
./init-hadoop
This will deploy the Docker containers holding Spark and ElasticSearch.
docker-compose up -d
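The init-elasticsearch script wraps a docker-compose file; a minimal sketch of what such a file might contain is shown below. The image names, versions, and ports here are assumptions for illustration — the actual values come from the repo's own script.

```yaml
version: "3"
services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7   # assumed image/version
    ports:
      - "8080:8080"   # Spark master web UI
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.0   # assumed version
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"   # REST port used later to read results back
```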
Use your favorite IDE and build the JAR in the spark folder (the project uses Gradle).
# Go to the output JAR folder and strip the code-signing metadata,
# otherwise the fat JAR can fail with "Invalid signature file digest" at runtime
zip -d spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
# Go to the data folder and copy the dataset to the hadoop namenode container
docker cp hospital-and-outpatient-attendances.csv \
<hadoop_namenode_container_id>:hospital-and-outpatient-attendances.csv

# bash into the namenode container
docker exec -it <hadoop_namenode_container_id> bash

# Copy the dataset to HDFS
hdfs dfs -mkdir /data
hdfs dfs -put hospital-and-outpatient-attendances.csv /data/
# Go to the spark folder and copy the JAR to the spark master container
docker cp elasticsearch.jar <spark_master_container_id>:elasticsearch.jar

# Get into the spark master container
docker exec -it <spark_master_container_id> bash

# Process the dataset
java -cp elasticsearch.jar SparkElasticApplication hdfs://namenode:9000/data/hospital-and-outpatient-attendances.csv
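The step above only writes the output. To read the results back, you can query ElasticSearch over its REST port from the host. This sketch assumes ElasticSearch is published on localhost:9200 and that the application writes to an index named `attendance` — check the SparkElasticApplication source for the actual index name.

```shell
# List all indices to find the one the job wrote to
curl -s "http://localhost:9200/_cat/indices?v"

# Fetch documents from the (assumed) attendance index
curl -s "http://localhost:9200/attendance/_search?pretty&size=10"
```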
Here are some housekeeping tips if you are on a low-memory machine like mine.
# Stop all containers and prune images to get a clean state of your Docker environment
docker stop $(docker ps -a -q) && \
docker system prune -a
- Create and integrate a REST API
- Serve the output results through the REST API
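That future work could start from the JDK's built-in HTTP server. The sketch below is hypothetical (the class name, `/results` route, and stubbed payload are all assumptions): it shows the shape of an endpoint that would eventually query ElasticSearch and serve the aggregated results instead of the hard-coded JSON.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ResultApi {

    // Start an HTTP server on the given port (0 = pick an ephemeral port).
    // GET /results returns a stubbed JSON payload; the real integration
    // would query ElasticSearch here and relay the aggregation response.
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/results", exchange -> {
            byte[] body = "{\"Hospital A\": 2200}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling `ResultApi.start(8000)` would then serve http://localhost:8000/results until the server is stopped.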