This is a learner project to understand how to write a Spark dataset to ElasticSearch and read it.
- git clone https://github.com/panggd/docker-hadoop.git and run the init-hadoop script
- Run the init-elasticsearch script to deploy Docker containers for Spark and ElasticSearch
- Build the Java-based Spark-ElasticSearch application into a JAR file
- docker cp the JAR file and the CSV dataset into the containers
- Run the JAR to process the CSV dataset
- Read the results
- Docker
- ElasticSearch
- Spark
- Java
- Gradle
This folder contains a CSV dataset describing total attendance grouped by medical institution and year.
This folder contains the Spark-ElasticSearch application that processes the CSV dataset to return total attendance grouped by medical institution.
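The heart of that job is a group-by-sum. As a rough sketch of the aggregation it performs — written here in plain Java rather than the Spark API, and with hypothetical column positions, since the actual schema lives in the CSV header in the data folder:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AttendanceTotals {

    // Each row is {year, institution, attendance}; the column positions are
    // an assumption -- check the CSV header in the data folder for the real schema.
    static Map<String, Long> totalByInstitution(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(
                        row -> row[1],                                   // key: institution
                        Collectors.summingLong(row -> Long.parseLong(row[2])))); // sum attendance
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"2016", "Hospital A", "1000"},
                new String[]{"2017", "Hospital A", "1200"},
                new String[]{"2016", "Clinic B", "500"});
        System.out.println(totalByInstitution(rows)); // e.g. {Clinic B=500, Hospital A=2200} (order may vary)
    }
}
```

The real application expresses the same idea with Spark (a `groupBy(...).agg(sum(...))` over the dataset) and then writes the result to ElasticSearch, typically through the elasticsearch-hadoop connector.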
This script deploys the Spark and ElasticSearch Docker images as containers.
https://docs.docker.com/install
You need to allocate more CPU and memory to the containers for this project to work. I set mine to 4 CPU cores and 4 GB RAM.
This really depends on your OS. In my case, it is just a matter of starting the Docker app.
git clone https://github.com/panggd/docker-hadoop.git
./init-hadoop
This will deploy the Docker containers holding Spark and ElasticSearch.
docker-compose up -d
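The init-elasticsearch script wraps a docker-compose file; a minimal sketch of what such a file might contain is shown below. The image names, versions, and ports here are assumptions for illustration — the actual values come from the repo's own script.

```yaml
version: "3"
services:
  spark-master:
    image: bde2020/spark-master:2.4.0-hadoop2.7   # assumed image/version
    ports:
      - "8080:8080"   # Spark master web UI
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.0   # assumed version
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"   # REST port used later to read results back
```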
Use your favorite IDE and build the JAR in the spark folder (the project uses Gradle).
# Go to the output JAR folder and strip the code-signing metadata,
# otherwise the fat JAR can fail with "Invalid signature file digest" at runtime
zip -d spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
# Go to the data folder and copy the dataset to the hadoop namenode container
docker cp hospital-and-outpatient-attendances.csv \
<hadoop_namenode_container_id>:hospital-and-outpatient-attendances.csv

# bash into the namenode container
docker exec -it <hadoop_namenode_container_id> bash

# Copy the dataset to HDFS
hdfs dfs -mkdir /data
hdfs dfs -put hospital-and-outpatient-attendances.csv /data/
# Go to the spark folder and copy the JAR to the spark master container
docker cp elasticsearch.jar <spark_master_container_id>:elasticsearch.jar

# Get into the spark master container
docker exec -it <spark_master_container_id> bash

# Process the dataset
java -cp elasticsearch.jar SparkElasticApplication hdfs://namenode:9000/data/hospital-and-outpatient-attendances.csv
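The step above only writes the output. To read the results back, you can query ElasticSearch over its REST port from the host. This sketch assumes ElasticSearch is published on localhost:9200 and that the application writes to an index named `attendance` — check the SparkElasticApplication source for the actual index name.

```shell
# List all indices to find the one the job wrote to
curl -s "http://localhost:9200/_cat/indices?v"

# Fetch documents from the (assumed) attendance index
curl -s "http://localhost:9200/attendance/_search?pretty&size=10"
```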
Here are some housekeeping tips if you are on a low-memory machine like mine.
# Stop all containers and prune images to get a clean state of your Docker environment
docker stop $(docker ps -a -q) && \
docker system prune -a
- Create and integrate a REST API
- Serve the output results through the REST API
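That future work could start from the JDK's built-in HTTP server. The sketch below is hypothetical (the class name, `/results` route, and stubbed payload are all assumptions): it shows the shape of an endpoint that would eventually query ElasticSearch and serve the aggregated results instead of the hard-coded JSON.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ResultApi {

    // Start an HTTP server on the given port (0 = pick an ephemeral port).
    // GET /results returns a stubbed JSON payload; the real integration
    // would query ElasticSearch here and relay the aggregation response.
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/results", exchange -> {
            byte[] body = "{\"Hospital A\": 2200}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling `ResultApi.start(8000)` would then serve http://localhost:8000/results until the server is stopped.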