
BigData project: Analyzing online Vietnam jobs

Build a big data storage and processing system for analyzing Vietnam job data from recruitment websites such as Careerlink, Careerbuilder, TopCV,... Data is crawled with Scrapy and processed using Spark Streaming. The Spark cluster runs inside Docker.

Pipeline

Data crawled from two websites, Careerbuilder and Careerlink, by Scrapy is written to a Kafka cluster. Spark streaming jobs then subscribe to these topics to read and process the data. The processed data is written to a MongoDB container for further use. Kafka, Spark, and MongoDB all run in containers on Docker. A sketch of one such streaming job follows the diagram below.

[Pipeline architecture diagram]
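
For orientation, here is a minimal sketch of what one of these streaming jobs does. The schema fields, topic, and database/collection names are illustrative assumptions; the actual jobs live under spark/Careerbuilder and spark/Careerlink.

# Sketch of a streaming job: Kafka topic -> parse JSON -> MongoDB.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("careerbuilder-stream").getOrCreate()

# Hypothetical posting schema; adjust to the fields the spiders actually emit.
schema = StructType([
    StructField("title", StringType()),
    StructField("company", StringType()),
    StructField("location", StringType()),
    StructField("salary", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-1:9092")
       .option("subscribe", "careerbuilder")
       .load())

jobs = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), schema).alias("job"))
           .select("job.*"))

def write_batch(df, epoch_id):
    # foreachBatch reuses the batch Mongo connector for each micro-batch.
    (df.write.format("mongo")
       .mode("append")
       .option("uri", "mongodb://mymongodb:27017/job-analysis.careerbuilder")
       .save())

jobs.writeStream.foreachBatch(write_batch).start().awaitTermination()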

Installation

  • Install all required libraries
pip install -r requirements.txt
  • Download Docker Desktop

  • Build the Spark cluster, Kafka, and MongoDB containers

docker-compose up
  • Check the Spark cluster web UI at localhost:8080

[Spark cluster web UI screenshot]

  • Create Kafka topics

For example, create the topic careerbuilder on the kafka-1 broker:

docker exec -it kafka-1 kafka-topics.sh --bootstrap-server localhost:9092 --topic careerbuilder --create --partitions 3 --replication-factor 1
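
Alternatively, the topics can be created from Python. A sketch using kafka-python (an assumption: the library may not be in requirements.txt, and a broker must be reachable on localhost:9092):

# Sketch: create both topics programmatically with kafka-python.
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
topics = [
    NewTopic(name="careerbuilder", num_partitions=3, replication_factor=1),
    NewTopic(name="careerlink", num_partitions=3, replication_factor=1),
]
for topic in topics:
    try:
        admin.create_topics([topic])
        print(f"created {topic.name}")
    except TopicAlreadyExistsError:
        print(f"{topic.name} already exists")
admin.close()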

Running

  • Copy the .py files of each job to a Spark worker directory
docker cp vietnam-jobs-analysis/spark/Careerbuilder vietnam-jobs-analysis-spark-worker-1-1:/opt/bitnami/spark

docker cp vietnam-jobs-analysis/spark/Careerlink vietnam-jobs-analysis-spark-worker-2-1:/opt/bitnami/spark
  • Submit the job to the Spark cluster
docker exec -it vietnam-jobs-analysis-spark-worker-1-1 spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,org.apache.kafka:kafka-clients:3.5.0,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 \
  --conf spark.jars.ivy=/tmp/bitnami/pkg/cache \
  --num-executors 2 --driver-memory 512m --executor-memory 512m --executor-cores 2 \
  Careerbuilder/CareerbuilderMain.py

In the Spark web UI, the submitted job shows up as running.

  • Run Scrapy to crawl data (a sketch of the item pipeline that feeds Kafka follows below)
scrapy crawl careerbuilder
scrapy crawl careerlink
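
The spiders hand scraped items to Kafka through a Scrapy item pipeline. A minimal sketch of such a pipeline, assuming kafka-python; the class name and topic routing are illustrative, not necessarily what this repository uses:

# Sketch of a Scrapy item pipeline that publishes scraped jobs to Kafka.
# Enable it via ITEM_PIPELINES in settings.py.
import json
from kafka import KafkaProducer

class KafkaPipeline:
    def open_spider(self, spider):
        self.producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda item: json.dumps(item).encode("utf-8"),
        )

    def process_item(self, item, spider):
        # Route each spider's items to a topic named after the spider,
        # e.g. "careerbuilder" or "careerlink".
        self.producer.send(spider.name, dict(item))
        return item

    def close_spider(self, spider):
        self.producer.flush()
        self.producer.close()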
  • Check the MongoDB container for output
docker exec -it mymongodb /bin/bash
mongosh
show dbs
use job-analysis
db.careerbuilder.find()
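
The same check can be scripted with pymongo, assuming the container's port 27017 is published to the host:

# Sketch: verify the streamed output from the host with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["job-analysis"]
print("collections:", db.list_collection_names())
for job in db["careerbuilder"].find().limit(5):
    print(job)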
