This is a Data Engineering project that builds a data pipeline for monitoring GitHub events in near real time (updated every hour). GH Archive publishes the events of the previous hour as a new dataset every hour. Details: https://www.gharchive.org/
GitHub receives ~100k events every hour, such as Push, PullRequest, Fork, etc. To monitor how well the business is performing, we defined the following KPIs to be visible in real time.
KPIs:
- Total GitHub events stored.
- Events per hour and by type.
- Percentage of each event type.
- Most frequently used words in commits.
These KPIs were also chosen as an example for designing a data pipeline with frequent updates.
GitHub releases a new event archive every hour. Our system also pulls data every hour, 15 minutes after each release, so the visualization dashboard stays up to date.
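As a rough illustration (not code from this repo), the archive for the previous hour can be located like this; GH Archive file names follow the pattern https://data.gharchive.org/YYYY-MM-DD-H.json.gz with a non-zero-padded hour:
# Sketch only: build the GH Archive URL for the previous hour.
from datetime import datetime, timedelta, timezone

def previous_hour_archive_url() -> str:
    prev = datetime.now(timezone.utc) - timedelta(hours=1)
    return f"https://data.gharchive.org/{prev:%Y-%m-%d}-{prev.hour}.json.gz"

print(previous_hour_archive_url())  # e.g. https://data.gharchive.org/2015-01-01-15.json.gz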
- data analysis to research which data to monitor
- pipeline to load files from the GitHub API and store them in Google Cloud Storage
- pipeline to import all files from Google Cloud Storage into Google BigQuery
- pipeline to transform data from the data lake into a clean Data Warehouse
- pipeline to transform raw Data Warehouse tables into aggregated data views
- pipeline to create real-time visualization dashboards
- pipeline to perform batch/stream processing
- optimize DB queries (partitioning & clustering)
- review and cleanup pipelines
- deploy to cloud
Everything is Dockerized, so it can be deployed to any platform easily.
See the analysis in the Jupyter notebook: data analysis
- loaded data for 1 hour from GitHub API
- cleaned up the data and created a DataFrame
- decided to create dashboards for the following data:
- Count of event types
- Distribution of commit count for PushEvents
- Most common words in commits
These 3 dashboards will be displayed in the final visualizations once the data pipeline is complete.
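A minimal sketch of that exploration (the archive file name below is illustrative; the full analysis is in the notebook):
# Minimal sketch of the exploration; the file name is illustrative.
import pandas as pd

df = pd.read_json("2024-01-01-15.json.gz", lines=True)

# Count of event types
print(df["type"].value_counts())

# Distribution of commit counts for PushEvents
push = df[df["type"] == "PushEvent"]
print(push["payload"].apply(lambda p: len(p.get("commits", []))).describe())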
To set up Google Cloud Platform resources (BigQuery and GCS), run the following commands from the /terraform_gcp/terraform folder:
# login to gcloud cli
gcloud auth application-default login
terraform init
terraform apply
# to destroy
terraform destroy
The github_data_clean table is partitioned on the created_at_timestamp field using DBT. This field is used to group event counts by hour (see the query sketch after the DDL below).
Optionally, the table can be created manually:
CREATE TABLE de_final_data.github_data_clean
(
id STRING,
created_at_timestamp TIMESTAMP,
type STRING,
actor_id_string STRING,
actor_login STRING,
public BOOLEAN,
repo_id_string STRING,
repo_name STRING,
payload_commits STRING
)
PARTITION BY
TIMESTAMP_TRUNC(created_at_timestamp, HOUR)
CLUSTER BY
id
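A hedged sketch of how the hourly grouping can be queried with the BigQuery Python client, using the dataset and table names from the DDL above:
# Sketch: events per hour and by type, grouped on the partitioning column.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT TIMESTAMP_TRUNC(created_at_timestamp, HOUR) AS event_hour,
           type,
           COUNT(*) AS event_count
    FROM `de_final_data.github_data_clean`
    GROUP BY event_hour, type
    ORDER BY event_hour DESC, event_count DESC
"""
for row in client.query(sql).result():
    print(row.event_hour, row.type, row.event_count)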
The de_final_data.words_data table was created manually. The word column is grouped to count its frequency (see the query sketch after the DDL below).
CREATE TABLE de_final_data.words_data
(
github_event_foreign_key STRING,
word STRING
)
CLUSTER BY
github_event_foreign_key,
word
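A similar sketch for the word-frequency aggregation (the LIMIT is arbitrary):
# Sketch: most frequent commit words from the words_data table above.
from google.cloud import bigquery

sql = """
    SELECT word, COUNT(*) AS frequency
    FROM `de_final_data.words_data`
    GROUP BY word
    ORDER BY frequency DESC
    LIMIT 20
"""
for row in bigquery.Client().query(sql).result():
    print(row.word, row.frequency)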
Airflow has a DAG with multiple jobs for GitHub events:
- download the GitHub archive file every hour
- convert it from .json.gz to .parquet
- upload the data to Google Cloud Storage
- transfer the data from Google Cloud Storage to Google BigQuery
- delete local temp files
And another DAG with one job:
- transfer GitHub commit words data from Google Cloud Storage to Google BigQuery
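The actual DAG code lives in the airflow/ folder; the following is only a rough sketch of the hourly DAG's shape, with illustrative task ids, stub callables, and a 15-minutes-past-the-hour schedule:
# Rough sketch of the hourly DAG's shape; task ids, stubs and schedule are illustrative, not the repo's code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real download/convert/upload/load/cleanup logic.
def download_archive(): ...
def convert_to_parquet(): ...
def upload_to_gcs(): ...
def load_to_bigquery(): ...
def cleanup_tmp_files(): ...

with DAG(
    dag_id="github_events_hourly",
    schedule_interval="15 * * * *",   # 15 minutes past every hour, after GH Archive publishes
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_archive", python_callable=download_archive)
    convert = PythonOperator(task_id="convert_to_parquet", python_callable=convert_to_parquet)
    upload = PythonOperator(task_id="upload_to_gcs", python_callable=upload_to_gcs)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    cleanup = PythonOperator(task_id="cleanup_tmp_files", python_callable=cleanup_tmp_files)

    download >> convert >> upload >> load >> cleanup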
Run locally:
cd airflow
# review the .env file
# ensure your Google credentials are correct; this file must exist: ~/.google/credentials/google_credentials.json
docker-compose up
# once started, open in browser: http://localhost:8080/home and start both DAGs
DBT is used to create BigQuery views of aggregated data.
The DBT image has a startup script that runs updates periodically.
- BigQuery table: github_data_clean with better field types, partitioning and clustering.
- BigQuery views: events_per_hour_by_type, events_per_minute, count, most_used_commit_words.
Run locally:
docker build -t dbt_transform .
# requires correct google_credentials.json path
docker run --rm --name dbtTransform -it -v $PWD:/dbt -v ..../google_credentials.json:/dbt/google_credentials.json -v profiles.yml:/root/.dbt/profiles.yml dbt_transform
Spark is used to load GitHub event data from BigQuery, extract commit messages, split the messages into words, and list the most common words. The result is written back to GCS as a parquet file; an Airflow job then loads it into BigQuery as a separate table.
The Spark image has a startup script that runs updates periodically.
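A minimal sketch of that word-extraction step, assuming the spark-bigquery connector is available and treating the payload_commits column as plain commit-message text (the real job and bucket path differ):
# Minimal sketch of the word extraction; table name, bucket path and column handling are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("commit_words").getOrCreate()

events = (
    spark.read.format("bigquery")                        # spark-bigquery connector
    .option("table", "de_final_data.github_data_clean")
    .load()
    .where(F.col("type") == "PushEvent")
)

words = (
    events
    .select("id", F.explode(F.split(F.lower(F.col("payload_commits")), r"\W+")).alias("word"))
    .where(F.length("word") > 2)
    .withColumnRenamed("id", "github_event_foreign_key")
)

# Write back to GCS as parquet; the Airflow DAG then loads it into BigQuery.
words.write.mode("overwrite").parquet("gs://<your-bucket>/words_data/")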
Run locally:
# Add google_credentials.json and p12 to this folder.
docker build -f Dockerfile -t spark_3_1_datamech .
# Dockerfile3 can also be used but main.py file has to be manually submitted. See Dockerfile3 for comments.
docker run --rm --name spark_3_1_datamech -it spark_3_1_datamech
Visualizations are done using Google Data Studio and they update in near real time (every hour, once data is ingested from GitHub).
View visualization Live: https://datastudio.google.com/s/urfSamU8nfQ
Deployed to the cloud following the "Steps to reproduce" above.
I used a GCP Compute Engine virtual machine instance, with all Docker images running together.