Add deployment scripts and documentation (#10)
Signed-off-by: Apekshit Sharma <[email protected]>
apeksharma authored May 17, 2020
1 parent 0aba949 commit 4ea98c4
Showing 7 changed files with 261 additions and 13 deletions.
49 changes: 36 additions & 13 deletions README.md
@@ -19,7 +19,7 @@ Hedera ETL populates BigQuery dataset with transactions and records generated by
the pipeline.

- Deduplication: The above ingestion pipeline gives an at-least-once guarantee for persisting transactions into BigQuery.
Duplicates, if inserted, are removed using a cron job.
Duplicates, if inserted, are removed using a deduplication task.

## Setup

@@ -32,7 +32,8 @@ info about columns.

#### Creating tables

Use `bq` CLI to create the tables.
The `bq` CLI is needed to create the tables. [create-tables.sh](scripts/create-tables.sh) can be used to create all the
tables at once (example below). Alternatively, tables can be created individually using the commands in the following sections.
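
For instance, assuming the target dataset (named after `DEPLOYMENT_NAME` by default) already exists and using placeholder values for the project and deployment name:

```bash
# Creates the transactions, errors, and dedupe_state tables in one invocation; table names
# can be overridden via BQ_TRANSACTIONS_TABLE, BQ_ERRORS_TABLE, and BQ_DEDUPE_STATE_TABLE.
PROJECT_ID=my-gcp-project DEPLOYMENT_NAME=testnet ./scripts/create-tables.sh
```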

###### Transactions table

@@ -49,7 +50,7 @@ bq mk \

###### Errors table

If an error is encountered when inserting a transaction into BigQuery, then the insert it retried. However, errors
If an error is encountered when inserting a transaction into BigQuery, then the insert is retried. However, errors
for which a retry would not help (for example, a row that violates the table schema) are not retried and are instead
logged to the errors table.

@@ -61,14 +62,28 @@ bq mk \
hedera-etl-dataflow/src/main/resources/errors_schema.json
```

### Apache Beam Pipeline
##### Deduplication state table
The deduplication task's state is stored in a BigQuery table so that it persists across runs. The task already relies on
BigQuery being available, so reusing it avoids adding a dependency on a persistent volume or another database.

```bash
bq mk \
--table \
--description "BigQuery deduplication task state" \
--description "Hedera Dedupe " \
project_id:dataset.dedupe_state \
hedera-dedupe-bigquery/state-schema.json
```
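
The table's schema is defined in [state-schema.json](hedera-dedupe-bigquery/state-schema.json); once the table exists, the deployed schema can be sanity-checked with the `bq` CLI (same placeholder project and dataset as above):

```bash
# Print the dedupe_state table's schema as JSON to verify it matches state-schema.json.
bq show --schema --format=prettyjson project_id:dataset.dedupe_state
```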


### ETL to BigQuery

###### Requirements

1. BigQuery tables for transactions and errors should exist
2. PubSub topic should exist
3. If using GCP Dataflow, a user or service account with the following roles: BigQuery Data Editor, Dataflow Worker, and
Pub/Sub Subscriber

For the requirements to deploy on GCP Dataflow, refer to [deployment](docs/deployment.md).
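
The roles can be granted with `gcloud`, as [setup-gcp-resources.sh](scripts/setup-gcp-resources.sh) automates; a minimal sketch for one role, assuming a hypothetical service account named `etl-bigquery` and a placeholder project:

```bash
PROJECT_ID=my-gcp-project  # placeholder
# Grant one required role to the pipeline's service account; repeat for the other roles
# (roles/dataflow.worker, roles/pubsub.subscriber, ...).
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:etl-bigquery@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/bigquery.dataEditor
```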

###### Common parameters

@@ -98,7 +113,7 @@ mvn compile exec:java -PdirectRunner -Dexec.args=" \

```bash
BUCKET_NAME=... # Set your bucket name
PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
PIPELINE_FOLDER=gs://${BUCKET_NAME}/pipelines/etl-pipeline
```

2. Build and upload template to GCS bucket
@@ -118,17 +133,25 @@ mvn compile exec:java \
3. Start Dataflow job using the template

```bash
gcloud dataflow jobs run pubsub-to-bigquery-`date +"%Y%m%d-%H%M%S%z"` \
gcloud dataflow jobs run etl-pipeline-`date +"%Y%m%d-%H%M%S%z"` \
--gcs-location=${PIPELINE_FOLDER}/template \
--parameters \
"inputSubscription=${SUBSCRIPTION}, \
outputTransactionsTable=${TRANSACTIONS_TABLE}, \
outputErrorsTable=${ERRORS_TABLE}"
--parameters "inputSubscription=${SUBSCRIPTION},outputTransactionsTable=${TRANSACTIONS_TABLE},outputErrorsTable=${ERRORS_TABLE}"
```
The controller service account can be configured by adding
`--service-account-email=my-service-account-name@<project-id>.iam.gserviceaccount.com`. See
[Controller service account](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#controller_service_account)
for more details.
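
To verify that the job started, active Dataflow jobs can be listed with `gcloud`; the region below is only an example, so substitute the region the job actually runs in:

```bash
# The job named etl-pipeline-<timestamp> should appear here in the Running state.
gcloud dataflow jobs list --status=active --region=us-central1
```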

### Deduplication

Work in progress.
The deduplication task trails the transactions table to ensure that no two rows ever have the same consensusTimestamp.
Because of the at-least-once guarantees of PubSub and of the Hedera Mirror Node (which publishes to PubSub), in rare
cases a single transaction can be inserted more than once. The deduplication task removes these duplicates to provide an
exactly-once guarantee. See the class comments on [DedupeRunner](hedera-dedupe-bigquery/src/main/java/com/hedera/dedupe/DedupeRunner.java)
for more details.
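
As an illustration of the invariant only (not of how DedupeRunner is implemented), duplicates can be checked for with a standard SQL query; the project and dataset names are placeholders:

```bash
# Any rows returned indicate consensusTimestamp values that were inserted more than once.
bq query --use_legacy_sql=false \
  'SELECT consensusTimestamp, COUNT(*) AS copies
   FROM `my-project.my_dataset.transactions`
   GROUP BY consensusTimestamp
   HAVING copies > 1'
```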

## More documentation
[Deployment](docs/deployment.md)

## Code of Conduct
This project is governed by the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are
35 changes: 35 additions & 0 deletions docs/deployment.md
@@ -0,0 +1,35 @@
# Deployment

## Requirements
1. BigQuery tables: transactions, errors, dedupe_state
1. PubSub topic for transactions
1. GCS bucket: used for Dataflow templates, staging, and as the temp location
1. ETL pipeline from PubSub to BigQuery:
   1. PubSub subscription
   1. Service account with the following roles: BigQuery Data Editor, Dataflow Worker, Pub/Sub Subscriber, and Storage
      Object Admin
1. Deduplication task:
   1. Service account with the following roles: BigQuery Data Editor, BigQuery Job User, and Monitoring Metric Writer
1. Mirror Importer:
   1. Service account with the following role: PubSub Publisher

Resource creation can be automated using [setup-gcp-resources.sh](../scripts/setup-gcp-resources.sh).
[Google Cloud SDK](https://cloud.google.com/sdk/docs) is required to run the script.
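
For example, following the usage documented in the script header (placeholder values, run from the repository root):

```bash
# Creates the BigQuery dataset and tables, PubSub topic and subscription, GCS bucket,
# and service accounts, and downloads the service account keys (to ./testnet-keys by default).
PROJECT_ID=my-gcp-project DEPLOYMENT_NAME=testnet ./scripts/setup-gcp-resources.sh
```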

## Steps

1. Deploy ETL pipeline

Use the [deploy-etl-pipeline.sh](../scripts/deploy-etl-pipeline.sh) script to deploy the ETL pipeline to GCP Dataflow (see the example invocation after this list).

1. Deploy Deduplication task

TODO

1. Deploy the Hedera Mirror Node Importer to publish transactions to the PubSub topic. See the Mirror Node
[installation](https://github.com/hashgraph/hedera-mirror-node/blob/master/docs/installation.md) and
[configuration](https://github.com/hashgraph/hedera-mirror-node/blob/master/docs/configuration.md#importer) for more
details.
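
Example invocation of the ETL pipeline deployment from step 1 (placeholder values; `KEYS_DIR` must contain the keys downloaded by setup-gcp-resources.sh, run from the repository root):

```bash
# Builds the Dataflow template, uploads it to the GCS bucket, and starts the streaming job
# using the ETL service account's JSON key found in KEYS_DIR.
PROJECT_ID=my-gcp-project DEPLOYMENT_NAME=testnet KEYS_DIR=./testnet-keys \
  ./scripts/deploy-etl-pipeline.sh
```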


Binary file modified docs/images/hedera_etl_ingestion.png
34 changes: 34 additions & 0 deletions scripts/common.sh
@@ -0,0 +1,34 @@
#!/usr/bin/env bash

if [[ `which gcloud` == "" ]]; then
echo "Couldn't find 'gcloud' cli. Make sure Cloud SDK is installed. https://cloud.google.com/sdk/docs#install_the_latest_cloud_sdk_version"
exit 1
fi

if [[ "${PROJECT_ID}" == "" ]]; then
echo "PROJECT_ID is not set"
exit 1
fi

if [[ "${DEPLOYMENT_NAME}" == "" ]]; then
echo "DEPLOYMENT_NAME is not set. It is needed to name GCP resources."
exit 1
fi

NAME=${DEPLOYMENT_NAME}

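# The ': ${VAR:=default}' expansions below only assign a value when the variable is unset
# or empty, so callers can override any resource name via environment variables.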
: ${KEYS_DIR:=`pwd`/${NAME}-keys}

: ${BUCKET_NAME:=${NAME}-hedera-etl}

: ${PUBSUB_TOPIC_NAME:=${NAME}-transactions-topic}
: ${PUBSUB_SUBSCRIPTION_ETL_BIGQUERY:=${NAME}-etl-bigquery}

: ${BQ_DATASET:=${NAME}}
: ${BQ_TRANSACTIONS_TABLE:=${PROJECT_ID}:${BQ_DATASET}.transactions}
: ${BQ_ERRORS_TABLE:=${PROJECT_ID}:${BQ_DATASET}.errors}
: ${BQ_DEDUPE_STATE_TABLE:=${PROJECT_ID}:${BQ_DATASET}.dedupe_state}

: ${SA_ETL_BIGQUERY:=${NAME}-etl-bigquery}
: ${SA_DEDUPLICATION:=${NAME}-deduplication-bigquery}
: ${SA_IMPORTER:=${NAME}-importer}
31 changes: 31 additions & 0 deletions scripts/create-tables.sh
@@ -0,0 +1,31 @@
#!/usr/bin/env bash

# Creates BigQuery tables needed by Dataflow and Deduplication jobs.
#
# Usage: PROJECT_ID=... DEPLOYMENT_NAME=<testnet/mainnet/etc> create-tables.sh
# Optionally, table names can be set using BQ_TRANSACTIONS_TABLE, BQ_ERRORS_TABLE, BQ_DEDUPE_STATE_TABLE

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
. ${SCRIPT_DIR}/common.sh

bq mk \
--table \
--description "Hedera network transactions" \
--time_partitioning_field consensusTimestampTruncated \
--time_partitioning_type DAY \
--clustering_fields transactionType \
${BQ_TRANSACTIONS_TABLE} \
${SCRIPT_DIR}/../hedera-etl-dataflow/src/main/resources/schema.json

bq mk \
--table \
--description "Hedera ETL Errors" \
${BQ_ERRORS_TABLE} \
${SCRIPT_DIR}/../hedera-etl-dataflow/src/main/resources/errors_schema.json

bq mk \
--table \
--description "BigQuery deduplication job state" \
${BQ_DEDUPE_STATE_TABLE} \
${SCRIPT_DIR}/../hedera-dedupe-bigquery/state-schema.json

41 changes: 41 additions & 0 deletions scripts/deploy-etl-pipeline.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

# Deploys ETL-Pipeline to GCP Dataflow
# This script assumes that resources have been allocated using setup-gcp-resources.sh
# Usage: PROJECT_ID=... DEPLOYMENT_NAME=<testnet/mainnet/etc> KEYS_DIR=... deploy-etl-pipeline.sh

if [[ "${KEYS_DIR}" == "" ]]; then
echo "KEYS_DIR is not set"
exit 1
fi

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
. ${SCRIPT_DIR}/common.sh

JSON_KEY=${KEYS_DIR}/${SA_ETL_BIGQUERY}.json
if [[ ! -f "${JSON_KEY}" ]]; then
echo "Couldn't find ${JSON_KEY}. Makes sure KEYS_DIR is correctly set up."
exit 1
fi
export GOOGLE_APPLICATION_CREDENTIALS=${JSON_KEY}

cd ${SCRIPT_DIR}/../hedera-etl-dataflow

echo "Building and uploading pipeline templates to GCS"

PIPELINE_FOLDER=gs://${BUCKET_NAME}/pipelines/etl-bigquery
mvn clean compile exec:java \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=${PIPELINE_FOLDER}/staging \
--tempLocation=${PIPELINE_FOLDER}/temp \
--templateLocation=${PIPELINE_FOLDER}/template \
--runner=DataflowRunner"

echo "Staring Dataflow job"

SUBSCRIPTION="projects/${PROJECT_ID}/subscriptions/${PUBSUB_SUBSCRIPTION_ETL_BIGQUERY}"
gcloud dataflow jobs run etl-pipeline-`date +"%Y%m%d-%H%M%S%z"` \
--gcs-location=${PIPELINE_FOLDER}/template \
--service-account-email=${SA_ETL_BIGQUERY}@${PROJECT_ID}.iam.gserviceaccount.com \
--parameters "inputSubscription=${SUBSCRIPTION},outputTransactionsTable=${BQ_TRANSACTIONS_TABLE},outputErrorsTable=${BQ_ERRORS_TABLE}"
84 changes: 84 additions & 0 deletions scripts/setup-gcp-resources.sh
@@ -0,0 +1,84 @@
#!/usr/bin/env bash
# Creates GCP resources like Service Accounts, GCS buckets, BigQuery datasets and tables, etc for various components of
# hedera-etl
# For more details, refer to docs/deployment.md.
#
# Usage: PROJECT_ID=... DEPLOYMENT_NAME=<testnet/mainnet/etc> setup-gcp-resources.sh
# Optionally, KEYS_DIR can be set to specify the directory where service account keys will be downloaded. Defaults to
# './${DEPLOYMENT_NAME}-keys'

set -e

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
. ${SCRIPT_DIR}/common.sh

#### Functions ####
create_service_account_with_roles()
{
local sa_name=$1
local roles=$2
local description="$3"

# Create service account
gcloud iam service-accounts create ${sa_name} \
--project=${PROJECT_ID} \
--description="${description}"

# Assign roles to the service account
for role in ${roles}; do
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member serviceAccount:${sa_name}@${PROJECT_ID}.iam.gserviceaccount.com \
--role ${role} > /dev/null # Project's complete IAM policy is dumped to console otherwise
echo "Assigned role ${role} to ${sa_name}"
done
}

create_service_account_key()
{
local sa_name=$1
local key_filename=${KEYS_DIR}/${sa_name}.json
# Download service account's key
gcloud iam service-accounts keys create ${key_filename} \
--iam-account=${sa_name}@${PROJECT_ID}.iam.gserviceaccount.com
}

#### Base resources ####
mkdir -p ${KEYS_DIR}

# Create BigQuery dataset and tables
bq mk --project_id=${PROJECT_ID} ${NAME}
DATASET_NAME=${BQ_DATASET} ${SCRIPT_DIR}/create-tables.sh

# Create PubSub topic for transactions
gcloud pubsub topics create ${PUBSUB_TOPIC_NAME} --project=${PROJECT_ID}

# Create GCS bucket for dataflow pipelines
gsutil mb -p ${PROJECT_ID} gs://${BUCKET_NAME}

#### Resources for ETL to BigQuery ####
gcloud pubsub subscriptions create ${PUBSUB_SUBSCRIPTION_ETL_BIGQUERY} \
--project=${PROJECT_ID} \
--topic=${PUBSUB_TOPIC_NAME} \
--message-retention-duration=7d \
--expiration-period=never

create_service_account_with_roles \
${SA_ETL_BIGQUERY} \
"roles/bigquery.dataEditor roles/dataflow.worker roles/pubsub.subscriber roles/storage.admin" \
"For pubsub --> bigquery dataflow controller"

create_service_account_key ${SA_ETL_BIGQUERY}

#### Resources for Deduplication task ####
create_service_account_with_roles \
${SA_DEDUPLICATION} \
"roles/bigquery.dataEditor roles/bigquery.jobUser roles/monitoring.metricWriter" \
"For BigQuery deduplication task"

create_service_account_key ${SA_DEDUPLICATION}

#### Resources for Hedera Mirror Importer ####
create_service_account_with_roles \
${SA_IMPORTER} "roles/pubsub.publisher" "For hedera mirror node importer (publishes to PubSub)"

create_service_account_key ${SA_IMPORTER}
