
# Deploying Dataflow jobs

This document provides guidance on how to deploy Dataflow jobs using the infrastructure created by the blueprint. The blueprint uses VPC Service Controls to protect the Google services. For general information about Dataflow jobs, see Deploying a Pipeline.

## User access

The user or service account deploying the Dataflow pipeline must be in the access level of the VPC Service Controls perimeter. Use the perimeter_additional_members input to add the user or service account to the perimeter. Groups are not allowed in a VPC Service Controls perimeter.

The blueprint creates an egress rule that allows access to an external repository to fetch Dataflow templates. To use this egress rule, you must:

- Provide the project number of the project that hosts the external repository in the sdx_project_number variable.
- Add the user or service account deploying the Dataflow job to the data_ingestion_dataflow_deployer_identities variable. The Terraform service account is automatically added to this rule.

To use external repositories in more than one project, create a copy of the default egress rule with the project numbers of the other projects, and add the new rule to the data_ingestion_egress_policies variable.
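
As an illustration only, these inputs can be passed to Terraform as environment variables when planning or applying the blueprint. The member strings and the project number below are placeholders:

```bash
# Illustrative sketch only: pass the blueprint inputs as Terraform environment
# variables. The member strings and project number are placeholders.
export TF_VAR_perimeter_additional_members='["user:dataflow-deployer@example.com"]'
export TF_VAR_sdx_project_number="<SDX_PROJECT_NUMBER>"
export TF_VAR_data_ingestion_dataflow_deployer_identities='["user:dataflow-deployer@example.com"]'

terraform plan
```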

## APIs

You must enable all the APIs required by the Dataflow job in the Data Ingestion project. See the README file for the list of APIs enabled by default in the Data Ingestion project.
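
For example, an additional API can be enabled with gcloud. The API shown below is only an illustration; the actual list depends on your pipeline:

```bash
# Illustrative sketch: enable an additional API needed by your Dataflow job in
# the Data Ingestion project. The API shown is only an example.
export DATA_INGESTION_PROJECT_ID=<DATA_INGESTION_PROJECT_ID>

gcloud services enable dlp.googleapis.com \
    --project="${DATA_INGESTION_PROJECT_ID}"
```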

## Service Accounts Roles

You must grant all the additional roles required by the Data ingestion Dataflow Controller Service Account before deploying the Dataflow job. Check the current roles associated with the Data ingestion Dataflow Controller Service Account in the files linked below:
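
As an illustration only (the actual roles depend on your pipeline), an additional role can be granted to the controller service account with gcloud. The role below is a placeholder:

```bash
# Illustrative sketch: grant an additional role to the Dataflow Controller
# Service Account. The role shown is a placeholder for what your job requires.
export DATA_INGESTION_PROJECT_ID=<DATA_INGESTION_PROJECT_ID>
export DATAFLOW_CONTROLLER_SA=<DATAFLOW_CONTROLLER_SA_EMAIL>

gcloud projects add-iam-policy-binding "${DATA_INGESTION_PROJECT_ID}" \
    --member="serviceAccount:${DATAFLOW_CONTROLLER_SA}" \
    --role="roles/bigquery.dataEditor"
```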

## Providing a subnetwork

You must provide a subnetwork to deploy a Dataflow job.

We do not recommend using the default network in the Data Ingestion project.

If you are using a Shared VPC, you must add its subnetworks as trusted subnetworks using the trusted_shared_vpc_subnetworks variable. See the inputs section for additional information.

The subnetwork must be configured for Private Google Access. Make sure you have configured all the firewall rules and DNS configurations listed in the sections below.
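
For reference, Private Google Access can be enabled on an existing subnetwork with gcloud. The names below are placeholders:

```bash
# Reference sketch: enable Private Google Access on the subnetwork that the
# Dataflow workers will use. All values are placeholders.
gcloud compute networks subnets update <SUBNETWORK_NAME> \
    --project=<NETWORK_HOST_PROJECT_ID> \
    --region=<REGION> \
    --enable-private-ip-google-access
```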

### Firewall rules

### DNS configurations

## Temporary and Staging Location

Use the data_ingestion_dataflow_bucket_name output of the main module as the Temporary and Staging Location bucket when configuring the pipeline options.
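
A minimal sketch, assuming the blueprint outputs are exposed at the Terraform root module and the output contains the bare bucket name:

```bash
# Minimal sketch: read the bucket name from the Terraform output and build the
# locations used by the gcloud example later in this document. Assumes the
# output is the bucket name without the gs:// prefix.
export DATAFLOW_BUCKET="gs://$(terraform output -raw data_ingestion_dataflow_bucket_name)"

# Passed to the job as:
#   --staging-location="${DATAFLOW_BUCKET}/staging/"
#   --temp-location="${DATAFLOW_BUCKET}/tmp/"
```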

## Dataflow Worker Service Account

Use the dataflow_controller_service_account_email output of the main module as the Dataflow Controller Service Account.

Note: The user or service account used to deploy Dataflow jobs must have roles/iam.serviceAccountUser on the Dataflow Controller Service Account.
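
A minimal sketch, assuming the blueprint outputs are exposed at the Terraform root module; the deployer email and project are placeholders:

```bash
# Minimal sketch: read the controller service account email from the Terraform
# output and allow a deployer identity to act as it. The deployer email and
# project are placeholders.
export SERVICE_ACCOUNT_EMAIL="$(terraform output -raw dataflow_controller_service_account_email)"

gcloud iam service-accounts add-iam-policy-binding "${SERVICE_ACCOUNT_EMAIL}" \
    --project=<DATA_INGESTION_PROJECT_ID> \
    --member="user:<DEPLOYER_EMAIL>" \
    --role="roles/iam.serviceAccountUser"
```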

## Customer Managed Encryption Key

Use the cmek_data_ingestion_crypto_key output of the main module as the Dataflow KMS Key.
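
A minimal sketch, assuming the blueprint outputs are exposed at the Terraform root module:

```bash
# Minimal sketch: read the CMEK resource name from the Terraform output; it is
# passed to the job with the --dataflow-kms-key flag shown in the gcloud
# example below.
export DATAFLOW_KMS_KEY="$(terraform output -raw cmek_data_ingestion_crypto_key)"
```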

## Disable Public IPs

Disabling public IPs helps to better secure your data processing infrastructure. Make sure you have your subnetwork configured as described in the Providing a subnetwork section above.

## Enable Streaming Engine

Enabling Streaming Engine is important to obtain the full performance benefits of the infrastructure. You can learn more about it in the documentation.

## Deploying Dataflow Flex Jobs

We recommend using Dataflow Flex Templates. You can learn more about the differences between Classic and Flex Templates here.

### Deploying with Terraform

Use the google_dataflow_flex_template_job resource.

### Deploying with gcloud Command

Run the following commands to create a Dataflow Flex Template job using the gcloud command.

```bash
# Replace the placeholder values with the outputs and resources of your environment.
export PROJECT_ID=<PROJECT_ID>
export DATAFLOW_BUCKET=<DATAFLOW_BUCKET>
export DATAFLOW_KMS_KEY=<DATAFLOW_KMS_KEY>
export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
export SUBNETWORK=<SUBNETWORK_SELF_LINK>

# Launch the Flex Template job in the Data Ingestion project.
gcloud dataflow flex-template run "TEMPLATE_NAME`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location="TEMPLATE_NAME_LOCATION" \
    --project="${PROJECT_ID}" \
    --staging-location="${DATAFLOW_BUCKET}/staging/" \
    --temp-location="${DATAFLOW_BUCKET}/tmp/" \
    --dataflow-kms-key="${DATAFLOW_KMS_KEY}" \
    --service-account-email="${SERVICE_ACCOUNT_EMAIL}" \
    --subnetwork="${SUBNETWORK}" \
    --region="us-east4" \
    --disable-public-ips \
    --enable-streaming-engine
```

For more details about gcloud dataflow flex-template, see the command documentation.

Some parameters, such as a table schema, may need to contain a comma (,). In these cases, you must use gcloud topic escaping.
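
For illustration, the alternate-delimiter syntax from gcloud topic escaping keeps a comma-separated schema as a single parameter value. The parameter names and values below are placeholders:

```bash
# Illustrative sketch: the ^;^ prefix switches the --parameters delimiter from
# a comma to a semicolon, so the schema value can keep its commas. Parameter
# names and values are placeholders.
gcloud dataflow flex-template run "TEMPLATE_NAME`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location="TEMPLATE_NAME_LOCATION" \
    --region="us-east4" \
    --parameters="^;^tableSchema=name:STRING,age:INT64;outputTable=dataset.table"
```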