This document provides guidance on how to deploy Dataflow jobs using the infrastructure created by the blueprint. The blueprint uses VPC Service Controls to protect the Google services. For general information about deploying Dataflow jobs, see Deploying a Pipeline.
The user or service account deploying the Dataflow pipeline must be in the access level of the VPC Service Controls perimeter. Use the input `perimeter_additional_members` to add the user or service account to the perimeter. Groups are not allowed in a VPC Service Controls perimeter.
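For example, a minimal `terraform.tfvars` sketch for this input, using placeholder identities, could look like the following:

```hcl
# terraform.tfvars -- the identities below are placeholders.
perimeter_additional_members = [
  "user:dataflow-deployer@example.com",
  "serviceAccount:deployer-sa@my-ci-project.iam.gserviceaccount.com",
]
```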
The blueprint creates an egress rule that allows access to an external repository to fetch Dataflow templates. To use this egress rule you must:

- Provide the project number of the project that hosts the external repository in the variable `sdx_project_number`.
- Add the user or service account deploying the Dataflow job to the variable `data_ingestion_dataflow_deployer_identities`. The Terraform service account is automatically added to this rule.

To use external repositories in more than one project, create a copy of the default egress rule with the project numbers of the other projects and add the new rule to the variable `data_ingestion_egress_policies`.
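As a sketch, assuming the Dataflow templates live in a single external repository project, the related inputs could be set as follows (all values are placeholders):

```hcl
# terraform.tfvars -- the project number and the identity are placeholders.
sdx_project_number = "123456789012"

data_ingestion_dataflow_deployer_identities = [
  "user:dataflow-deployer@example.com",
]
```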
You must enable all the APIs required by the Dataflow job in the Data Ingestion project. See the list of APIs enabled by default in the Data Ingestion project in the README file.
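If the Dataflow job needs an API that is not enabled by default, one way to enable it is with the `google_project_service` resource. The sketch below is an illustration only; the project ID and service name are placeholders:

```hcl
# Enables an additional API required by the Dataflow job in the Data Ingestion
# project. The project ID and the service name are placeholders.
resource "google_project_service" "dataflow_job_extra_api" {
  project            = "data-ingestion-project-id"
  service            = "dlp.googleapis.com"
  disable_on_destroy = false
}
```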
You must grant all the additional roles required by the Data Ingestion Dataflow Controller Service Account before deploying the Dataflow job. Check the current roles associated with the Data Ingestion Dataflow Controller Service Account in the files linked below.
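As an illustration, an additional project-level role could be granted with `google_project_iam_member`. This is only a sketch: it assumes the main module is instantiated as `module.secured_data_warehouse`, and the project ID and role are placeholders:

```hcl
# Grants an extra role required by the pipeline to the Data Ingestion Dataflow
# Controller Service Account. The project ID, the role, and the module local
# name are placeholders.
resource "google_project_iam_member" "dataflow_controller_extra_role" {
  project = "data-ingestion-project-id"
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:${module.secured_data_warehouse.dataflow_controller_service_account_email}"
}
```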
You must provide a subnetwork to deploy a Dataflow job.
We do not recommend using the default network in the Data Ingestion project.
If you are using a Shared VPC, you must add its subnetworks as trusted subnetworks using the `trusted_shared_vpc_subnetworks` variable. See the inputs section for additional information.
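For example, a minimal sketch of this input, using a placeholder subnetwork self link:

```hcl
# terraform.tfvars -- the subnetwork self link is a placeholder.
trusted_shared_vpc_subnetworks = [
  "https://www.googleapis.com/compute/v1/projects/shared-vpc-host-project/regions/us-east4/subnetworks/dataflow-subnetwork",
]
```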
The subnetwork must be configured for Private Google Access. Make sure you have configured all the firewall rules and DNS configurations listed in the sections below.
- All egress should be denied.
- Allow only Restricted API egress over TCP on port 443.
- Allow only Private API egress over TCP on port 443.
- Allow ingress for Dataflow workers over TCP on ports 12345 and 12346.
- Allow egress for Dataflow workers over TCP on ports 12345 and 12346.
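As an illustration of the first rule, a default deny-all-egress firewall rule could be declared as follows. This is only a sketch with placeholder project and network names; the allow rules listed above would be created the same way, with a higher priority (lower number):

```hcl
# Denies all egress traffic by default. The more specific allow rules (Restricted
# and Private API egress on TCP 443, Dataflow worker traffic on TCP 12345-12346)
# must use a higher priority (lower number). Project and network are placeholders.
resource "google_compute_firewall" "deny_all_egress" {
  name               = "deny-all-egress"
  project            = "data-ingestion-project-id"
  network            = "dataflow-network"
  direction          = "EGRESS"
  priority           = 65535
  destination_ranges = ["0.0.0.0/0"]

  deny {
    protocol = "all"
  }
}
```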
Use the `data_ingestion_dataflow_bucket_name` output of the main module as the Temporary and Staging Location bucket when configuring the pipeline options.
Use the `dataflow_controller_service_account_email` output of the main module as the Dataflow Controller Service Account.

Note: The user or service account being used to deploy Dataflow jobs must have `roles/iam.serviceAccountUser` on the Dataflow Controller Service Account.
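One way to grant this role is with `google_service_account_iam_member`. The sketch below assumes the main module is instantiated as `module.secured_data_warehouse`; the project ID and the deployer identity are placeholders:

```hcl
# Allows the deployer identity to impersonate the Dataflow Controller Service
# Account. The project ID, the deployer identity, and the module local name
# are placeholders.
resource "google_service_account_iam_member" "deployer_acts_as_controller" {
  service_account_id = "projects/data-ingestion-project-id/serviceAccounts/${module.secured_data_warehouse.dataflow_controller_service_account_email}"
  role               = "roles/iam.serviceAccountUser"
  member             = "user:dataflow-deployer@example.com"
}
```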
Use the `cmek_data_ingestion_crypto_key` output of the main module as the Dataflow KMS Key.
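If you deploy the job with gcloud, one convenient pattern is to re-export these values from your root configuration so that they can be read with `terraform output` and used in the commands shown later in this document. This sketch assumes the main module is instantiated as `module.secured_data_warehouse`:

```hcl
# Re-exports the main module outputs so they can be read with "terraform output".
# The module local name "secured_data_warehouse" is an assumption.
output "dataflow_bucket_name" {
  value = module.secured_data_warehouse.data_ingestion_dataflow_bucket_name
}

output "dataflow_controller_service_account_email" {
  value = module.secured_data_warehouse.dataflow_controller_service_account_email
}

output "dataflow_kms_key" {
  value = module.secured_data_warehouse.cmek_data_ingestion_crypto_key
}
```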
Disabling Public IPs helps to better secure your data processing infrastructure. Make sure you have your subnetwork configured as described in the Subnetwork section.
Enabling Streaming Engine is important to get all the performance benefits of the infrastructure. You can learn more about it in the documentation.
We recommend using Flex Templates. You can learn more about the differences between Classic and Flex Templates here. Use the `google_dataflow_flex_template_job` resource.
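The following is a minimal sketch of this resource wired to the module outputs described above. It assumes the main module is instantiated as `module.secured_data_warehouse`, a recent google-beta provider version that supports the environment arguments shown, and placeholder values for the project, template path, and subnetwork:

```hcl
# A sketch only: the project ID, template path, subnetwork self link, module
# local name, and template parameters are placeholders.
resource "google_dataflow_flex_template_job" "data_ingestion" {
  provider = google-beta

  name                    = "dataflow-flex-job"
  project                 = "data-ingestion-project-id"
  region                  = "us-east4"
  container_spec_gcs_path = "gs://external-template-bucket/templates/my-template.json"

  # Assumes the bucket output is a bucket name, hence the gs:// prefix.
  staging_location = "gs://${module.secured_data_warehouse.data_ingestion_dataflow_bucket_name}/staging"
  temp_location    = "gs://${module.secured_data_warehouse.data_ingestion_dataflow_bucket_name}/tmp"

  kms_key_name          = module.secured_data_warehouse.cmek_data_ingestion_crypto_key
  service_account_email = module.secured_data_warehouse.dataflow_controller_service_account_email
  subnetwork            = "https://www.googleapis.com/compute/v1/projects/shared-vpc-host-project/regions/us-east4/subnetworks/dataflow-subnetwork"

  # Recommendations from this document: no public IPs and Streaming Engine on.
  ip_configuration        = "WORKER_IP_PRIVATE"
  enable_streaming_engine = true

  parameters = {
    # Template-specific parameters (placeholder).
    inputFilePattern = "gs://placeholder-input-bucket/*.csv"
  }
}
```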
Run the following commands to create a Dataflow Flex Template job using the gcloud command.
```sh
export PROJECT_ID=<PROJECT_ID>
export DATAFLOW_BUCKET=<DATAFLOW_BUCKET>
export DATAFLOW_KMS_KEY=<DATAFLOW_KMS_KEY>
export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
export SUBNETWORK=<SUBNETWORK_SELF_LINK>

gcloud dataflow flex-template run "TEMPLATE_NAME`date +%Y%m%d-%H%M%S`" \
  --template-file-gcs-location="TEMPLATE_NAME_LOCATION" \
  --project="${PROJECT_ID}" \
  --staging-location="${DATAFLOW_BUCKET}/staging/" \
  --temp-location="${DATAFLOW_BUCKET}/tmp/" \
  --dataflow-kms-key="${DATAFLOW_KMS_KEY}" \
  --service-account-email="${SERVICE_ACCOUNT_EMAIL}" \
  --subnetwork="${SUBNETWORK}" \
  --region="us-east4" \
  --disable-public-ips \
  --enable-streaming-engine
```
For more details about `gcloud dataflow flex-template`, see the command documentation.
Some parameter values, such as a table schema, may need to contain the comma character (`,`). In these cases you need to use gcloud topic escaping.