
varundhussa/dataflow-pubsub-dedup

 
 

Deduplication with Cloud PubSub and Cloud Dataflow on Google Cloud Platform

This is the source code that accompanies the solution "Deduplication of messages with Cloud PubSub and Cloud Dataflow". The sample code demonstrates three approaches to deduplication:

  • PubSubIO: com.google.examples.dfdedup.DedupWithPubSubIO
  • Distinct transform: com.google.examples.dfdedup.DedupWithDistinct
  • Custom state-based deduplication: com.google.examples.dfdedup.DedupWithStateAndGC
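
To make the third approach more concrete, here is a minimal sketch of state-based deduplication with garbage collection. It is written in plain Java with no Beam dependency and is not the code in DedupWithStateAndGC. The class name StatefulDeduplicator, the accept method, and the retention parameter are hypothetical names chosen for illustration. The idea it shows: remember each message ID when it is first seen, drop repeats, and expire remembered IDs after a retention period so the state does not grow without bound.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration of state-based deduplication with garbage
 * collection. Not taken from the repository.
 */
final class StatefulDeduplicator {
    // How long (in milliseconds) a message ID is remembered.
    private final long retentionMillis;
    // Message ID -> timestamp (ms) at which the ID was first seen.
    private final Map<String, Long> firstSeen = new HashMap<>();

    StatefulDeduplicator(long retentionMillis) {
        if (retentionMillis <= 0) {
            throw new IllegalArgumentException("retentionMillis must be positive");
        }
        this.retentionMillis = retentionMillis;
    }

    /**
     * Returns true if the message should be emitted, that is, if its ID
     * has not been seen within the retention period.
     */
    boolean accept(String messageId, long nowMillis) {
        collectGarbage(nowMillis);
        // putIfAbsent returns null only when the ID was not yet tracked.
        return firstSeen.putIfAbsent(messageId, nowMillis) == null;
    }

    /**
     * Forgets IDs whose retention period has elapsed. Scanning every entry
     * on each call is a simplification that keeps the sketch short.
     */
    private void collectGarbage(long nowMillis) {
        firstSeen.values().removeIf(seenAt -> nowMillis - seenAt >= retentionMillis);
    }

    /** Number of IDs currently held in state. */
    int trackedIds() {
        return firstSeen.size();
    }
}
```

With a retention of 10,000 ms, accept("msg-1", 0) returns true, accept("msg-1", 5000) returns false because the ID is still tracked, and accept("msg-1", 10000) returns true again because the earlier entry has expired. The retention period is therefore a trade-off: a longer value catches later duplicates but keeps more state. In a Beam pipeline this bookkeeping would typically be held in per-key state, with a timer handling the cleanup.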

End-to-end pipeline

You can run the following end-to-end pipeline to explore deduplication behavior across all three approaches:

[Figure: End-to-end flow]

Setting up resources

NOTE: If you're new to GCP, see the quickstarts for Cloud PubSub, BigQuery, and Cloud Dataflow.

BigQuery

Use the schema files under bqschemas/ to create the BigQuery tables.

Cloud PubSub

Running the Python-based data generator

Blah blah
