
Subreddit analytics

TL;DR: This project was built to analyze the subreddit "r/dataengineering", but can be used to pull and analyze data from any subreddit (as long as the Pushshift API is alive).

Why tho

When I'm not debugging data pipelines, I spend a lot of time reading how other engineers debug data pipelines on the dataengineering subreddit. I'm fun at parties, I swear.

Day in the life of a data engineer

I thought it would be a fun project to pull data from said subreddit and run some poor man's NLP analysis on it. Very meta I know.

In other words, this whole project could be described with the following question:

**What do data engineers talk about?**

The gist of it is as follows:

  1. Use Pushshift with PMAW to get as many submission (post) IDs as possible.
  2. Once the list is obtained, use asyncpraw to pull data for each submission found, as well as all of its comments (rough, hedged sketches of steps 1-4 follow this list).
  3. Perform some minor cleaning and persist said data in DuckDB. Why DuckDB? Because I wanted to try it. Also, it makes handling upserts more convenient compared to parquet files stashed somewhere.
  4. Take the 2 tables stored in DuckDB and upload them to BigQuery.
  5. Use dbt to model this data and perform some NLP shenanigans.
  6. ????
  7. Profit.
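For illustration, here is a minimal sketch of steps 1 and 2. It is not the project's actual code (that lives in prefect/flows_api_to_bq.py); the credentials, subreddit, limit and selected fields below are placeholders and assumptions.

```python
# Rough sketch of steps 1-2: Pushshift (via PMAW) for submission IDs,
# then asyncpraw to fetch each submission concurrently.
import asyncio

import asyncpraw
from pmaw import PushshiftAPI


def get_submission_ids(subreddit: str = "dataengineering", limit: int = 500) -> list[str]:
    """Ask Pushshift for as many submission IDs as it will return."""
    api = PushshiftAPI()
    return [post["id"] for post in api.search_submissions(subreddit=subreddit, limit=limit)]


async def fetch_submission(reddit: asyncpraw.Reddit, post_id: str) -> dict:
    """Pull one submission's metadata from the official Reddit API."""
    submission = await reddit.submission(id=post_id)
    return {
        "id": submission.id,
        "title": submission.title,
        "selftext": submission.selftext,
        "score": submission.score,
        "num_comments": submission.num_comments,
    }


async def main() -> None:
    async with asyncpraw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder: from your Reddit app
        client_secret="YOUR_CLIENT_SECRET",  # placeholder: from your Reddit app
        user_agent="subreddit-analytics sketch",
    ) as reddit:
        ids = get_submission_ids(limit=50)
        rows = await asyncio.gather(*(fetch_submission(reddit, i) for i in ids))
        print(f"Fetched {len(rows)} submissions")


if __name__ == "__main__":
    asyncio.run(main())
```

And a similarly hedged sketch of steps 3 and 4, where the table and dataset names are placeholders rather than the project's actual ones:

```python
# Rough sketch of steps 3-4: upsert rows into DuckDB, then load the table into BigQuery.
import duckdb
import pandas as pd
from google.cloud import bigquery

# Toy stand-in for the submissions pulled in steps 1-2.
df = pd.DataFrame({"id": ["abc123", "def456"], "title": ["post one", "post two"], "score": [10, 3]})

con = duckdb.connect("subreddit.duckdb")
existing = {name for (name,) in con.execute("SHOW TABLES").fetchall()}

if "submissions" not in existing:
    # DuckDB can query the local DataFrame `df` directly by name.
    con.execute("CREATE TABLE submissions AS SELECT * FROM df")
else:
    # Naive upsert: drop rows that are about to be re-inserted, then insert.
    con.execute("DELETE FROM submissions WHERE id IN (SELECT id FROM df)")
    con.execute("INSERT INTO submissions SELECT * FROM df")

# Push the DuckDB table to BigQuery (destination is a placeholder).
bigquery.Client().load_table_from_dataframe(
    con.execute("SELECT * FROM submissions").df(),
    "subreddit_analytics.submissions",
).result()
```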

The end result is this dashboard. The most interesting part is the "Top submissions" widget: it shows a list of all the submissions pulled, ordered by upvotes, and when filtered by "Topic" it shows the most interesting submissions pertaining to that topic.

🟢 Try playing with the "Topic" filter on the left and see how the "Top submissions" widget changes!


⚠️ 2023-04-26 - The dashboard will stay up for roughly 3 months or until I run out of GCP credits, whichever expires first.

[Dashboard screenshot]

Note: You will notice that a lot of submissions are missing. This is because Reddit's API doesn't allow querying over a range of time, so if one wants to, say, find all the submissions from the last year in a specific subreddit, they can't do it with the official API. The only way is to use Pushshift, but since it's a free service maintained by a single individual in their free time, completeness of results cannot be guaranteed.

Getting started

  1. Set up Google stuff
  • You first need to create a Reddit app. You can find instructions here.
  • Then you need to create a Google Cloud Platform project and set up a service account with the permission "BigQuery Admin". Create a key for this account and save it somewhere safe.
  • Install the Google Cloud SDK and run:
gcloud auth activate-service-account --key-file <path_to_your_credentials>.json
gcloud auth application-default login
  2. Set up the repo
  • Clone the repo locally, then create a .env file next to .env.example and fill it with the info from your Reddit app and your GCP project (an illustrative example follows this step).
  • Install all the dependencies and activate your environment with:
poetry install
poetry shell
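For reference, the .env file will contain something along these lines; the variable names below are only illustrative, the real ones are listed in .env.example:

```
# Illustrative only - use the variable names from .env.example.
REDDIT_CLIENT_ID=<your Reddit app client id>
REDDIT_CLIENT_SECRET=<your Reddit app client secret>
REDDIT_USER_AGENT=<any descriptive user agent>
GCP_PROJECT_ID=<your GCP project id>
```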
  3. Create the infrastructure

Now it's time to set up the infrastructure. Install Terraform with:

brew install terraform

Then:

cd infra/
terraform init
terraform plan
terraform apply

Terraform will ask you for the ID of your GCP project. You can get it easily by running:

gcloud projects list
  4. Run the pipeline!

Set up a free account with Prefect, then log in from your shell with:

prefect cloud login

After you're done, you need to create the necessary blocks before running the pipeline:

python prefect/blocks.py
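For context, prefect/blocks.py presumably registers blocks roughly along these lines; the block types and names below are assumptions for illustration, and the actual ones are defined in that file.

```python
# Hedged sketch of Prefect block registration; the block types and names here
# are assumptions, not necessarily what prefect/blocks.py actually creates.
from prefect.blocks.system import Secret
from prefect_gcp import GcpCredentials

# GCP service-account credentials, so flows can write to BigQuery.
GcpCredentials(service_account_file="path/to/your-key.json").save(
    "gcp-credentials", overwrite=True
)

# Reddit app secret, so flows can authenticate against the Reddit API.
Secret(value="YOUR_REDDIT_CLIENT_SECRET").save("reddit-client-secret", overwrite=True)
```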

Finally, run the pipeline:

python prefect/flows_api_to_bq.py --refresh true

Note: You can get a list of accepted params by running:

python prefect/flows_api_to_bq.py --help

By default, the pipeline will reuse the IDs saved in assets/post_ids.csv. To use the Pushshift API instead, open flows_api_to_bq.py, comment out line 28 and uncomment line 27.

You will see loads of logs being printed in the console. That's fine, it means it's running. Since the pipeline is async, there are lots of tasks running concurrently.

If you used the saved post IDs, it should take about 3-4 minutes to complete.

  5. Build dbt models

Once the pipeline has finished running, you can build the dbt models:

cd dbt/
dbt build

Note: The output tables are neither partitioned nor clustered, because the amount of data is very small, about a dozen megabytes in total.

Outputs

The following tables have been used for analysis:

  • int_submissions_score_adjusted

A table with all the submissions' data pulled by the pipeline, containing info such as number of comments, upvotes, submission text, etc. The score_adjusted field is calculated for each submission by taking into account its number of comments and upvotes relative to the whole dataset, accounting for outliers. It ranges between 0 and 1 and represents how "hot" a submission is.

  • tfidf_submissions_and_comments

TF-IDF score for each word in each document, where a document is the concatenation of a submission's body and all of its comments.

  • dim_words

A vocabulary meant to capture the most relevant and hottest topics across the r/dataengineering subreddit. The word_score is computed by summing the product of each word's TF-IDF value and its corresponding submission's score_adjusted, which in the context of TF-IDF is kind of an anti-pattern, but it gives satisfactory results. A rough sketch of the idea follows below.
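The actual computation happens in the dbt models, but here is a toy illustration of the word_score idea, in Python rather than SQL and with made-up data:

```python
# Toy illustration of dim_words' word_score: for each word, sum over all
# documents of tf-idf(word, document) * score_adjusted(document).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.DataFrame({
    "text": [
        "airflow dag failed again need help with retries",
        "dbt incremental models and bigquery partitioning question",
        "airflow vs dagster vs prefect for orchestration",
    ],
    "score_adjusted": [0.9, 0.4, 0.7],  # made-up values in [0, 1]
})

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs["text"])  # shape: (n_docs, n_words)

# Weight each document's tf-idf row by its score_adjusted, then sum per word.
word_score = tfidf.T.dot(docs["score_adjusted"].to_numpy())

dim_words = (
    pd.DataFrame({
        "word": vectorizer.get_feature_names_out(),
        "word_score": word_score,
    })
    .sort_values("word_score", ascending=False)
)
print(dim_words.head(10))
```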

About

End-to-end data pipeline to extract and analyze submissions from any subreddit using Pushshift, Python, dbt and BigQuery.
