- Overview
- What you will accomplish
- Prerequisites
- Create your Reddit bot account
- Set up an S3 Bucket
- Deploy the AWS Glue data catalog in CloudFormation
- Set up Kinesis Firehose Delivery Stream
- Create a Key Pair for your streaming server
- Deploy the EC2 streaming server in CloudFormation
- Monitor the delivery stream
- Use Athena to develop insights
- Clean up the environment
- Conclusion
- Appendix
AWS provides several key services that make it easy to quickly deploy and manage data streaming in the cloud. Reddit is a popular social news aggregation, web content rating, and discussion website. At peak times, Reddit can see over 300,000 comments and 35,000 submissions an hour. The Reddit API offers developers a simple way to collect all of this data, making it a perfect use case for learning how to use Kinesis Firehose, S3, Glue, and Athena.
In this tutorial, you will play the role of a data architect looking to modernize a company’s streaming pipeline. You will create a Kinesis Firehose delivery stream from an EC2 server to an S3 data lake. With the help of AWS Glue and Amazon Athena, you’ll be able to develop insights on the data as it accumulates in your data lake.
- Create a Reddit App using the Reddit developer site
- Provision an S3 bucket to act as a data lake and the target for your stream data
- Provision a Kinesis Firehose Delivery Stream that will accept data from various sources and deliver it to the S3 bucket
- Deploy and run an EC2 streaming Python app via CloudFormation
- Create a Glue data catalog via CloudFormation to provide schemas and structure to your data
- Use Athena to directly query your S3 bucket with SQL
This tutorial requires:
- A laptop with Wi-Fi running Microsoft Windows, Mac OS X, or Linux
- An Internet browser such as Chrome or Firefox
- An AWS account to provision the AWS infrastructure
- Skill level: A basic understanding of desktop computing is helpful but not required
- AWS experience: Prior knowledge of base AWS infrastructure (VPC, EC2, S3) is helpful, but not required to complete this exercise
- Basic Linux experience: needed to troubleshoot any errors on the EC2 instance
- Follow the prompts to create a new Reddit account:
  - Provide an email address
  - Choose a username and password
  - Click Finish
- Once your account is created, go to the Reddit developer console.
- Select “are you a developer? Create an app...”
- Give your app a name.
- Select script. (This is important!)
- For the about url and redirect uri, use http://127.0.0.1
- You will now get a client_id (shown underneath your app name) and a secret
- Keep track of your Reddit account username, password, app client_id, and app secret. These will be used in tutorial Step 11.
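If you want to sanity-check these credentials before you need them later in the tutorial, a minimal PRAW snippet might look like the sketch below. This is an optional aside, not part of the tutorial's code; it assumes you have pip-installed praw, and the user_agent string is just an illustrative value.

```python
# Optional, hypothetical sanity check -- not part of the tutorial's code.
# Requires: pip install praw
import praw

reddit = praw.Reddit(
    client_id="<your client_id>",
    client_secret="<your secret>",
    username="<your reddit username>",
    password="<your reddit password>",
    user_agent="reddit-sentiment-tutorial",  # any descriptive string works
)

# Prints your username if the credentials authenticate successfully
print(reddit.user.me())
```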
- Your app information is now registered. Before you begin setting up the server or the delivery stream, you need a place to store the data that will be generated.
- Open the Amazon S3 console
- Choose Create bucket
- In the Bucket name field, type a unique DNS-compliant name for your new bucket. Create your own bucket name using the following naming guidelines:
  - The name must be unique across all existing bucket names in Amazon S3
  - Example: reddit-analytics-bucket-<add random number here>
  - After you create the bucket you cannot change the name, so choose wisely
  - Choose a bucket name that reflects the objects in the bucket, because the bucket name is visible in the URL that points to the objects you're going to put in your bucket
  - For information about naming buckets, see Rules for Bucket Naming in the Amazon Simple Storage Service Developer Guide
- For Region, choose US East (N. Virginia) as the region where you want the bucket to reside
- Keep the defaults and continue clicking Next
- Choose Create
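As an optional alternative to the console clicks above, the same bucket can be created with a short boto3 script. The bucket name below is only an example and must be replaced with your own globally unique name.

```python
# Optional boto3 alternative to the console steps above.
# The bucket name is an example only -- it must be globally unique.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1 no CreateBucketConfiguration/LocationConstraint is needed
s3.create_bucket(Bucket="reddit-analytics-bucket-12345")
```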
Now that you’ve created a bucket, you need a way to describe the structure of the data that will land in it.
- S3 is a place to store many different kinds of data and files. To provide those data files with structure that other services can reference, you need to set up a data catalog. AWS Glue is the perfect service for this use case.
In this step you will be using a tool called CloudFormation. Instead of going through the AWS console and creating Glue databases and Glue tables click by click, you can use CloudFormation to deploy the infrastructure quickly and easily.
You will use the CloudFormation YAML templates located in this GitHub repository.
- Go to the glue.yml file located here
- Right-click anywhere and select Save as…
- Rename the file from glue.txt to glue.yml
- Select All Files as the file format and select Save
- Open the AWS CloudFormation console
- If this is a new AWS CloudFormation account, click Create New Stack. Otherwise, click Create Stack
- In the Template section, select Upload a template file
- Select Choose File and upload the newly downloaded glue.yml template
- Decide on your stack name
- Under pBucketName, set your bucket name from the previous step
- Continue until the last step and click Create stack
- Click on the Events tab and wait until the stack status is CREATE_COMPLETE
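Once the stack reaches CREATE_COMPLETE, you can optionally confirm the catalog objects exist with boto3. The database and table names below (reddit_glue_db and raw_reddit_comments) are the ones referenced later in the Athena step; if your glue.yml uses different names, adjust accordingly.

```python
# Optional check that the Glue stack created the expected catalog objects.
# Database/table names are assumptions based on later steps of this tutorial.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
table = glue.get_table(DatabaseName="reddit_glue_db", Name="raw_reddit_comments")

# Print the column names and types defined in the catalog
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```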
- Now you have a destination for your data (S3) and a data catalog (AWS Glue). Next, let’s deploy the pipes that will allow data to travel between services.
- Open the Kinesis Data Firehose console or select Kinesis in the Services dropdown
- Choose Create Delivery Stream
- Delivery stream name – Type a name for the delivery stream
  Example: raw-reddit-comment-delivery-stream
- Keep the default settings on Step 1 – you will be using Direct PUT as the source. Scroll down and click Next
- In Step 2, enable record format conversion, pointing it at the Glue database and table you deployed in the previous step
- Click Next
- On the Destination page, choose the following options:
  - Destination – Choose Amazon S3
  - S3 bucket – Choose the existing bucket created in tutorial Step 6
  - S3 prefix – add "raw_reddit_comments/" as the prefix
  - S3 error prefix – add "raw_reddit_comments_error/" as the prefix
- Choose Next
- On the Configuration page, change the Buffer time to 60 seconds
- For IAM Role, click Create new or choose
- Review the IAM role summary, then choose Allow
- You should return to the Kinesis Data Firehose delivery stream set-up steps in the Kinesis Data Firehose console
- Choose Next
- On the Review page, review your settings, and then choose Create Delivery Stream
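As an optional smoke test, you can push a single hand-made record into the new stream with boto3 before the EC2 app exists. The field names below simply mirror the columns queried later in Athena, and the stream name is the example used in this step.

```python
# Optional smoke test for the new delivery stream -- not part of the tutorial's code.
# The stream name is the example from this step; the field values are placeholders.
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

test_record = {
    "subreddit": "aws",
    "author_name": "test_user",
    "comment_body": "hello from the smoke test",
    "comment_tb_sentiment": 0.0,
}

response = firehose.put_record(
    DeliveryStreamName="raw-reddit-comment-delivery-stream",
    Record={"Data": json.dumps(test_record) + "\n"},
)

# A RecordId in the response means Firehose accepted the record
print(response["RecordId"])
```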
- The pipeline and destination are now available for use. In the next several steps, you will deploy the Python application that generates Reddit comment data.
- Open the Amazon EC2 console or select EC2 under the Services dropdown
- In the navigation pane, under NETWORK & SECURITY, choose Key Pairs
  Note: The navigation pane is on the left side of the Amazon EC2 console. If you do not see the pane, it might be minimized; choose the arrow to expand the pane
- Choose Create Key Pair
- For Key pair name, enter a name for the new key pair (ex: RedditBotKey), and then choose Create
- The private key file is automatically downloaded by your browser. The base file name is the name you specified as the name of your key pair, and the file name extension is .pem. Save the private key file in a safe place
  Important: This is the only chance for you to save the private key file. You'll need to provide the name of your key pair when you launch an instance and the corresponding private key each time you connect to the instance
A key pair will allow you to securely access a server. In the next steps, you will deploy the server.
In this step you will be using a tool called CloudFormation. Instead of going through the AWS console and creating an EC2 instance click by click, you can utilize CloudFormation to deploy the infrastructure quickly. This CloudFormation template has EC2 user data to set up the machine. The EC2 user data achieves the following:
- Installs Python 3.6 and several libraries needed for the script to run
- Clones a GitHub repository that contains the Python script
- Updates the Python script with your credentials and parameters
- Executes the script to begin the data stream (a rough sketch of what such a script does is shown below)
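For orientation, the heart of such a streaming script might look like the sketch below. This is not the exact code from the repository (the real script lives in the GitHub repo and handles more fields and error cases); the field names simply mirror the columns queried later in Athena, and the credentials and stream name are placeholders.

```python
# Rough sketch only -- the actual script is in the tutorial's GitHub repository.
# Requires: pip install praw textblob boto3
import json

import boto3
import praw
from textblob import TextBlob

DELIVERY_STREAM_NAME = "raw-reddit-comment-delivery-stream"  # from the Firehose step

# Credentials from the "Create your Reddit bot account" step (placeholders here)
reddit = praw.Reddit(
    client_id="<your client_id>",
    client_secret="<your secret>",
    username="<your reddit username>",
    password="<your reddit password>",
    user_agent="reddit-sentiment-tutorial",
)

firehose = boto3.client("firehose")

# Stream new comments from across Reddit and push each one to Kinesis Firehose
for comment in reddit.subreddit("all").stream.comments():
    record = {
        "subreddit": comment.subreddit.display_name,
        "author_name": str(comment.author),
        "comment_body": comment.body,
        # TextBlob polarity ranges from -1 (negative) to +1 (positive)
        "comment_tb_sentiment": TextBlob(comment.body).sentiment.polarity,
    }
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM_NAME,
        Record={"Data": json.dumps(record) + "\n"},
    )
```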
You will use the CloudFormation YAML templates located in this GitHub repository.
- Go to the ec2.yml file located here.
- Right-click anywhere and select Save as…
- Rename the file from ec2.txt to ec2.yml
- Select All Files as the file format and select Save
- Open the AWS CloudFormation console or select CloudFormation under the Services dropdown
- Click Create New Stack / Create Stack
- In the Template section, select Upload a template file
- Select Choose File and upload the newly downloaded ec2.yml template
- Click Next
- Provide a stack name (ex: reddit-stream-server)
- For pKeyName, provide the key name that you created in tutorial Step 9
- Use your Reddit app info and Reddit account for the parameters pRedditAppSecret, pRedditClientID, pRedditUsername, and pRedditPassword
- You can leave the rest of the parameters at their default values
- Continue to click Next
- On the last step, acknowledge IAM resource creation and click Create Stack
- Wait for your EC2 instance to be created.
- Make a note of the Public IP and Public DNS Name given to the newly created instance. You can find these in the CloudFormation Outputs tab.
- Open the Amazon EC2 console or select EC2 under the Services dropdown
- Select INSTANCES in the navigation pane
- Ensure that an EC2 instance has been created and is running. (This can take several minutes to deploy)
- Now that the EC2 instance is provisioned and the script is running, the data is streaming to Kinesis Firehose. In the next step we’ll monitor the data as it moves through the delivery stream and into S3.
- Open the Amazon Kinesis Firehose console or select Kinesis in the Services dropdown
- Select the delivery stream created in step 8.
- Select the Monitoring tab
- Click the refresh button over the next 3 minutes. You should start to see records coming in
- If you are still not seeing data after 3-5 minutes, go to the Appendix for troubleshooting.
- Now let's check the S3 bucket
- Open the Amazon S3 console or select S3 in the Services dropdown
- Click the name of the bucket that you created in step 2
- Verify that records are being PUT into your S3 bucket under the raw_reddit_comments/ prefix
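If you prefer to check from a script instead of the console, a short boto3 sketch can list the delivered objects under the Firehose prefix; replace the bucket name with the one you created earlier.

```python
# Optional: list delivered objects under the Firehose prefix.
# Replace the bucket name with the one you created earlier.
import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(
    Bucket="reddit-analytics-bucket-12345",
    Prefix="raw_reddit_comments/",
    MaxKeys=10,
)

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```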
Now that data is streaming into S3, you can use the Glue data catalog you deployed earlier to query those S3 files.
- Now that you have all of your infrastructure in place, you can finally begin to analyze the data streaming into your data lake. You will use Amazon Athena, a great tool for ad-hoc queries on S3 data.
- Open the Amazon Athena console or select Athena in the Services dropdown
- Choose the Glue database (reddit_glue_db) from the list on the left
- Select the table (raw_reddit_comments) to view the table schema
- You should now be able to use SQL to query the table (S3 data)
Here are some example queries to begin exploring the data streaming into S3:
-- total number of comments
select count(*) from raw_reddit_comments;

-- general sentiment of reddit today
select round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment from raw_reddit_comments where comment_date like '%2019-08-22%';

-- total comments collected per subreddit
select count(*) as num_comments, subreddit from raw_reddit_comments group by subreddit order by num_comments DESC;

-- average sentiment per subreddit
select round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment, subreddit from raw_reddit_comments group by subreddit order by avg_comment_tb_sentiment DESC;

-- list all subreddits
select distinct(subreddit) from raw_reddit_comments;

-- top 10 most positive comments by subreddit
select subreddit, comment_body from raw_reddit_comments where subreddit = '${subreddit}' order by comment_tb_sentiment DESC limit 10;

-- most active subreddits and their sentiment
select subreddit, count(*) as num_comments, round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment from raw_reddit_comments group by subreddit order by num_comments DESC;

-- search term frequency by subreddit where comments greater than 5
select subreddit, count(*) as comment_occurrences from raw_reddit_comments where LOWER(comment_body) like '%puppy%' group by subreddit having count(*) > 5 order by comment_occurrences desc;

-- search term sentiment by subreddit
select subreddit, round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment from raw_reddit_comments where LOWER(comment_body) like '%puppy%' group by subreddit having count(*) > 5 order by avg_comment_tb_sentiment desc;

-- top 25 most positive comments about a search term
select subreddit, author_name, comment_body, comment_tb_sentiment from raw_reddit_comments where LOWER(comment_body) like '%puppy%' order by comment_tb_sentiment desc limit 25;

-- total sentiment for search term
SELECT round(avg(comment_tb_sentiment), 4) as avg_comment_tb_sentiment
FROM (
  SELECT subreddit, author_name, comment_body, comment_tb_sentiment
  FROM raw_reddit_comments
  WHERE LOWER(comment_body) LIKE '%puppy%')
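Athena queries can also be run programmatically. The boto3 sketch below runs one of the queries above and prints the results; the output location is an illustrative placeholder and can point at any S3 path you own (for example, a prefix in the tutorial bucket).

```python
# Optional sketch: run an Athena query with boto3 instead of the console.
# The output location is a placeholder -- point it at any S3 path you own.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString="select count(*) from raw_reddit_comments",
    QueryExecutionContext={"Database": "reddit_glue_db"},
    ResultConfiguration={"OutputLocation": "s3://reddit-analytics-bucket-12345/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers)
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```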
- Hopefully by now you have found some interesting insights into Reddit and the overall public sentiment. Athena is a great service for ad-hoc queries like this. You are approaching the end of this tutorial, so you will start terminating services and instances to prevent further billing.
Do not skip this step! Leaving AWS resources running without tearing them down can result in a bill at the end of the month. Make sure you follow these steps to remove the resources you’ve created.
- EC2 – Your EC2 instance was created from a CloudFormation template, so you’ll delete the stack and the key pair:
  - Open the AWS CloudFormation console or select CloudFormation under the Services dropdown
  - On the left panel that lists stacks, click on the EC2 stack you’ve created
  - Click Delete at the top of the pane
  - Open the Amazon EC2 console or select EC2 under the Services dropdown
  - In the navigation pane, under NETWORK & SECURITY, choose Key Pairs
  - Click on the key pair name you’ve created and hit Delete at the top
- Kinesis –
  - Open the Kinesis Data Firehose console or select Kinesis in the Services dropdown
  - Click on the delivery stream you’ve created
  - In the top right corner, click Delete delivery stream
- Glue –
  - Open the AWS CloudFormation console or select CloudFormation under the Services dropdown
  - On the left panel that lists stacks, click on the Glue stack you’ve created
  - Click Delete at the top of the pane
- S3 –
  - Open the Amazon S3 console
  - Select the bucket you created, empty it if it still contains objects, and hit Delete
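If the console refuses to delete a non-empty bucket, a short boto3 sketch can empty and remove it; replace the bucket name with your own, and note that this permanently deletes the collected data.

```python
# Optional boto3 cleanup -- this permanently deletes the collected data.
# Replace the bucket name with your own.
import boto3

bucket = boto3.resource("s3").Bucket("reddit-analytics-bucket-12345")

# A bucket must be empty before it can be deleted
bucket.objects.all().delete()
bucket.delete()
```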
In this tutorial, you have walked through the process of deploying a sample Python application that uses the Reddit API and AWS SDK for Python to stream Reddit data into Amazon Kinesis Firehose. You learned basic operations to deploy a real-time data streaming pipeline and data lake. Finally, you developed insights on the data using Amazon Athena’s ad-hoc SQL querying.
- Find the Public IP address that you noted down in Step 10 and the key pair you downloaded in Step 9.
- Open up a terminal
- Go to the directory that your key pair was downloaded to.
- Ensure the key has the correct permissions:
  chmod 400 <key pair name>.pem
- SSH into the machine with the following command:
  ssh -i <insert your key pair name here> ec2-user@<insert public IP address here>
- Confirm that the correct credentials have been added to your application with the following command:
  sudo cat /reddit/analyzing-reddit-sentiment-with-aws/python-app/praw.ini
- Confirm that the correct delivery stream name was added to your application with the following command. Look for DeliveryStreamName=’<your delivery stream name>’
  sudo cat /reddit/analyzing-reddit-sentiment-with-aws/python-app/comment-stream.py
- If there are errors found, delete the CloudFormation stack that didn’t work properly. Go back and retry Step 10. If there are no errors, you can check the logs:
  sudo tail /tmp/reddit-stream.log
Some common errors include:
DEBUG:prawcore:Response: 503 (Reddit servers are down)
DEBUG:prawcore:Response: 502 (Reddit server request error)
DEBUG:prawcore:Response: 403 (Your Reddit username/password is incorrect)
This sample code is made available under the MIT-0 license. See the LICENSE file.