This project uses Google Cloud Platform (GCP) and PySpark to analyze tweet data about four universities. Social media plays an important role in social life today, and Twitter is one of the most influential platforms worldwide. Many universities use Twitter to interact with their stakeholders and to promote their brand. By analyzing tweets about universities, we can learn more about the people who post them and what concerns them.
The original dataset has 300M records (2TB+). This project selects the University of Chicago, Northeastern University, the University of Nebraska-Lincoln, and Brown University as a subset, and compares both the Twitterers and the tweet content across the four. Based on the analysis, we offer recommendations to help universities improve how they manage their Twitter accounts.
- Identify tweets related to UChicago, Brown University, Northeastern University, and the University of Nebraska-Lincoln.
- Complete a thorough EDA to identify which variables you can use to profile the Twitterers
- Identify the most prolific / influential Twitterers
  - By message volume
  - By message retweet
- How much are they tweeting about the Universities vs. other topics?
- Where are these Twitterers located?
  - For UChicago
  - For other universities
- Do you see any relationship between university locations and Twitterers’ locations?
  - Visualize the relationships
- What distinguishes University of Chicago Twitterers from Twitterers who tweet about other universities?
  - Visualize the trends
- What are the timelines of these tweets? Do you see significant peaks and valleys?
- Do you see data collection gaps?
- How unique are the messages for each of these universities? Are they mostly unique? Or are people mostly just copy-pasting the same text?
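
Two of the questions above, identifying university-related tweets and measuring how unique the messages are, can be illustrated with a minimal plain-Python sketch. The keyword patterns, the normalization rules, and the function names (`tag_universities`, `normalize`, `uniqueness_by_university`) are all assumptions made for illustration; the project's actual matching rules and PySpark implementation are not described here. In a PySpark job the same logic could be applied per row, for example through a UDF or an equivalent `rlike` filter.

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns (assumptions, not the project's real rules).
UNIVERSITY_PATTERNS = {
    "UChicago": re.compile(r"\buchicago\b|\buniversity of chicago\b", re.IGNORECASE),
    "Northeastern": re.compile(r"\bnortheastern university\b", re.IGNORECASE),
    "Nebraska-Lincoln": re.compile(r"\buniversity of nebraska\b", re.IGNORECASE),
    "Brown": re.compile(r"\bbrown university\b", re.IGNORECASE),
}


def tag_universities(text):
    """Return the sorted list of universities whose pattern matches the text."""
    return sorted(
        name for name, pattern in UNIVERSITY_PATTERNS.items() if pattern.search(text)
    )


def normalize(text):
    """Lowercase, drop a leading 'RT @user:' prefix and URLs, collapse whitespace."""
    text = re.sub(r"^rt @\w+:\s*", "", text.strip().lower())
    text = re.sub(r"https?://\S+", "", text)
    return " ".join(text.split())


def uniqueness_by_university(texts):
    """Map each university to (distinct normalized texts) / (matching tweets)."""
    distinct = defaultdict(set)
    totals = defaultdict(int)
    for text in texts:
        for university in tag_universities(text):
            distinct[university].add(normalize(text))
            totals[university] += 1
    return {u: len(distinct[u]) / totals[u] for u in totals}
```

A ratio close to 1.0 suggests mostly original messages, while a ratio close to 0 suggests heavy retweeting or copy-pasting. Treating a retweet as a duplicate of its source text is a design choice of this sketch, not something the project states.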