Skip to content

This is my final project for my CIS 9440 Data Warehousing graduate course. I analyzed 2 datasets and went through the ETL process using Python and loaded the finalized dataset to BigQuery to create visualizations via Looker.

Notifications You must be signed in to change notification settings

sallywuhoo/CIS9440-NYC-Rodent-Complaints

Repository files navigation

Analyzing NYC311 Rodent Complaints and its Correlation with Weather from 2016-2022

For this final project for my CIS 9440 Data Warehousing and Analytics graduate course, my group and I analyzed 2016-2022 rodent complaint data reported to NYC 311 and explored whether there's a correlation between these complaints and the average daily weather. The rodent complaint dataset can be found on the NYC Open data website and the NYC weather data can be found on multiple weather websites, but for the purposes of this analysis, it was taken from Kaggle. As a group, we sketched out a dimensional model for the finalized data warehouse and identified KPIS. My responsibilities for this project was to extract, clean, and merge the datasets via Python and finally load the finalized dataset to BigQuery for our group to use to create queries for visualizations on Looker. In addition to the Looker visualizations, I've replicated them on this Jupyter Notebook.

ETL Process

Using Jupyter Notebooks and Python, the 2 CSV files of the weather and the 311 rodent complaints datasets were uploaded using the Pandas library and stored in 2 separate dataframes. Next, I cleaned the datasets by renaming the columns that the 2 datasets will be merged on and also dropping columns that are irrelevant for this analysis. Additionally, I added a column to identify whether a location was considered "inside" or "outside" for later analysis to identify areas that the most rodent complaints were made from. I've also created the fact table and dimensional tables on Jupyter Notebooks to show the different tables that our group identified when drafting the dimensional model. An intermediary step I performed on the weather data is calculating the average weather temperature and average precipitation rate as the original dataset contained the hourly temperatures.

To approach the transformation of the 2 datasets to 1, I merged the datasets on the "Year", "Month", and "Day" columns so the finalized dataframe contained the "Year", "Month", "Day", "Temperature", "Precipitation", "Borough", "Location", "Total Complaints", and "Full Date". Finally, this dataset was downloaded as a CSV and uploaded to BigQuery as a data warehouse that can be queried. As a group, we built visualizations to create a dashboard showcasing the relationship between 311 Rodent Complaints and the average NYC temperatures.

Seasonality Effect on Rodent Complaints

Plotting the average temperatures (blue line) and total rodent complaints (orange line) from 2016-2022, it's evident that as temperature rises in the warmer months such as spring and summer, so do the number of rodent complaints. However, there are 2 points worth noting; the first is a spike in complaints in 2017 and the second being the drop in 2022. I'm unable to find further news or research explaining why this is. seasonality

Total Complaints by Borough

It appears the borough with the most rodent complaints is Brooklyn then followed by Manhattan, Bronx, Queens, and finally Staten Island. This is supported by research done by MMPC in which their article "There Are Now 3 Million Rats in NYC, a 50% Increase Since 2010" shows that Brooklyn has the most number of rats at ~1.16M; as of 2022, Brooklyn has seen a 56% increase in rate population from 2010. The article further shows that Queens has the second greatest rat population at ~752K and is followed by Manhattan at ~512K. borough

Total Complaints by Location

For the purposes of this analysis, I only considered 2 locations: indoors and outdoors. I grouped all property types that were public into outdoors and the others into indoors. Looking at the bargraph, it appears that majority of the 311 complaints made were for locations indoors. However, this doesn't necessarily mean there are more rodents indoors than outdoors since the likelihood of someone making a rodent complaint is much higher indoors than if they were outdoors. indoor vs outdoor

Taking a closer look at the different location types that rodent complaints were made in, it appears that 3+ Family Apartment Buildings was the location with the most rodent complaints followed by 1-2 family dwellings. This could be because in apartment buildings, rodents have all the necessities to survive: shelter between walls and floorboards, warmth from heaters, and water/food from trash. This is a concerning find but it further proves that NYC is in need of help to eradicate the rising rodent population not only outside, but also inside apartment buildings. location type

Correlation between Rodent Complaints & Weather/Precipitation

Prior to plotting this heatmap, my hypothesis is that 1) there would be a significant positive correlation between the temperature and total complaints and 2) a significant negative correlation between the precipitation and total complaints. The reasoning behind this is for the former 1), I thought as temperatures rose so would the amount of rodents; for the latter 2), I thought that as there is rain or snow, there would not be any rodents running around. However, it appears that only my initial guess for 1) temperature and total complaints was proven correct with a positive correlation value of 0.2. However, my guess for 2) precipitation and total complaints was off as there is only a slight negative correlation between the 2 variables. correlation heatmap

About

This is my final project for my CIS 9440 Data Warehousing graduate course. I analyzed 2 datasets and went through the ETL process using Python and loaded the finalized dataset to BigQuery to create visualizations via Looker.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published