In this project, we take almost 60,000 OKCupid dating profiles, and we look to:
- run various statistical analyses to gain insight into the profiles
- create data visualizations that make our analysis easier to read and reveal patterns and trends in the data
- create various machine learning models to accurately predict variables from a user's basic inputs
The data was scraped from OKCupid and uploaded here.
Three machine learning models were used:
- Linear Regression
- K Neighbors Classification
- Decision Tree Classification
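
As a rough illustration of how the three model types above can be fit and scored, here is a minimal sketch using scikit-learn. It runs on synthetic stand-in data, not the OKCupid profiles, and the hyperparameters (`n_neighbors=5`, `max_depth=5`) and helper names (`evaluate_classifier`, `evaluate_regressor`) are assumptions for the example, not the settings used in this project.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, f1_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the OKCupid profiles).
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=8, noise=10.0, random_state=0
)


def evaluate_classifier(model, X, y):
    """Fit a classifier and report accuracy and macro F1 on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
    }


def evaluate_regressor(model, X, y):
    """Fit a regressor and report R^2 on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model.fit(X_train, y_train)
    return {"r2": r2_score(y_test, model.predict(X_test))}


# Hyperparameters below are illustrative assumptions.
results = {
    "linear_regression": evaluate_regressor(LinearRegression(), X_reg, y_reg),
    "k_neighbors": evaluate_classifier(
        KNeighborsClassifier(n_neighbors=5), X_clf, y_clf
    ),
    "decision_tree": evaluate_classifier(
        DecisionTreeClassifier(max_depth=5, random_state=0), X_clf, y_clf
    ),
}

for name, scores in results.items():
    print(name, scores)
```

Linear regression predicts a continuous target, so it is scored with R² here, while the two classifiers are scored with accuracy and macro F1, which averages the per-class F1 scores with equal weight.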
This project was intensive in data cleaning, as much of the data was either missing or nonsensical. Automating the cleaning process, and then the analysis of the newly cleaned data, was easily the longest and hardest part. In terms of modeling, our most accurate model was a decision tree, with almost 80% accuracy and a macro-averaged F1 score of 80%. Overall, the decision tree was the most effective approach for the output variable we selected.
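
To give a sense of what an automated cleaning step for missing and nonsensical values can look like, here is a small pandas sketch. The column names (`age`, `height`, `body_type`), the valid ranges, and the median/"unknown" fill strategy are all hypothetical choices for illustration; they are not necessarily the rules used in this project.

```python
import numpy as np
import pandas as pd


def clean_profiles(df, min_age=18, max_age=100, min_height=36, max_height=96):
    """Return a cleaned copy of the profiles.

    All thresholds and column names are illustrative assumptions.
    """
    cleaned = df.copy()

    # Treat out-of-range numeric values as missing.
    cleaned["age"] = cleaned["age"].where(
        cleaned["age"].between(min_age, max_age)
    )
    cleaned["height"] = cleaned["height"].where(
        cleaned["height"].between(min_height, max_height)
    )

    # Fill numeric gaps with the column median.
    for col in ["age", "height"]:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())

    # Fill categorical gaps with an explicit placeholder.
    cleaned["body_type"] = cleaned["body_type"].fillna("unknown")
    return cleaned


# Tiny made-up sample with one nonsensical age, one nonsensical height,
# and a few missing values.
raw = pd.DataFrame(
    {
        "age": [25, 110, 31, np.nan],
        "height": [70.0, 68.0, 1.0, 65.0],
        "body_type": ["fit", None, "average", "thin"],
    }
)

clean = clean_profiles(raw)
print(clean)
```

Wrapping the rules in one function keeps the cleaning repeatable, so the same steps can be rerun whenever the raw data or the thresholds change.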