This dataset was derived from a Cornell University dataset (http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). The Data 8 team transformed the original data (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies) to create a new dataset containing the frequency of 5000 common words in each movie. This project is my attempt to build a classifier that guesses whether a movie is a comedy or a thriller, using only how often words appear in the movie's screenplay. It shows my ability to build a k-nearest-neighbors classifier and test it on data. The project also involves exploratory data analysis using linear regression.
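To illustrate the k-nearest-neighbors idea described above, here is a minimal NumPy sketch. It is not the notebook's actual code: the function names (`knn_classify`, `accuracy`), the choice of Euclidean distance, and the tiny two-word toy data are all assumptions made for illustration. The real dataset has 5000 word-frequency columns per movie.

```python
from collections import Counter

import numpy as np


def knn_classify(train_features, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training rows."""
    train_features = np.asarray(train_features, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row (an assumed metric).
    distances = np.sqrt(np.sum((train_features - query) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_features, train_labels, test_features, test_labels, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    predictions = [
        knn_classify(train_features, train_labels, row, k) for row in test_features
    ]
    return float(np.mean([p == t for p, t in zip(predictions, test_labels)]))


# Hypothetical toy data: each row holds the frequencies of two made-up words.
train_features = [
    [0.020, 0.001],
    [0.018, 0.002],
    [0.022, 0.000],
    [0.002, 0.015],
    [0.001, 0.018],
    [0.003, 0.020],
]
train_labels = ["comedy", "comedy", "comedy", "thriller", "thriller", "thriller"]

test_features = [[0.019, 0.001], [0.002, 0.017]]
test_labels = ["comedy", "thriller"]

first_guess = knn_classify(train_features, train_labels, test_features[0], k=3)
test_accuracy = accuracy(train_features, train_labels, test_features, test_labels, k=3)
print(first_guess, test_accuracy)
```

Holding out a separate test set, as `accuracy` does here, gives a fairer estimate than scoring the classifier on the rows it was built from. With k=1, every training row would trivially be its own nearest neighbor.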
Tools: Jupyter Notebook, Python, NumPy, Matplotlib
Created as part of the Data 8 class @ UC Berkeley