Albert Y. Kim and Adriana Escobedo-Land
Data and code for "OkCupid Profile Data for Introductory Statistics and Data Science Courses" (Journal of Statistics Education, July 2015, Volume 23, Number 2).
- JSE.bib: bibliography file
- JSE.pdf: PDF of document
- JSE.Rnw: R Sweave document to recreate JSE.pdf
- JSE.R: R code used in document
- okcupid_codebook.txt: codebook for all variables
- profiles.csv.zip: CSV file of profile data (unzip this first)
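Since the profile data ships as a zip archive, a minimal sketch of unzipping and loading it from R may help. The helper name `load_profiles` is hypothetical, and the code assumes the archive contains a single `.csv` file and that it sits in the working directory; neither detail is confirmed here.

```r
# Hypothetical helper: unzip the archive and read the CSV it contains.
load_profiles <- function(zip_path = "profiles.csv.zip", exdir = tempdir()) {
  if (!file.exists(zip_path)) {
    stop("Cannot find ", zip_path, "; download it from the repository first.")
  }
  # unzip() returns the paths of the files it extracted.
  extracted <- unzip(zip_path, exdir = exdir)
  # Assumption: the archive holds one data file ending in .csv.
  # Entries under __MACOSX (macOS metadata) are skipped if present.
  is_csv <- grepl("\\.csv$", extracted) & !grepl("__MACOSX", extracted)
  csv_path <- extracted[is_csv][1]
  if (is.na(csv_path)) stop("No .csv file found inside ", zip_path)
  read.csv(csv_path, stringsAsFactors = FALSE)
}

# Only runs when the archive is present in the working directory.
if (file.exists("profiles.csv.zip")) {
  profiles <- load_profiles()
  str(profiles, list.len = 5)  # peek at the first few variables
}
```

Extracting into `tempdir()` keeps the repository folder clean; pass `exdir = "."` to keep the unzipped CSV next to the archive instead.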
Note:
- Permission to use this data set was explicitly granted by OkCupid.
- Usernames are not included.
- The JSE.Rnw Sweave document was compiled using the knitr package. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.
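Outside RStudio, the same compilation can probably be done from the R console. The sketch below uses `knitr::knit2pdf()`; the wrapper name `compile_jse` is hypothetical, and it assumes a working LaTeX installation plus the unzipped data in the working directory.

```r
# Hypothetical helper: weave an Rnw file with knitr and compile it to PDF.
compile_jse <- function(rnw = "JSE.Rnw") {
  if (!file.exists(rnw)) stop("Cannot find ", rnw)
  if (!requireNamespace("knitr", quietly = TRUE)) {
    stop("Install knitr first: install.packages(\"knitr\")")
  }
  # knit2pdf() knits the document and then runs LaTeX on the result.
  knitr::knit2pdf(rnw)
}

# Usage, from the repository root:
# compile_jse()
```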
A mosaicplot of the cross-classification of the 59946 users' sex and sexual orientation:
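A plot of this kind can be sketched with base R's `mosaicplot()`. The example below runs on simulated toy data, not the real profiles: the column names `sex` and `orientation`, the category labels, and the sampling proportions are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 500  # toy sample size, not the real 59946 users

# Simulated stand-in for the profile data (arbitrary proportions).
toy <- data.frame(
  sex = factor(sample(c("f", "m"), n, replace = TRUE), levels = c("f", "m")),
  orientation = factor(
    sample(c("bisexual", "gay", "straight"), n, replace = TRUE,
           prob = c(0.1, 0.1, 0.8)),
    levels = c("bisexual", "gay", "straight")
  )
)

# Cross-classify the two categorical variables, then draw the mosaic.
tab <- table(toy$sex, toy$orientation, dnn = c("sex", "orientation"))
print(tab)
mosaicplot(tab, main = "Sex vs. sexual orientation (simulated)", color = TRUE)
```

With the real data loaded into a data frame, replacing `toy` with that data frame should give the corresponding plot, provided the column names match.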
Linear regression (in red) and logistic regression (in blue) compared. Note both the x-axis (height) and y-axis (is female: 1 if user is female, 0 if user is male) have random jitter added to better visualize the number of points involved for each (height x gender) pair.
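The comparison can be illustrated as follows, again on simulated data. The heights (in inches), group means, and standard deviation are assumptions for the sketch and are not taken from the profiles; the variable names `height` and `is_female` are likewise illustrative.

```r
set.seed(2)
n <- 400

# Simulated toy data: 1 = female, 0 = male, with assumed height distributions.
is_female <- rbinom(n, 1, 0.5)
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
toy <- data.frame(height, is_female)

# Linear regression treats the 0/1 outcome as numeric;
# logistic regression models the probability of being female.
lin_fit <- lm(is_female ~ height, data = toy)
log_fit <- glm(is_female ~ height, data = toy, family = binomial)

# Jitter both axes so overlapping (height, is_female) pairs become visible.
plot(jitter(toy$height), jitter(toy$is_female, amount = 0.05),
     xlab = "height (in)", ylab = "is female", pch = 20, col = "grey50")
abline(lin_fit, col = "red", lwd = 2)

# Evaluate the fitted logistic curve on a grid of heights.
grid_h <- seq(min(toy$height), max(toy$height), length.out = 200)
lines(grid_h,
      predict(log_fit, newdata = data.frame(height = grid_h), type = "response"),
      col = "blue", lwd = 2)
```

The red line can leave the [0, 1] range at extreme heights, while the blue curve stays inside it, which is the usual argument for preferring logistic regression with a binary outcome.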
Fitted probabilities p-hat of each user being female, along with a decision threshold (in red) used to predict whether the user is female.
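Thresholding the fitted probabilities might look like the sketch below, which again uses simulated heights. The threshold value of 0.5 is an assumption for illustration, as are the toy data and variable names.

```r
set.seed(3)
n <- 400

# Simulated toy data (assumed height distributions, in inches).
is_female <- rbinom(n, 1, 0.5)
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
fit <- glm(is_female ~ height, family = binomial)

# Fitted probabilities p-hat that each simulated user is female.
p_hat <- fitted(fit)

# Assumed decision threshold: predict "female" when p-hat exceeds it.
threshold <- 0.5
pred_female <- as.integer(p_hat > threshold)

# Histogram of p-hat with the threshold marked in red.
hist(p_hat, breaks = 20, xlab = "p-hat", main = "Fitted probabilities (simulated)")
abline(v = threshold, col = "red", lwd = 2)

# Compare predictions with the true labels.
confusion <- table(truth = is_female, predicted = pred_female)
print(confusion)
accuracy <- mean(pred_female == is_female)
print(accuracy)
```

Moving `threshold` up or down trades one kind of misclassification for the other, which is visible in the off-diagonal cells of `confusion`.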