README.Rmd

---
title: "OkCupid Profile Data for Intro Stats and Data Science Courses (Revised in 2021)"
author: "Albert Y. Kim and Adriana Escobedo-Land"
output: github_document
references:
- id: Kim2015
  title: OkCupid Profile Data for Introductory Statistics and Data Science Courses
  author:
  - family: Kim
    given: Albert Y. Kim
  - family: Escobedo-Land
    given: Adriana
  ISSN: 1069-1898
  volume: 23
  URL: 'http://www.amstat.org/publications/jse/v23n2/kim.pdf'
  publisher: American Statistical Association
  issue: 2
  type: article-journal
  issued:
    year: 2015
    month: 7
---

```{r, include=FALSE}
library(mosaic)
library(dplyr)
library(ggplot2)
library(knitr)
```

Data and code for [OkCupid Profile Data for Introductory Statistics and Data Science Courses](https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737), Journal of Statistics Education July 2015, Volume 23, Number 2. **The original manuscript was subsequently revised in 2021.**

* `JSE_revised.Rnw`:  `.Rnw` source document to recreate `JSE_revised.pdf` using `knitr`. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.  
* `JSE_revised.pdf`:  PDF of document
* `JSE_revised.bib`:  bibliography file
* `JSE_revised.R`:  R code used in document
* `okcupid_codebook_revised.txt`:  codebook for all variables
* `profiles_revised.csv.zip`: CSV file of revised profile data (unzip this first)
* `essays_revised_and_shuffled.zip`: CSV file of shuffled essay data (unzip this first)

## Notes

* **Revisions in 2021**: 
    + Removed the exact date the data was collected.
    + Shuffled/randomized the order of the rows of the essay data, thereby decoupling the essay data from the profiles data.
    + Removed URL's in the essay data that involved the following domains: facebook.com, instagram.com, twitter.com, pinterest.com, and flickr.com.
    + Added random noise to age variable in the profiles data.
    + Removed the following variables from the profiles data: location, last online
* **Original version from 2015**:
    + Permission to use this data set was explicitly granted by OkCupid.
    + Usernames are not included.

## Preview

### Distribution of Male and Female Heights

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=6}
# This unzips the necessary .csv file the first time you run this code.
if(!file.exists("profiles_revised.csv")){
  unzip("profiles_revised.csv.zip")
}
profiles <- read.csv(file="profiles_revised.csv", header=TRUE, stringsAsFactors=FALSE)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache}
profiles.subset <- 
  filter(profiles, height>=55 & height <=80) %>%
  mutate(
    sex = ifelse(sex == "m", "male", sex),
    sex = ifelse(sex == "f", "female", sex)
  )
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height in inches", data=profiles.subset)
```


### Joint Distribution of Sex and Sexual Orientation

A mosaicplot of the cross-classification of the `r nrow(profiles)` users' selected sex and sexual orientation:  

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=5}
sex.by.orientation <- tally(~sex + orientation, data=profiles)
mosaicplot(sex.by.orientation, main="Sex vs Orientation", las=1)
```


### Logistic Regression to Predict Sex

Linear regression (in red) and logistic regression (in blue) compared.  Note both the x-axis (height) and y-axis (is female: 1 if user is female, 0 if user is male) have random jitter added to better visualize the number of points involved for each (height x sex) pair.

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=3, fig.width=6}
profiles <- filter(profiles, height>=55 & height <=80)
set.seed(76)
profiles <- sample_n(profiles, 5995)
profiles <- mutate(profiles, is.female = ifelse(sex=="f", 1, 0))

# Linear Regression
linear.model <- lm(is.female ~ height, data=profiles)
b1 <- coef(linear.model)

# Logistic Regression
logistic.model <- glm(is.female ~ height, family=binomial, data=profiles)
b2 <- coefficients(logistic.model)

inverse.logit <- function(x, b){
  linear.equation <- b[1] + b[2]*x
  1/(1+exp(-linear.equation))
}
ggplot(data=profiles, aes(x=height, y=is.female)) +
  geom_jitter(position = position_jitter(width = .2, height=.17)) +
  scale_y_continuous(breaks=0:1) +
  theme(panel.grid.minor.y = element_blank()) +
  xlab("Height in inches") +
  ylab("Is female?") +
  geom_abline(intercept=b1[1], slope=b1[2], col="red", size=2) +
  stat_function(fun = inverse.logit, args=list(b=b2), color="blue", size=2)
```

Fitted probabilities p-hat of each user being female along witha decision threshold (in red) used to predict if user is female or not.  

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=4}
profiles$p.hat <- fitted(logistic.model)
ggplot(data=profiles, aes(x=p.hat)) +
  geom_histogram(binwidth=0.1) +
  xlab(expression(hat(p))) +
  ylab("Frequency") +
  xlim(c(0,1)) +
  geom_vline(xintercept=0.5, col="red", size=1.2)
```