-
Notifications
You must be signed in to change notification settings - Fork 68
/
README.Rmd
129 lines (106 loc) · 4.91 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: "OkCupid Profile Data for Intro Stats and Data Science Courses (Revised in 2021)"
author: "Albert Y. Kim and Adriana Escobedo-Land"
output: github_document
references:
- id: Kim2015
title: OkCupid Profile Data for Introductory Statistics and Data Science Courses
author:
- family: Kim
given: Albert Y. Kim
- family: Escobedo-Land
given: Adriana
ISSN: 1069-1898
volume: 23
URL: 'http://www.amstat.org/publications/jse/v23n2/kim.pdf'
publisher: American Statistical Association
issue: 2
type: article-journal
issued:
year: 2015
month: 7
---
```{r, include=FALSE}
library(mosaic)
library(dplyr)
library(ggplot2)
library(knitr)
```
Data and code for [OkCupid Profile Data for Introductory Statistics and Data Science Courses](https://www.tandfonline.com/doi/abs/10.1080/10691898.2015.11889737), Journal of Statistics Education July 2015, Volume 23, Number 2. **The original manuscript was subsequently revised in 2021.**
* `JSE_revised.Rnw`: `.Rnw` source document to recreate `JSE_revised.pdf` using `knitr`. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.
* `JSE_revised.pdf`: PDF of document
* `JSE_revised.bib`: bibliography file
* `JSE_revised.R`: R code used in document
* `okcupid_codebook_revised.txt`: codebook for all variables
* `profiles_revised.csv.zip`: CSV file of revised profile data (unzip this first)
* `essays_revised_and_shuffled.zip`: CSV file of shuffled essay data (unzip this first)
## Notes
* **Revisions in 2021**:
+ Removed the exact date the data was collected.
+ Shuffled/randomized the order of the rows of the essay data, thereby decoupling the essay data from the profiles data.
+ Removed URL's in the essay data that involved the following domains: facebook.com, instagram.com, twitter.com, pinterest.com, and flickr.com.
+ Added random noise to age variable in the profiles data.
+ Removed the following variables from the profiles data: location, last online
* **Original version from 2015**:
+ Permission to use this data set was explicitly granted by OkCupid.
+ Usernames are not included.
## Preview
### Distribution of Male and Female Heights
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=6}
# This unzips the necessary .csv file the first time you run this code.
if(!file.exists("profiles_revised.csv")){
unzip("profiles_revised.csv.zip")
}
profiles <- read.csv(file="profiles_revised.csv", header=TRUE, stringsAsFactors=FALSE)
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache}
profiles.subset <-
filter(profiles, height>=55 & height <=80) %>%
mutate(
sex = ifelse(sex == "m", "male", sex),
sex = ifelse(sex == "f", "female", sex)
)
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height in inches", data=profiles.subset)
```
### Joint Distribution of Sex and Sexual Orientation
A mosaicplot of the cross-classification of the `r nrow(profiles)` users' selected sex and sexual orientation:
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=5}
sex.by.orientation <- tally(~sex + orientation, data=profiles)
mosaicplot(sex.by.orientation, main="Sex vs Orientation", las=1)
```
### Logistic Regression to Predict Sex
Linear regression (in red) and logistic regression (in blue) compared. Note both the x-axis (height) and y-axis (is female: 1 if user is female, 0 if user is male) have random jitter added to better visualize the number of points involved for each (height x sex) pair.
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=3, fig.width=6}
profiles <- filter(profiles, height>=55 & height <=80)
set.seed(76)
profiles <- sample_n(profiles, 5995)
profiles <- mutate(profiles, is.female = ifelse(sex=="f", 1, 0))
# Linear Regression
linear.model <- lm(is.female ~ height, data=profiles)
b1 <- coef(linear.model)
# Logistic Regression
logistic.model <- glm(is.female ~ height, family=binomial, data=profiles)
b2 <- coefficients(logistic.model)
inverse.logit <- function(x, b){
linear.equation <- b[1] + b[2]*x
1/(1+exp(-linear.equation))
}
ggplot(data=profiles, aes(x=height, y=is.female)) +
geom_jitter(position = position_jitter(width = .2, height=.17)) +
scale_y_continuous(breaks=0:1) +
theme(panel.grid.minor.y = element_blank()) +
xlab("Height in inches") +
ylab("Is female?") +
geom_abline(intercept=b1[1], slope=b1[2], col="red", size=2) +
stat_function(fun = inverse.logit, args=list(b=b2), color="blue", size=2)
```
Fitted probabilities p-hat of each user being female along witha decision threshold (in red) used to predict if user is female or not.
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=4}
profiles$p.hat <- fitted(logistic.model)
ggplot(data=profiles, aes(x=p.hat)) +
geom_histogram(binwidth=0.1) +
xlab(expression(hat(p))) +
ylab("Frequency") +
xlim(c(0,1)) +
geom_vline(xintercept=0.5, col="red", size=1.2)
```