---
title: "R4DS 7-Exploratory Data Analysis"
output:
  ioslides_presentation:
    incremental: yes
  beamer_presentation:
    incremental: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Exploratory Data Analysis (EDA)
## Exploratory Data Analysis (EDA)
1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
## 7.1.1 Prerequisites
```{r}
library(tidyverse)
```
## 7.2 Questions
- EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.
- It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset.
- On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
# 7.3 Variation
# 7.3.1 Visualising distributions
## Categorical var:
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
## Categorical var:
```{r}
diamonds %>%
count(cut)
```
## Continuous var:
```{r}
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```
## Continuous var:
```{r}
diamonds %>%
count(cut_width(carat, 0.5))
```
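## Continuous var: {.smaller}
To see how `cut_width()` assigns values to bins, it may help to try it on a handful of numbers. This is a minimal sketch; the vector `carats_toy` is made up for illustration and is not part of the diamonds data.

```{r}
library(ggplot2)  # provides cut_width()

# Hypothetical carat values, chosen by hand for illustration
carats_toy <- c(0.2, 0.3, 0.7, 1.0)

# With width 0.5 and default settings, the bins are centred on multiples of 0.5,
# so the bin edges fall at -0.25, 0.25, 0.75 and 1.25
bins <- cut_width(carats_toy, 0.5)

# Number of values per bin, the same kind of tally that count() reports
bin_counts <- table(bins)
bin_counts
```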
## Continuous var:
Zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
```{r, eval=FALSE}
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```
## Continuous var:
```{r, echo=FALSE}
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```
## Continuous var: {.smaller}
To overlay multiple histograms in the same plot, use `geom_freqpoly()`, which draws lines instead of bars:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
```
## 7.3.2 Typical values
Look for anything unexpected:
- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?
## 7.3.2 Typical values
As an example, the histogram (next slide) suggests several interesting questions:
- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
- Why are there no diamonds bigger than 3 carats?
## 7.3.2 Typical values {.smaller}
```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
```
## 7.3.2 Typical values
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
- How are the observations within each cluster similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?
## 7.3.2 Typical values {.smaller}
The histogram below shows the length (in mins) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 mins) and long eruptions (4-5 mins), but little in between.
```{r, echo=FALSE}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
```
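## 7.3.2 Typical values {.smaller}
One rough way to describe the two clusters is to split the eruptions at a cut-off and summarise each group. The 3-minute threshold below is an assumption read off the histogram, not a value from the dataset's documentation.

```{r}
# Label each eruption as short or long; the 3-minute cut-off is an assumed threshold
eruption_type <- ifelse(faithful$eruptions < 3, "short", "long")

# How many eruptions fall in each cluster?
cluster_sizes <- table(eruption_type)
cluster_sizes

# Mean eruption length within each cluster
cluster_means <- tapply(faithful$eruptions, eruption_type, mean)
cluster_means
```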
# 7.3.3 Unusual values
## 7.3.3 Unusual values
- Outliers are observations that are unusual; data points that don’t seem to fit the pattern.
- Sometimes outliers are data entry errors; other times outliers suggest important new science.
- When you have a lot of data, outliers are sometimes difficult to see in a histogram.
- For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.
## 7.3.3 Unusual values {.smaller}
```{r}
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```
## 7.3.3 Unusual values {.smaller}
```{r}
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50)) # zoom in
```
## 7.3.3 Unusual values {.smaller}
```{r}
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
```
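The `x`, `y`, and `z` variables measure the three dimensions of each diamond in mm. A diamond cannot have a width of 0 mm, and measurements of 32 mm and 59 mm are implausibly large, so these values must be incorrect.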
# 7.4 Missing values
## 7.4 Missing values
1. Drop the entire row with the strange values:
```{r}
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
```
2. Replace the unusual values with missing values (a bit better):
```{r}
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
```
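## 7.4 Missing values {.smaller}
The difference between the two options is easiest to see on a small example. The tibble `toy` below is hypothetical; only the `y < 3 | y > 20` rule comes from the slides.

```{r}
library(dplyr)

# Hypothetical measurements: two plausible widths and two impossible ones
toy <- tibble(price = c(500, 900, 1200, 2000), y = c(0, 4.5, 5.1, 31.8))

# Option 1: dropping rows also discards the valid price values
dropped <- toy %>% filter(between(y, 3, 20))

# Option 2: only the suspect measurement becomes NA; the rest of the row is kept
replaced <- toy %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))

nrow(dropped)           # 2 rows remain
sum(is.na(replaced$y))  # 2 values set to NA, all 4 rows remain
```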
## 7.4 Missing values {.smaller}
```{r, out.width='80%'}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point() # show warning
```
## 7.4 Missing values {.smaller}
```{r, out.width='80%'}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE) # suppress the warnings
```
## 7.4 Missing values
- In `nycflights13::flights`, missing values in the dep_time variable indicate that the flight was cancelled.
- So you might want to compare the scheduled departure times for cancelled and non-cancelled flights.
- You can do this by making a new variable with `is.na()`.
## 7.4 Missing values
```{r, eval = FALSE}
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
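## 7.4 Missing values {.smaller}
The `mutate()` call above converts times stored as `HHMM` integers into decimal hours. A worked example with one made-up value, 1545 (that is, 15:45), shows the arithmetic:

```{r}
sched <- 1545                 # hypothetical scheduled departure time, 15:45

sched_hour <- sched %/% 100   # integer division keeps the hours: 15
sched_min  <- sched %% 100    # the remainder keeps the minutes: 45

decimal_hours <- sched_hour + sched_min / 60
decimal_hours                 # 15.75
```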
## 7.4 Missing values {.smaller}
```{r, echo=FALSE}
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.
# 7.5 Covariation
## 7.5 Covariation
- If variation describes the behavior within a variable, covariation describes the behavior between variables.
- Covariation is the tendency for the values of two or more variables to vary together in a related way.
- The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.
## 7.5.1 Categorical-continuous variable {.smaller}
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
## 7.5.1 Categorical-continuous variable {.smaller}
It’s hard to see the difference in distribution because the overall counts differ so much:
```{r, out.width='90%'}
ggplot(diamonds) + geom_bar(mapping = aes(x = cut))
```
## 7.5.1 Categorical-continuous variable {.smaller}
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying **count**, we’ll display **density**:
```{r, out.width='60%'}
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + # after_stat() replaces the deprecated ..density.. notation
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```
There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!
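## 7.5.1 Categorical-continuous variable {.smaller}
Density rescales the counts so that the area under each frequency polygon is one. The sketch below checks this by hand; it bins with `cut_width()` and `boundary = 0`, which is an assumption and may not match the exact bin edges that `geom_freqpoly()` picks.

```{r}
library(dplyr)
library(ggplot2)

binwidth <- 500

price_density <- diamonds %>%
  mutate(bin = cut_width(price, binwidth, boundary = 0)) %>%
  count(cut, bin) %>%
  group_by(cut) %>%
  # density = share of the cut's diamonds in the bin, divided by the bin width
  mutate(density = n / sum(n) / binwidth)

# Each cut's densities integrate to 1, whatever its total count
areas <- price_density %>%
  summarise(area = sum(density * binwidth))
areas
```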
## Boxplot {.smaller}
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. Each boxplot consists of:
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. the 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether it is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
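## Boxplot {.smaller}
The pieces of a boxplot can be computed by hand. This is a sketch on a made-up vector `x`, using `quantile()` with its default settings; it is meant to illustrate the 1.5 × IQR rule, not to reproduce the internals of `geom_boxplot()`.

```{r}
# Made-up values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, c(0.25, 0.5, 0.75))   # box edges and median
iqr <- q[["75%"]] - q[["25%"]]

# Points more than 1.5 * IQR beyond the box are drawn individually
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers reach the farthest points that are not outliers
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

outliers   # 30
whiskers   # 1 9
```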
## Boxplot
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/153b9af53b33918353fda9b691ded68cd7f62f51/5b616/images/eda-boxplot.png')
```
## Boxplot
```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
```
## Boxplot
`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
## Boxplot {.smaller}
For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:
```{r, out.width='80%'}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```
## Boxplot {.smaller}
To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:
```{r, out.width='80%'}
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
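## Boxplot {.smaller}
What `reorder()` changes is the order of the factor levels, which is what ggplot2 uses to lay out the x-axis. A small sketch, with hypothetical vectors `grp` and `val`:

```{r}
# Hypothetical groups and values, for illustration only
grp <- c("a", "b", "c", "a", "b", "c")
val <- c(3, 1, 2, 5, 1, 2)

# Group medians: a = 4, b = 1, c = 2
reordered <- reorder(grp, val, FUN = median)

levels(factor(grp))   # alphabetical: "a" "b" "c"
levels(reordered)     # by median:    "b" "c" "a"
```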
## Boxplot {.smaller}
If you have long variable names, `geom_boxplot()` will work better if you flip it 90°.
```{r, out.width='80%'}
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
## 7.5.2 Two categorical variables {.smaller}
```{r, out.width='90%'}
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
```
## 7.5.2 Two categorical variables {.smaller}
Another approach is to compute the count with `dplyr`:
```{r}
diamonds %>%
count(color, cut)
```
## 7.5.2 Two categorical variables {.smaller}
Then visualise with `geom_tile()` and the fill aesthetic:
```{r, out.width='70%'}
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
```
## 7.5.3 Two continuous variables {.smaller}
```{r, out.width='90%'}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
```
## 7.5.3 Two continuous variables {.smaller}
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot. One fix is to add transparency with the `alpha` aesthetic:
```{r, out.width='80%'}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```
## 7.5.3 Two continuous variables {.smaller}
```{r, out.width='50%', fig.show='hold'}
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
```
## 7.5.3 Two continuous variables {.smaller}
Another option is to bin one continuous variable so it acts like a categorical variable.
```{r, out.width='80%'}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```
## 7.5.3 Two continuous variables {.smaller}
Another approach is to display approximately the same number of points in each bin.
```{r, out.width='80%'}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```
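## 7.5.3 Two continuous variables {.smaller}
The two binning helpers behave quite differently on skewed data. The sketch below uses a made-up right-skewed vector (the squares of 1 to 100) rather than the diamonds data, so the bin counts are easy to check.

```{r}
library(ggplot2)  # provides cut_width() and cut_number()

# Hypothetical right-skewed values
x <- (1:100)^2

# Equal-width bins: crowded at the low end, sparse at the high end
width_counts <- table(cut_width(x, 2500))

# Equal-count bins: widths vary so each bin holds the same number of values
number_counts <- table(cut_number(x, 4))

width_counts
number_counts
```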
## 7.6 Patterns and models
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
## 7.6 Patterns and models {.smaller}
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters.
```{r, out.width='70%'}
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
```
## Patterns
- Patterns provide one of the most useful tools for data scientists because they reveal covariation.
- If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.
- If two variables covary, you can use the values of one variable to make better predictions about the values of the second.
- If the covariation is due to a **causal** relationship (a special case), then you can use the value of one variable to control the value of the second.
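## Patterns {.smaller}
The claim that covariation reduces uncertainty can be checked on the Old Faithful data. The sketch below assumes a simple linear model is a reasonable summary of the relationship, which is an illustration rather than a recommendation; `faithful_mod` is a name chosen here.

```{r}
# Spread of eruption length when we know nothing else
sd_overall <- sd(faithful$eruptions)

# Spread left over after predicting eruption length from waiting time
faithful_mod <- lm(eruptions ~ waiting, data = faithful)
sd_residual <- sd(resid(faithful_mod))

c(overall = sd_overall, after_waiting = sd_residual)
```

The residual spread is smaller than the overall spread: knowing `waiting` narrows down the likely value of `eruptions`.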
## Models
- Models are a tool for extracting patterns out of data.
- For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.
- It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
## Models {.smaller}
The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
```{r, eval=FALSE}
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
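## Models {.smaller}
Because the model is fitted on the log scale, `exp(resid)` is the ratio of the actual price to the predicted price: a value of 2 means a diamond costs twice what its carat weight alone predicts. The sketch below checks this with base R's `resid()` and `fitted()` instead of `modelr`.

```{r}
library(ggplot2)  # provides the diamonds data

mod <- lm(log(price) ~ log(carat), data = diamonds)

# Residuals are on the log scale: log(actual price) - log(predicted price)
log_resid <- resid(mod)

# Back-transforming turns the difference of logs into a ratio
price_ratio <- diamonds$price / exp(fitted(mod))

all.equal(unname(exp(log_resid)), unname(price_ratio))
```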
## Models
```{r, echo = FALSE}
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
```
## Models {.smaller}
Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r, out.width='70%'}
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
```
# End of Chapter