-
Notifications
You must be signed in to change notification settings - Fork 6
/
R4DS_12.Rmd
568 lines (372 loc) · 17.5 KB
/
R4DS_12.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: "R4DS-12 Tidy Data"
output:
ioslides_presentation:
incremental: yes
beamer_presentation:
incremental: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{css, echo=FALSE}
pre {
max-height: 500px;
overflow-y: auto;
}
pre[class] {
max-height: 300px;
}
.scroll-100 {
max-height: 300px;
overflow-y: auto;
background-color: inherit;
}
```
# 12 Tidy Data
## Tidy Data : Introduction
- In this chapter, we will learn a consistent way to organise your data in R, an organisation called **tidy data**.
- Getting your data into this format requires some upfront work, but that work pays off in the long term.
- Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
## 12.1.1 Prerequisites {.smaller}
In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.
```{r}
library(tidyverse)
```
## 12.2 Tidy Data
- You can represent the same underlying data in multiple ways.
- The example below shows the same data organised in four different ways.
- Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organises the values in a different way.
## Ex: table1
```{r}
table1
```
## Ex: table2
```{r}
table2
```
## Ex: table3
```{r}
table3
```
## Ex: table4a and table4b {.smaller}
```{r}
# Spread across two tibbles
table4a # cases
table4b # population
```
## 12.2 Tidy Data
- These are all representations of the same underlying data, but they are not equally easy to use.
- One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
## 12.2 Tidy Data
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
## 12.2 Tidy Data
Three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png')
```
## 12.2 Tidy Data
In this example, only `table1` is tidy. It’s the only representation where each column is a variable.
Why ensure that your data is tidy? 2 main advantages:
- There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
- There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
## 12.2 Tidy Data
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`.
```{r}
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
```
## 12.2 Tidy Data
```{r}
# Compute cases per year
table1 %>%
count(year, wt = cases)
```
## 12.2 Tidy Data {.smaller}
```{r, out.width='80%', fig.align='center'}
# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
```
# 12.3 Pivoting
## 12.3 Pivoting
The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be untidy. There are 2 main reasons:
1. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.
2. Data is often organised to facilitate some use other than analysis. For example, data is often organised to make entry as easy as possible.
## 12.3 Pivoting
This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
2. One observation might be scattered across multiple rows.
## 12.3 Pivoting
Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: `pivot_longer()` and `pivot_wider()`.
## 12.3.1 Longer
A common problem is a dataset where some of the column names are not names of variables, but values of a variable.
Take table4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.
```{r}
table4a
```
## 12.3.1 Longer
To tidy a dataset like this, we need to **pivot** the offending columns into a new pair of variables. To describe that operation we need three parameters:
- The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
- The name of the variable to move the column names to. Here it is year.
- The name of the variable to move the column values to. Here it’s cases.
## 12.3.1 Longer
Together those parameters generate the call to `pivot_longer()`:
```{r}
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
## 12.3.1 Longer
Pivoting `table4` into a longer, tidy form.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png')
```
## 12.3.1 Longer
We can use `pivot_longer()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
```{r}
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
```
## 12.3.1 Longer {.smaller}
To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you’ll learn about in relational data.
```{r}
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
```
## 12.3.2 Wider {.smaller}
`pivot_wider()` is the opposite of `pivot_longer()`. You use it when an observation is scattered across multiple rows. For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.
```{r}
table2
```
## 12.3.2 Wider
To tidy this up, we first analyse the representation in similar way to `pivot_longer()`. This time, however, we only need two parameters:
- The column to take variable names from. Here, it’s `type`.
- The column to take values from. Here it’s `count`.
## 12.3.2 Wider
```{r}
table2 %>%
pivot_wider(names_from = type, values_from = count)
```
## 12.3.2 Wider
Pivoting `table2` into a “wider”, tidy form.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/8350f0dda414629b9d6c354f87acf5c5f722be43/bcb84/images/tidy-8.png')
```
*colname of `table2` should be `type` and `count`.
## 12.4 Separating and uniting
- So far you’ve learned how to tidy `table2` and `table4`, but not `table3`.
- `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`).
- To fix this problem, we’ll need the `separate()` function. You’ll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
## 12.4.1 Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take `table3`:
```{r}
table3
```
## 12.4.1 Separate
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
## 12.4.1 Separate
Separating `table3` makes it tidy
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/f6fca537e77896868fedcd85d9d01031930d76c9/637d9/images/tidy-17.png')
```
## 12.4.1 Separate
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter).
For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
```{r, eval=FALSE}
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
## 12.4.1 Separate
Look carefully at the column types: you’ll notice that `cases` and `population` are character columns.
This is the default behavior in `separate()`: it leaves the type of the column as is. We can ask `separate()` to convert to better types using `convert = TRUE`:
```{r}
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
## 12.4.1 Separate
Separate the last two digits of each year:
```{r}
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
```
*sep value is the position to split at
## 12.4.2 Unite
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You’ll need it much less frequently than `separate()`, but it’s still a useful tool.
```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/3d98d3ba019fed3f9ee328284568d4508e479ef8/0b3e6/images/tidy-18.png')
```
## 12.4.2 Unite
```{r}
table5 %>%
unite(new, century, year)
```
## 12.4.2 Unite
```{r}
table5 %>%
unite(new, century, year, sep = "")
```
## 12.5 Missing values
```{r}
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
```
2 missing values in this dataset:
- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
## 12.5 Missing values
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return)
```
## 12.5 Missing values {.smaller}
Because these explicit missing values may not be important in other representations of the data, you can set `values_drop_na = TRUE` in `pivot_longer()` to turn explicit missing values implicit:
```{r}
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
```
## 12.5 Missing values
Another important tool for making missing values explicit in tidy data is `complete()`:
```{r}
stocks %>%
complete(year, qtr)
```
## 12.5 Missing values
There’s one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
```
## 12.5 Missing values
You can fill in these missing values with `fill()`:
```{r}
treatment %>%
fill(person)
```
## 12.6 Case Study {.smaller}
The `tidyr::who` dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report.
```{r, class.output="scroll-100"}
who
```
## 12.6 Case Study
```{r}
names(who)
```
## 12.6 Case Study
- This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values.
- In short, `who` is messy, and we’ll need multiple steps to tidy it.
- Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.
## 12.6 Case Study
The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:
- It looks like `country`, `iso2`, and `iso3` are three variables that redundantly specify the country.
- `year` is clearly also a variable.
- We don’t know what all the other columns are yet, but given the structure in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`) these are likely to be values, not variables.
## 12.6 Case Study
- So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`.
- We don’t know what those values represent yet, so we’ll give them the generic name `"key"`.
- We know the cells represent the count of `cases`, so we’ll use the variable `cases`.
- There are a lot of missing values in the current representation, so for now we’ll use `values_drop_na` just so we can focus on the values that are present.
## 12.6 Case Study {.smaller}
```{r}
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
```
## 12.6 Case Study {.smaller}
We can get some hint of the structure of the values in the new `key` column by counting them:
```{r}
who1 %>%
count(key)
```
## 12.6 Case Study {.smaller}
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:
1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
2. The next two letters describe the type of TB: rel = relapse, ep = extrapulmonary TB, sn = pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative), sp = pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
4. The remaining numbers gives the age group. The dataset groups cases into 7 age groups: 014 = 0 – 14 years old, 1524 = 15 – 24 years old, 2534 = 25 – 34 years old, 3544 = 35 – 44 years old, 4554 = 45 – 54 years old, 5564 = 55 – 64 years old, 65 = 65 or older
## 12.6 Case Study
- We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel` we have `newrel` (it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps).
- You’ll learn about `str_replace()` in strings, but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.
## 12.6 Case Study
```{r}
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```
## 12.6 Case Study
```{r}
unique(who2$key)
```
## 12.6 Case Study {.smaller}
We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.
```{r}
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
```
## 12.6 Case Study
Then we might as well drop the `new` column because it’s constant in this dataset. While we’re dropping columns, let’s also drop `iso2` and `iso3` since they’re redundant.
```{r}
who3 %>%
count(new)
who4 <- who3 %>%
select(-new, -iso2, -iso3)
```
## 12.6 Case Study {.smaller}
Next we’ll separate sexage into `sex` and `age` by splitting after the first character:
```{r}
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
```
## 12.6 Case Study {.smaller}
The `who` dataset is now tidy!
Put them all together:
```{r, eval=FALSE}
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
```
# End of Chapter