-
Notifications
You must be signed in to change notification settings - Fork 90
/
03_summarising.Rmd
executable file
·725 lines (526 loc) · 29.6 KB
/
03_summarising.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
---
output:
html_document: default
pdf_document: default
---
```{r setup, include = FALSE}
options(pillar.sigfig = 4)
library(kableExtra)
```
# Summarising data
\index{summarising data@\textbf{summarising data}}
>“The Answer to the Great Question ... Of Life, the Universe and Everything ... Is ... Forty-two," said Deep Thought, with infinite majesty and calm.
>Douglas Adams, The Hitchhiker's Guide to the Galaxy
In this chapter you will find out how to:
* summarise data using: `group_by()`, `summarise()`, and `mutate()`;
* reshape data between the wide and long formats: `pivot_wider()` and `pivot_longer()`;
* `select()` columns and `arrange()` (sort) rows.
The exercises at the end of this chapter combine all of the above to give context and show you more worked examples.
## Get the data
```{r, include = FALSE}
source("healthyr_theme.R")
gbd_short <- read_csv("data/global_burden_disease_cause-year.csv")
```
Dataset: Global Burden of Disease (year, cause, sex, income, deaths)
The Global Burden of Disease dataset used in this chapter is more detailed than the one we used previously.
For each year, the total number of deaths from the three broad disease categories are also separated into sex and World Bank income categories.
This means that we have 24 rows for each year, and that the total number of deaths per year is the sum of these 24 rows:
```{r, message=F}
library(tidyverse)
gbd_full <- read_csv("data/global_burden_disease_cause-year-sex-income.csv")
# Creating a single-year tibble for printing and simple examples:
gbd2017 <- gbd_full %>%
filter(year == 2017)
```
```{r chap3-tab-gbd2017, echo = FALSE}
gbd2017 %>%
knitr::kable(booktabs = TRUE,
linesep = c(rep("", 3), "\\addlinespace"),
align = c("l", "c", "r", "l", "c", "r"),
caption = "Deaths per year (2017) from three broad disease categories, sex, and World Bank country-level income groups.") %>%
kableExtra::kable_styling(latex_options = c("hold_position"),
font_size = 10)
```
\clearpage
## Plot the data
The best way to investigate a dataset is of course to plot it.
We have added a couple of notes as comments (the lines starting with a `#`) for those who can't wait to get to the next chapter where the code for plotting will be introduced and explained in detail.
Overall, you shouldn't waste time trying to understand this code, but do look at the different groups within this new dataset.
```{r chap03-fig-gbd, fig.height=3, fig.width=6, fig.cap="Global Burden of Disease data with subgroups: cause, sex, World Bank income group."}
gbd2017 %>%
# without the mutate(... = fct_relevel())
# the panels get ordered alphabetically
mutate(income = fct_relevel(income,
"Low",
"Lower-Middle",
"Upper-Middle",
"High")) %>%
# defining the variables using ggplot(aes(...)):
ggplot(aes(x = sex, y = deaths_millions, fill = cause)) +
# type of geom to be used: column (that's a type of barplot):
geom_col(position = "dodge") +
# facets for the income groups:
facet_wrap(~income, ncol = 4) +
# move the legend to the top of the plot (default is "right"):
theme(legend.position = "top")
```
## Aggregating: `group_by()`, `summarise()`
\index{summarising data@\textbf{summarising data}!aggregation}
\index{functions@\textbf{functions}!group\_by}
\index{functions@\textbf{functions}!summarise}
Health data analysis is frequently concerned with making comparisons between groups.
Groups of genes, or diseases, or patients, or populations, etc.
An easy approach to the comparison of data by a categorical grouping is therefore essential.
We will introduce flexible functions from __tidyverse__ that you can apply in any setting.
The examples intentionally get quite involved to demonstrate the different approaches that can be used.
<!-- We finish by showing how to quickly and easily create summary tables that can be used in your work. -->
To quickly calculate the total number of deaths in 2017, we can select the column and send it into the `sum()` function:
```{r}
gbd2017$deaths_millions %>% sum()
```
But a much cleverer way of summarising data is using the `summarise()` function:
```{r}
gbd2017 %>%
summarise(sum(deaths_millions))
```
This is indeed equal to the number of deaths per year we saw in the previous chapter using the shorter version of this data (deaths from the three causes were `r gbd_short %>%filter(year == 2017) %>% pull(deaths_millions)` which adds to `r gbd_short %>%filter(year == 2017) %>% pull(deaths_millions) %>% sum()`).
`sum()` is a function that adds numbers together, whereas `summarise()` is an efficient way of creating summarised tibbles.
The main strength of `summarise()` is how it works with the `group_by()` function.
`group_by()` and `summarise()` are like cheese and wine, a perfect complement for each other, seldom seen apart.
We use `group_by()` to tell `summarise()` which subgroups to apply the calculations on.
In the above example, without `group_by()`, summarise just works on the whole dataset, yielding the same result as just sending a single column into the `sum()` function.
We can subset on the cause variable using `group_by()`:
```{r}
gbd2017 %>%
group_by(cause) %>%
summarise(sum(deaths_millions))
```
Furthermore, `group_by()` is happy to accept multiple grouping variables.
So by just copying and editing the above code, we can quickly get summarised totals across multiple grouping variables (by just adding `sex` inside the `group_by()` after `cause`):
```{r}
gbd2017 %>%
group_by(cause, sex) %>%
summarise(sum(deaths_millions))
```
## Add new columns: `mutate()`
\index{summarising data@\textbf{summarising data}!create columns}
\index{functions@\textbf{functions}!mutate}
We met `mutate()` in the last chapter.
Let's first give the summarised column a better name, e.g., `deaths_per_group`.
We can remove groupings by using `ungroup()`.
This is important to remember if you want to manipulate the dataset in its original format.
We can combine `ungroup()` with `mutate()` to add a total deaths column, which will be used below to calculate a percentage:
```{r}
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
ungroup() %>%
mutate(deaths_total = sum(deaths_per_group))
```
### Percentages formatting: `percent()`
\index{functions@\textbf{functions}!percent}
So `summarise()` condenses a tibble, whereas `mutate()` retains its current size and adds columns.
We can also add further lines to `mutate()` to calculate the percentage of each group:
```{r, message = FALSE}
# percent() function for formatting percentages come from library(scales)
library(scales)
gbd2017_summarised <- gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
ungroup() %>%
mutate(deaths_total = sum(deaths_per_group),
deaths_relative = percent(deaths_per_group/deaths_total))
gbd2017_summarised
```
The `percent()` function comes from `library(scales)` and is a handy way of formatting percentages
You must keep in mind that it changes the column from a number (denoted `<dbl>`) to a character (`<chr>`).
The `percent()` function is equivalent to:
```{r}
# using values from the first row as an example:
round(100*4.91/55.74, 1) %>% paste0("%")
```
This is convenient for final presentation of number, but if you intend to do further calculations/plot/sort the percentages just calculate them as fractions with:
```{r}
gbd2017_summarised %>%
mutate(deaths_relative = deaths_per_group/deaths_total)
```
and convert to nicely formatted percentages later with `mutate(deaths_percentage = percent(deaths_relative))`.
## `summarise()` vs `mutate()`
So far we've shown you examples of using `summarise()` on grouped data (following `group_by()`) and `mutate()` on the whole dataset (without using `group_by()`).
But here's the thing: `mutate()` is also happy to work on grouped data.
Let's save the aggregated example from above in a new tibble.
We will then sort the rows using `arrange()` based on `sex`, just for easier viewing (it was previously sorted by `cause`).
The `arrange()` function sorts the rows within a tibble:
```{r}
gbd_summarised <- gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
arrange(sex)
gbd_summarised
```
You should also notice that `summarise()` drops all variables that are not listed in `group_by()` or created inside it.
So `year`, `income`, and `deaths_millions` exist in `gbd2017`, but they do not exist in `gbd_summarised`.
We now want to calculate the percentage of deaths from each cause for each gender.
We could use `summarise()` to calculate the totals:
```{r}
gbd_summarised_sex <- gbd_summarised %>%
group_by(sex) %>%
summarise(deaths_per_sex = sum(deaths_per_group))
gbd_summarised_sex
```
But that drops the `cause` and `deaths_per_group` columns.
One way would be to now use a join on `gbd_summarised` and `gbd_summarised_sex`:
```{r}
full_join(gbd_summarised, gbd_summarised_sex)
```
Joining different summaries together can be useful, especially if the individual pipelines are quite long (e.g., over 5 lines of `%>%`).
However, it does increase the chance of mistakes creeping in and is best avoided if possible.
An alternative is to use `mutate()` with `group_by()` to achieve the same result as the `full_join()` above:
```{r}
gbd_summarised %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group))
```
So `mutate()` calculates the sums within each grouping variable (in this example just `group_by(sex)`) and puts the results in a new column without condensing the tibble down or removing any of the existing columns.
Let's combine all of this together into a single pipeline and calculate the percentages per cause for each gender:
```{r}
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group),
sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>%
arrange(sex, deaths_per_group)
```
## Common arithmetic functions - `sum()`, `mean()`, `median()`, etc.
\index{functions@\textbf{functions}!arithmetic}
\index{summarising data@\textbf{summarising data}!arithmetic functions}
Statistics is an R strength, so if there is an arithmetic function you can think of, it probably exists in R.
The most common ones are:
* `sum()`
* `mean()`
* `median()`
* `min()`, `max()`
* `sd()` - standard deviation
* `IQR()` - interquartile range
An import thing to remember relates to missing data: if any of your values is NA (not available; missing), these functions will return an NA.
Either deal with your missing values beforehand (recommended) or add the `na.rm = TRUE` argument into any of the functions to ask R to ignore missing values.
More discussion and examples around missing data can be found in Chapters \@ref(r-basics) and \@ref(chap11-h1).
```{r}
mynumbers <- c(1, 2, NA)
sum(mynumbers)
sum(mynumbers, na.rm = TRUE)
```
Overall, R's unwillingness to implicitly average over observations with missing values should be considered helpful, not an unnecessary pain.
If you don't know exactly where your missing values are, you might end up comparing the averages of different groups.
So the `na.rm = TRUE` is fine to use if quickly exploring and cleaning data, or if you've already investigated missing values and are convinced the existing ones are representative.
But it is rightfully not a default so get used to typing `na.rm = TRUE` when using these functions.
## `select()` columns
\index{summarising data@\textbf{summarising data}!select columns}
\index{functions@\textbf{functions}!select}
The `select()` function can be used to choose, rename, or reorder columns of a tibble.
For the following `select()` examples, let's create a new tibble called `gbd_2rows` by taking the first 2 rows of `gbd_full` (just for shorter printing):
```{r}
gbd_2rows <- gbd_full %>%
slice(1:2)
gbd_2rows
```
Let's `select()` two of these columns:
```{r}
gbd_2rows %>%
select(cause, deaths_millions)
```
We can also use `select()` to rename the columns we are choosing:
```{r}
gbd_2rows %>%
select(cause, deaths = deaths_millions)
```
The function `rename()` is similar to `select()`, but it keeps all variables whereas `select()` only kept the ones we mentioned:
```{r}
gbd_2rows %>%
rename(deaths = deaths_millions)
```
`select()` can also be used to reorder the columns in your tibble. Moving columns around is not relevant in data analysis (as any of the functions we showed you above, as well as plotting, only look at the column names, and not their positions in the tibble), but it is useful for organising your tibble for easier viewing.
So if we use select like this:
```{r}
gbd_2rows %>%
select(year, sex, income, cause, deaths_millions)
```
The columns are reordered.
If you want to move specific column(s) to the front of the tibble, do:
```{r}
gbd_2rows %>%
select(year, sex, everything())
```
And this is where the true power of `select()` starts to come out.
In addition to listing the columns explicitly (e.g., `mydata %>% select(year, cause...)`) there are several special functions that can be used inside `select()`.
These special functions are called select helpers, and the first select helper we used is `everything()`.
The most common select helpers are `starts_with()`, `ends_with()`, `contains()`, `matches()` (but there are several others that may be useful to you, so press F1 on `select()` for a full list, or search the web for more examples).
Let's say you can't remember whether the deaths column was called `deaths_millions` or just `deaths` or `deaths_mil`, or maybe there are other columns that include the word "deaths" that you want to `select()`:
```{r}
gbd_2rows %>%
select(starts_with("deaths"))
```
Note how "deaths" needs to be quoted inside `starts_with()` - as it's a word to look for, not the real name of a column/variable.
## Reshaping data - long vs wide format
\index{summarising data@\textbf{summarising data}!long vs wide data}
So far, all of the examples we've shown you have been using 'tidy' data.
Data is 'tidy' when it follows a couple of rules: *each variable is in its own column*, and *each observation is in its own row*.
Making data 'tidy' often means transforming the table from a "wide" format into a "long" format.
Long format is efficient to use in data analysis and visualisation and can also be considered "computer readable".
But sometimes when presenting data in tables for humans to read, or when collecting data directly into a spreadsheet, it can be convenient to have data in a wide format.
Data is 'wide' when *some or all of the columns are levels of a factor*.
An example makes this easier to see.
```{r, message = FALSE}
gbd_wide <- read_csv("data/global_burden_disease_wide-format.csv")
gbd_long <- read_csv("data/global_burden_disease_cause-year-sex.csv")
```
```{r chap3-tab-gbd-wide, echo = FALSE}
gbd_wide %>%
knitr::kable(booktabs = TRUE,
linesep = c(rep("", 3), "\\addlinespace"),
align = c("l", "c", "c", "c", "c"),
caption = "Global Burden of Disease data in human-readable wide format. This is not tidy data.") %>%
kableExtra::kable_styling(latex_options = c("scale_down", "hold_position"))
```
```{r chap3-tab-gbd-long, echo = FALSE}
gbd_long %>%
knitr::kable(booktabs = TRUE,
linesep = c(rep("", 3), "\\addlinespace"),
align = c("l", "c", "c", "c", "c"),
caption = "Global Burden of Disease data in analysis-friendly long format. This is tidy data.") %>%
kableExtra::kable_styling(latex_options = c("hold_position"),
font_size = 10)
```
Tables \@ref(tab:chap3-tab-gbd-long) and \@ref(tab:chap3-tab-gbd-wide) contain the exact same information, but in long (tidy) and wide formats, respectively.
### Pivot values from rows into columns (wider)
\index{summarising data@\textbf{summarising data}!convert long to wide}
\index{functions@\textbf{functions}!pivot\_wider}
If we want to take the long data from \@ref(tab:chap3-tab-gbd-long) and put some of the numbers next to each other for easier visualisation, then `pivot_wider()` from the __tidyr__ package is the function to do it.
It means we want to send a variable into columns, and it needs just two arguments: the variable we want to become the new columns, and the variable where the values currently are.
```{r}
gbd_long %>%
pivot_wider(names_from = year, values_from = deaths_millions)
```
This means we can quickly eyeball how the number of deaths has changed from 1990 to 2017 for each cause category and sex.
Whereas if we wanted to quickly look at the difference in the number of deaths for females and males, we can change the `names_from = ` argument from ` = years` to ` = sex`.
Furthermore, we can also add a `mutate()` to calculate the difference:
```{r}
gbd_long %>%
pivot_wider(names_from = sex, values_from = deaths_millions) %>%
mutate(Male - Female)
```
All of these differences are positive which means every year, more men die than women.
Which make sense, as more boys are born than girls.
And what if we want to look at both `year` and `sex` at the same time, so to create Table \@ref(tab:chap3-tab-gbd-wide) from Table \@ref(tab:chap3-tab-gbd-long)?
No problem, `pivot_wider()` can deal with multiple variables at the same time, `names_from = c(sex, year)`:
```{r}
gbd_long %>%
pivot_wider(names_from = c(sex, year), values_from = deaths_millions)
```
`pivot_wider()` has a few optional arguments that may be useful for you.
For example, `pivot_wider(..., values_fill = 0)` can be used to fill empty cases (if you have any) with a value you specified.
Or `pivot_wider(..., names_sep = ": ")` can be used to change the separator that gets put between the values (e.g., you may want "Female: 1990" instead of the default "Female_1990").
Remember that pressing F1 when your cursor is on a function opens it up in the Help tab where these extra options are listed.
### Pivot values from columns to rows (longer)
\index{summarising data@\textbf{summarising data}!convert wide to long}
\index{functions@\textbf{functions}!pivot\_longer}
The inverse of `pivot_wider()` is `pivot_longer()`.
If you're lucky enough, your data comes from a proper database and is already in the long and tidy format.
But if you do get landed with something that looks like Table \@ref(tab:chap3-tab-gbd-wide), you'll need to know how to wrangle the variables currently spread across different columns into the tidy format (where each column is a variable, each row is an observation).
`pivot_longer()` can be a little bit more difficult to use as you need to describe all the columns to be collected using a `select_helper`. Run `?select_helpers and click on the first result in the Help tab for a reminder.
For example, here we want to collect all the columns that include the words Female or Male, the select helper for it is `matches("Female|Male")`:
```{r}
gbd_wide %>%
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
slice(1:6)
```
You're probably looking at the example above and thinking that's all nice and simple on this miniature example dataset, but how on earth will I figure this out on a real-world example.
And you're right, we won't deny that `pivot_longer()` is one of the most technically complicated functions in this book, and it can take a lot of trial and error to get it to work.
How to get started with your own `pivot_longer()` transformation is to first play with the `select()` function to make sure you are telling R exactly which columns to pivot into the longer format.
For example, before working out the `pivot_longer()` code for the above example, we would figure this out first:
```{r}
gbd_wide %>%
select(matches("Female|Male"))
```
Then, knowing that `matches("Female|Male")` works as expected inside our little `select()` test, we can copy-paste it into `pivot_longer()` and add the `names_to` and `values_to` arguments. Both of these arguments are new column names that you can make up (in the above example, we are using "sex_year" and "deaths_millions").
### `separate()` a column into multiple columns
While `pivot_longer()` did a great job fetching the different observations that were spread across multiple columns into a single one, it's still a combination of two variables - sex and year.
We can use the `separate()` function to deal with that.
```{r}
gbd_wide %>%
# same pivot_longer as before
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
```
We've also added `convert = TRUE` to `separate()` so `year` would get converted into a numeric variable.
The combination of, e.g., "Female-1990" is a character variable, so after separating them both `sex` and `year` would still be classified as characters.
But the `convert = TRUE` recognises that `year` is a number and will appropriately convert it into an integer.
## `arrange()` rows
\index{summarising data@\textbf{summarising data}!arrange / order rows}
\index{functions@\textbf{functions}!arrange}
The `arrange()` function sorts rows based on the column(s) you want. By default, it arranges the tibble in ascending order:
```{r}
gbd_long %>%
arrange(deaths_millions) %>%
# first 3 rows just for printing:
slice(1:3)
```
For numeric variables, we can just use a `-` to sort in descending order:
```{r}
gbd_long %>%
arrange(-deaths_millions) %>%
slice(1:3)
```
The `-` doesn't work for categorical variables; they need to be put in `desc()` for arranging in descending order:
```{r}
gbd_long %>%
arrange(desc(sex)) %>%
# printing rows 1, 2, 11, and 12
slice(1,2, 11, 12)
```
### Factor levels
\index{summarising data@\textbf{summarising data}!factor levels}
\index{functions@\textbf{functions}!fct\_relevel}
\index{functions@\textbf{functions}!levels}
`arrange()` sorts characters alphabetically, whereas factors will be sorted by the order of their levels.
Let's make the cause column into a factor:
```{r}
gbd_factored <- gbd_long %>%
mutate(cause = factor(cause))
```
When we first create a factor, its levels will be ordered alphabetically:
```{r}
gbd_factored$cause %>% levels()
```
But we can now use `fct_relevel()` inside `mutate()` to change the order of these levels:
```{r}
gbd_factored <- gbd_factored %>%
mutate(cause = cause %>%
fct_relevel("Injuries"))
gbd_factored$cause %>% levels()
```
`fct_relevel()` brings the level(s) listed in it to the front.
So if we use `arrange()` on `gbd_factored`, the `cause` column will be sorted based on the order of its levels, not alphabetically.
This is especially useful in two places:
* plotting - categorical variables that are characters will be ordered alphabetically (e.g., think barplots), regardless of whether the rows are arranged or not;
* statistical tests - the reference level of categorical variables that are characters is the alphabetically first (e.g., what the odds ratio is relative to).
However, making a character column into a factor gives us power to give its levels a non-alphabetical order, giving us control over plotting order or defining our reference levels for use in statistical tests.
## Exercises
### Exercise - `pivot_wider()`
Using the GBD dataset with variables `cause`, `year` (1990 and 2017 only), `sex` (as shown in Table \@ref(tab:chap3-tab-gbd-long)):
```{r, message = FALSE}
gbd_long <- read_csv("data/global_burden_disease_cause-year-sex.csv")
```
Use `pivot_wider()` to put the `cause` variable into columns using the `deaths_millions` as values:
```{r, echo = FALSE}
gbd_long %>%
pivot_wider(names_from = cause, values_from = deaths_millions) %>%
knitr::kable(booktabs = TRUE,
linesep = c(rep("", 3), "\\addlinespace"),
align = c("l", "c", "c", "c", "c"),
caption = "Exercise: putting the cause variable into the wide format.") %>%
kableExtra::kable_styling(latex_options = c("scale_down", "hold_position"))
```
**Solution**
```{r, message = FALSE, results = 'hide'}
gbd_long <- read_csv("data/global_burden_disease_cause-year-sex.csv")
gbd_long %>%
pivot_wider(names_from = cause, values_from = deaths_millions)
```
### Exercise - `group_by()`, `summarise()`
Read in the full GBD dataset with variables `cause`, `year`, `sex`, `income`, `deaths_millions`.
```{r, message = FALSE}
gbd_full <- read_csv("data/global_burden_disease_cause-year-sex-income.csv")
glimpse(gbd_full)
```
Year 2017 of this dataset was shown in Table \@ref(tab:chap3-tab-gbd2017), the full dataset has seven times as many observations as Table \@ref(tab:chap3-tab-gbd2017) since it includes information about multiple years: `r gbd_full$year %>% unique()`.
Investigate these code examples:
```{r}
summary_data1 <-
gbd_full %>%
group_by(year) %>%
summarise(total_per_year = sum(deaths_millions))
summary_data1
summary_data2 <-
gbd_full %>%
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions))
summary_data2
```
You should recognise that:
* `summary_data1` includes the total number of deaths per year.
* `summary_data2` includes the number of deaths per cause per year.
* `summary_data1 <-` means we are creating a new tibble called `summary_data1` and saving (`<-`) results into it. If `summary_data1` was a tibble that already existed, it would get overwritten.
* `gbd_full` is the data being sent to the `group_by()` and then `summarise()` functions.
* `group_by()` tells `summarise()` that we want aggregated results for each year.
* `summarise()` then creates a new variable called `total_per_year` that sums the deaths from each different observation (subcategory) together.
* Calling `summary_data1` on a separate line gets it printed.
* We then do something similar in `summary_data2`.
Compare the number of rows (observations) and number of columns (variables) of `gbd_full`, `summary_data1`, and `summary_data2`.
You should notice that:
* `summary_data2` has exactly 3 times as many rows (observations) as `summary_data1`. Why?
* `gbd_full` has 5 variables, whereas the summarised tibbles have 2 and 3. Which variables got dropped? How?
**Answers**
* `gbd_full` has `r nrow(gbd_full)` observations (rows),
* `summary_data1` has `r nrow(summary_data1)`,
* `summary_data2` has `r nrow(summary_data2)`.
`summary_data1` was grouped by year, therefore it includes a (summarised) value for each year in the original dataset.
`summary_data2` was grouped by year and cause (`r gbd_full$cause %>% unique()`), so it has 3 values for each year.
The columns a `summarise()` function returns are: variables listed in `group_by()` + variables created inside `summarise()` (e.g., in this case `deaths_peryear`). All others get aggregated.
### Exercise - `full_join()`, `percent()`
For each cause, calculate its percentage to total deaths in each year.
Hint: Use `full_join()` on `summary_data1` and `summary_data2`, and then use `mutate()` to add a new column called `percentage`.
Example result for a single year:
```{r, echo = FALSE}
library(scales)
full_join(summary_data1, summary_data2) %>%
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
filter(year == 1990)
```
**Solution**
```{r message=FALSE, results='hide'}
library(scales)
full_join(summary_data1, summary_data2) %>%
mutate(percentage = percent(total_per_cause/total_per_year))
```
### Exercise - `mutate()`, `summarise()`
Instead of creating the two summarised tibbles and using a `full_join()`, achieve the same result as in the previous exercise with a single pipeline using `summarise()` and then `mutate()`.
Hint: you have to do it the other way around, so `group_by(year, cause) %>% summarise(...)` first, then `group_by(year) %>% mutate()`.
Bonus: `select()` columns `year`, `cause`, `percentage`, then `pivot_wider()` the `cause` variable using `percentage` as values.
**Solution**
```{r}
gbd_full %>%
# aggregate to deaths per cause per year using summarise()
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions)) %>%
# then add a column of yearly totals using mutate()
group_by(year) %>%
mutate(total_per_year = sum(total_per_cause)) %>%
# add the percentage column
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
# select the final variables for better vieweing
select(year, cause, percentage) %>%
pivot_wider(names_from = cause, values_from = percentage)
```
Note that your pipelines shouldn't be much longer than this, and we often save interim results into separate tibbles for checking (like we did with `summary_data1` and `summary_data2`, making sure the number of rows are what we expect and spot checking that the calculation worked as expected).
> R doesn't do what you want it to do, it does what you ask it to do. Testing and spot checking is essential as you will make mistakes. We sure do.
Do not feel like you should be able to just bash out these clever pipelines without a lot of trial and error first.
### Exercise - `filter()`, `summarise()`, `pivot_wider()`
Still working with `gbd_full`:
* Filter for 1990.
* Calculate the total number of deaths in the different income groups (`r gbd_full$income %>% unique()`). Hint: use `group_by(income)` and `summarise(new_column_name = sum(variable))`.
* Calculate the total number of deaths within each income group for males and females. Hint: this is as easy as adding `, sex` to `group_by(income)`.
* `pivot_wider()` the `income` column.
**Solution**
```{r}
gbd_full %>%
filter(year == 1990) %>%
group_by(income, sex) %>%
summarise(total_deaths = sum(deaths_millions)) %>%
pivot_wider(names_from = income, values_from = total_deaths)
```