---
output: html_document
editor_options:
  chunk_output_type: inline
---
# R basics
```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.align = 'center')
library(tidyverse)
library(kableExtra)
mykable = function(x, caption = "CAPTION", ...){
  kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
        booktabs = TRUE, caption = caption,
        linesep = "", ...)
}
```
Throughout this book, we are conscious of the balance between theory and practice.
Some learners may prefer to see all definitions laid out before being shown an example of a new concept.
Others would rather see practical examples and explanations build up to a full understanding over time.
We strike a balance between these two approaches that works well for most people in the audience.
Sometimes we will show you an example that may use words that have not been formally introduced yet.
For example, we start this chapter with data import - R is nothing without data.
In so doing, we have to use the word "argument", which is only defined two sections later (in \@ref(chap02-objects-functions) "Objects and functions").
A few similar instances arise around statistical concepts in the Data Analysis part of the book.
You will come across sentences along the lines of "this concept will become clearer in the next section".
Trust us and just go with it.
The aim of this chapter is to familiarise you with how R works.
We will read in data and start basic manipulations. You may want to skip parts of this chapter if you already:
* have found the Import Dataset interface;
* know what numbers, characters, factors, and dates look like in R;
* are familiar with the terminology around objects, functions, arguments;
* have used the pipe: `%>%`;
* know how to filter data with operators such as `==, >, <, &, |`;
* know how to handle missing data (NAs), and why they can behave weirdly in a filter;
* have used `mutate()`, `c()`, `paste()`, `if_else()`, and the joins.
## Reading data into R{#chap02-h2-reading-data-into-r}
\index{import data}
\index{reading data}
Data usually comes in the form of a table, such as a spreadsheet or database.
In the world of the __tidyverse__, a table read into R gets called a `tibble`.
A common format in which to receive data is CSV (comma separated values).
CSV is an uncomplicated spreadsheet with no formatting.
It is just a single table with rows and columns (no worksheets or formulas).
Furthermore, you don't need special software to quickly view a CSV file - a text editor will do, and that includes RStudio.
For example, look at "example_data.csv" in the healthyr project's folder in Figure \@ref(fig:chap2-fig-examplecsv) (this is the Files pane at the bottom-right corner of your RStudio).
```{r chap2-fig-examplecsv, echo = FALSE, fig.cap="View or import a data file.", out.width="70%"}
knitr::include_graphics("images/chapter02/files_csv_example.png")
```
Clicking on a data file gives us two options: "View File" or "Import Dataset".
We will show you how to use the Import Dataset interface in a bit, but for standard CSV files, we don't usually bother with the Import interface and just type in (or copy from a previous script):
\index{functions@\textbf{functions}!read\_csv}
```{r, eval=FALSE}
library(tidyverse)
example_data <- read_csv("example_data.csv")
View(example_data)
```
There are a couple of things to say about the first R code chunk of this book.
First and foremost: do not panic.
Yes, if you're used to interacting with data by double-clicking on a spreadsheet that just opens up, then the above R code does seem a bit involved.
However, running the example above also has an immediate visual effect.
As soon as you click Run (or press Ctrl+Enter/Command+Enter), the dataset immediately shows up in your Environment and opens in a Viewer.
You can have a look and scroll through the same way you would in Excel or similar.
So what's actually going on in the R code above:
* We load the __tidyverse__ packages (as covered in the first chapter of this book).
* We have a CSV file called "example_data.csv" and are using `read_csv()` to read it into R.
* We are using the assignment arrow `<-` to save it into our Environment using the same name: `example_data`.
* The `View(example_data)` line makes it pop up for us to view it. Alternatively, click on `example_data` in the Environment to achieve the exact same thing.
More about the assignment arrow (`<-`) and naming things in R are covered later in this chapter.
Do not worry if everything is not crystal clear just now.
### Import Dataset interface
In the `read_csv()` example above, we read in a file that was in a specific (but common) format.
However, if your file uses semicolons instead of commas, or commas instead of dots, or a special number for missing values (e.g., 99), or anything else weird or complicated, then we need a different approach.
RStudio's **Import Dataset** interface (Figure \@ref(fig:chap2-fig-examplecsv)) can handle all of these and more.
```{r chap02-fig-import-tool, echo = FALSE, fig.cap="Import: Some of the special settings your data file might have.", out.width="70%"}
knitr::include_graphics("images/chapter02/import_options.png")
```
```{r chap02-fig-import-code, echo = FALSE, fig.cap="After using the Import Dataset window, copy-paste the resulting code into your script.", out.width="50%"}
knitr::include_graphics("images/chapter02/code_preview.png")
```
After selecting the specific options to import a particular file, a friendly preview window will show whether R properly understands the format of your data.
DO NOT BE tempted to press the **Import** button.
Yes, this will read in your dataset once, but means you have to reselect the options every time you come back to RStudio.
Instead, copy-paste the code (e.g., Figure \@ref(fig:chap02-fig-import-code)) into your R script.
This way you can use it over and over again.
Ensuring that all steps of an analysis are recorded in scripts makes your workflow reproducible by your future self, colleagues, supervisors, and extraterrestrials.
>The `Import Dataset` button can also help you to read in Excel, SPSS, Stata, or SAS files (instead of `read_csv()`, it will give you `read_excel()`, `read_sav()`, `read_stata()`, or `read_sas()`).
If you've used R before or are using older scripts passed by colleagues, you might see `read.csv()` rather than `read_csv()`.
Note the dot rather than the underscore.
In short, `read_csv()` is faster and more predictable, and is recommended in all new scripts.
In existing scripts that work and are tested, we do not recommend that you start replacing `read.csv()` with `read_csv()`.
For instance, `read_csv()` handles categorical variables differently^[It does not silently convert strings to factors, i.e., it defaults to `stringsAsFactors = FALSE`. For those not familiar with the terminology here - don't worry, we will cover this in just a few sections.].
An R script written using `read.csv()` might not work as expected any more if `read_csv()` is simply swapped in.
> Do not start updating and possibly breaking existing R scripts by replacing base R functions with the tidyverse equivalents we show here. Do use the modern functions in any new code you write.
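One visible difference between the two is the class of the object they return. A minimal sketch, assuming "example_data.csv" from above is in your project folder:
```{r, eval = FALSE}
df_base <- read.csv("example_data.csv")   # base R: returns a data.frame
df_tidy <- read_csv("example_data.csv")   # readr: returns a tibble
class(df_base)
class(df_tidy)  # a tibble is also a data.frame, but prints and behaves better
```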
### Reading in the Global Burden of Disease example dataset
In the next few chapters of this book, we will be using the Global Burden of Disease datasets.
The Global Burden of Disease Study (GBD) is the most comprehensive worldwide observational epidemiological study to date.
It describes mortality and morbidity from major diseases, injuries and risk factors to health at global, national and regional levels.
^[Global Burden of Disease Collaborative Network.
Global Burden of Disease Study 2017 (GBD 2017) Results.
Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2018.
Available from http://ghdx.healthdata.org/gbd-results-tool.]
GBD data are publicly available from the website.
Table \@ref(tab:chap2-tab-gbd) and Figure \@ref(fig:chap2-fig-gbd) show a high level version of the project data with just 3 variables: `cause`, `year`, `deaths_millions` (number of people who die of each cause every year).
Later, we will be using a longer dataset with different subgroups and we will show you how to summarise comprehensive datasets yourself.
```{r, message=F}
library(tidyverse)
gbd_short <- read_csv("data/global_burden_disease_cause-year.csv")
```
```{r chap2-tab-gbd, echo = FALSE}
gbd_short %>% 
  knitr::kable(booktabs = TRUE, 
               linesep = c("", "", "\\addlinespace"),
               align = c("l", "c", "r", "l", "c", "r"),
               caption = "Deaths per year from three broad disease categories (short version of the Global Burden of Disease example dataset).") %>% 
  kableExtra::kable_styling(latex_options = c("hold_position"),
                            font_size = 10)
```
```{r chap2-fig-gbd, echo = FALSE, fig.cap="Line and bar charts: Cause of death by year (GBD). Data in (B) are the same as (A) but stacked to show the total of all causes.", fig.height=6, fig.width=6}
source("1_source_theme.R")
library(patchwork)
p1 <- gbd_short %>% 
  ggplot(aes(x = year, y = deaths_millions, fill = cause, colour = cause)) +
  geom_point() +
  geom_line() +
  labs(x = "Year", y = "Deaths per year (millions)") +
  facet_wrap(~cause) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(0, 50), expand = c(0, 0)) +
  geom_text(aes(label = round(deaths_millions, 0)),
            colour = "#525252", size = 3, vjust = -0.5)

p2 <- gbd_short %>% 
  ggplot(aes(x = year, y = deaths_millions, fill = cause, colour = cause)) +
  geom_col() +
  labs(x = "Year", y = "Deaths per year (millions)") +
  theme(legend.position = "top") +
  scale_y_continuous(limits = c(0, 60), expand = c(0, 0)) +
  # another hardcoded tibble name here:
  scale_x_continuous(breaks = unique(gbd_short$year)) +
  guides(fill = guide_legend(ncol = 3))

p1 / p2 + plot_annotation(tag_levels = "A")
```
## Variable types and why we care{#chap02-vartypes}
\index{variable types@\textbf{variable types}}
\index{continuous data@\textbf{continuous data}!variable types}
\index{categorical data@\textbf{categorical data}!variable types}
\index{date-time@\textbf{date-time}}
There are three broad types of data:
* continuous (numbers), in R: numeric, double, or integer;
* categorical, in R: character, factor, or logical (TRUE/FALSE);
* date/time, in R: POSIXct date-time^[Portable Operating System Interface (POSIX) is a set of computing standards. There's nothing more to understand about this other than when R starts shouting "POSIXct this or POSIXlt that" at you, check your date and time variables].
Values within a column all have to be the same type, but a tibble can of course hold columns of different types.
Generally, R is good at figuring out what type of data you have (in programming, this 'figuring out' is called 'parsing').
For example, when reading in data, it will tell you what was assumed for each column:
```{r}
library(tidyverse)
typesdata <- read_csv("data/typesdata.csv")
typesdata
```
This means that a lot of the time you do not have to worry about those little `<chr>` vs `<dbl>` vs `<S3: POSIXct>` labels.
But in cases of irregular or faulty input data, or when doing a lot of calculations and modifications to your data, we need to be aware of these different types to be able to find and fix mistakes.
For example, consider a similar file as above but with some data entry issues introduced:
```{r}
typesdata_faulty <- read_csv("data/typesdata_faulty.csv")
typesdata_faulty
```
Notice that R parsed both the measurement and date variables as characters.
Measurement has been parsed as a character because of a data entry issue: the person taking the measurement couldn't decide which value to note down (maybe the scale was shifting between the two values) so they included both values and text "or" in the cell.
A numeric variable will also get parsed as a categorical variable if it contains certain typos, e.g., if entered as "3..7" instead of "3.7".
The reason R didn't automatically make sense of the date column is that it couldn't tell which part is the day and which is the year: `02-Jan-17` could stand for `02-Jan-2017` as well as `2002-Jan-17`.
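As a preview of the Date/time section later in this chapter, the ambiguity disappears once we spell out the component order ourselves - a minimal sketch using `dmy()` (day-month-year) from the __lubridate__ package:
```{r, message = FALSE}
library(lubridate)
# dmy() reads the components as day, then month, then year:
dmy("02-Jan-17")
```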
Therefore, while a lot of the time you do not have to worry about variable types and can just get on with your analysis, it is important to understand what the different types are to be ready to deal with them when issues arise.
>Since health datasets are generally full of categorical data, it is crucial to understand the difference between characters and factors (both are types of categorical variables in R with pros and cons).
So here we go.
### Numeric variables (continuous)
\index{variable types@\textbf{variable types}!continuous / numeric}
Numbers are straightforward to handle and don't usually cause trouble.
R usually refers to numbers as `numeric` (or `num`), but sometimes it really gets its nerd on and calls numbers `integer` or `double`.
Integers are numbers without decimal places (e.g., `1, 2, 3`), whereas `double` stands for "Double-precision floating-point" format (e.g., `1.234, 5.67890`).
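If you are curious, you can ask R directly with `typeof()`, which returns the internal storage type (the `L` suffix is how we write an integer literal):
```{r}
typeof(1)   # plain numbers default to "double"
typeof(1L)  # the L suffix creates an "integer"
```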
It doesn't usually matter whether R is classifying your continuous data `numeric/num/double/int`, but it is good to be aware of these different terms as you will see them in R messages.
Something to note about numbers is that R doesn't usually print more than 6 decimal places, but that doesn't mean they don't exist.
For example, from the `typesdata` tibble, we're taking the `measurement` column and sending it to the `mean()` function.
R then calculates the mean and tells us what it is with 6 decimal places:
```{r}
typesdata$measurement %>% mean()
```
Let's save that in a new object:
```{r}
measurement_mean <- typesdata$measurement %>% mean()
```
But when using the double equals operator to check if this is equivalent to a fixed value (you might do this when comparing to a threshold, or even another mean value), R returns `FALSE`:
```{r}
measurement_mean == 3.333333
```
Now this doesn't seem right, does it - R clearly told us just above that the mean of this variable is 3.333333 (reminder: the actual values in the measurement column are `r typesdata$measurement`).
The reason the above statement is `FALSE` is because `measurement_mean` is quietly holding more than 6 decimal places.
And it gets worse.
In this example, you may recognise that repeating decimals (0.333333...) usually mean there are more of them somewhere, and you might think that rounding with the `round()` function would make your `==` behave as expected.
Except it's not about rounding; it's about how computers store numbers with decimals.
Computers have issues with decimal numbers, and this simple example illustrates one:
```{r}
(0.10 + 0.05) == 0.15
```
This returns `FALSE`, meaning R does not seem to think that `0.10 + 0.05` is equal to `0.15`.
This issue isn't specific to R; it affects programming languages in general.
For example, Python also thinks that the sum of `0.10` and `0.05` does not equal `0.15`.
This is where the `near()` function comes in handy:
\index{functions@\textbf{functions}!near}
```{r}
library(tidyverse)
near(0.10+0.05, 0.15)
near(measurement_mean, 3.333333, 0.000001)
```
The first two arguments for `near()` are the numbers you are comparing; the third argument is the precision you are interested in.
If the numbers are equal within that precision, it returns `TRUE`.
You can omit the third argument - the precision, in this case also known as the tolerance - and `near()` will use a reasonable default tolerance value.
### Character variables
\index{variable types@\textbf{variable types}!character}
*Characters* (sometimes referred to as *strings* or *character strings*) in R are letters, words, or even whole sentences (an example of this may be free text comments).
Characters are displayed in-between `""` (or `''`).
A useful function for quickly investigating categorical variables is the `count()` function:
```{r}
library(tidyverse)
typesdata %>%
count(group)
```
`count()` can accept multiple variables and will count up the number of observations in each subgroup, e.g., `mydata %>% count(var1, var2)`.
Another helpful option to count is `sort = TRUE`, which will order the result putting the highest count (`n`) to the top.
```{r}
typesdata %>%
count(group, sort = TRUE)
```
`count()` with the `sort = TRUE` option is also useful for identifying duplicate IDs or misspellings in your data.
With this example `tibble` (`typesdata`) that only has three rows, it is easy to see that the `id` column is a unique identifier whereas the `group` column is a categorical variable.
You can check everything by just eyeballing the `tibble` using the built-in Viewer tab (click on the dataset in the Environment tab).
But for larger datasets, you need to know how to check and then clean data programmatically - you can't go through thousands of values checking they are all as intended without unexpected duplicates or typos.
For most variables (categorical or numeric), we recommend always plotting your data before starting analysis.
But to check for duplicates in a unique identifier, use `count()` with `sort = TRUE`:
```{r}
# all ids are unique:
typesdata %>%
count(id, sort = TRUE)
# we add in a duplicate row where id = ID3,
# then count again:
typesdata %>%
add_row(id = "ID3") %>%
count(id, sort = TRUE)
```
### Factor variables (categorical)
\index{variable types@\textbf{variable types}!categorical / factor}
*Factors* are fussy characters.
Factors are fussy because they include something called *levels*.
Levels are all the unique values a factor variable could take, e.g., like when we looked at `typesdata %>% count(group)`.
Using factors rather than just characters can be useful because:
* The values factor levels can take are fixed.
For example, once you tell R that `typesdata$group` is a factor with two levels: Control and Treatment, combining it with other datasets with different spellings or abbreviations for the same variable will generate a warning.
This can be helpful but can also be a nuisance when you really do want to add in another level to a `factor` variable.
* Levels have an order.
When running statistical tests on grouped data (e.g., Control vs Treatment, Adult vs Child) and the variable is just a character, not a factor, R will use the alphabetically first as the reference (comparison) level.
Converting a character column into a factor column enables us to define and change the order of its levels.
Level order affects many things including regression results and plots: by default, categorical variables are ordered alphabetically.
If we want a different order in say a bar plot, we need to convert to a factor and reorder before we plot it.
The plot will then order the groups correctly.
So overall, since health data is often categorical and has a reference (comparison) level, then factors are an essential way to work with these data in R.
Nevertheless, the fussiness of factors can sometimes be unhelpful or even frustrating.
A lot more about factor handling will be covered later (\@ref(chap08-h1)).
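For a small taste of what is to come, here is a minimal sketch of creating a factor and changing its level order (the values are made up):
```{r}
group <- c("Treatment", "Control", "Control", "Treatment")

# characters converted to a factor get alphabetical levels by default:
factor(group) %>% levels()

# specifying the levels ourselves sets the reference (first) level:
factor(group, levels = c("Treatment", "Control")) %>% levels()
```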
### Date/time variables
\index{variable types@\textbf{variable types}!date-time}
\index{functions@\textbf{functions}!dmy}
\index{functions@\textbf{functions}!ymd}
R is good for working with dates.
For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find a future date (e.g., "what's the date exactly 60 days from now?").
It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.).
Since R displays dates and times between quotes (`""`), they look similar to characters.
However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.
```{r, message = FALSE}
library(lubridate) # lubridate makes working with dates easier
current_datetime <- Sys.time()
current_datetime
my_datetime <- "2020-12-01 12:00"
my_datetime
```
When printed, the two objects - `current_datetime` and `my_datetime` - seem to have a similar format.
But if we try to calculate the difference between these two dates, we get an error:
```{r eval=FALSE, include=TRUE}
my_datetime - current_datetime
```
```{r echo=FALSE}
print("Error in `-.POSIXt`(my_datetime, current_datetime)")
```
That's because when we assigned a value to `my_datetime`, R assumed the simpler type for it - so a character.
We can check what the type of an object or variable is using the `class()` function:
```{r}
current_datetime %>% class()
my_datetime %>% class()
```
So we need to tell R that `my_datetime` does indeed include date/time information so we can then use it in calculations:
```{r}
my_datetime_converted <- ymd_hm(my_datetime)
my_datetime_converted
```
Calculating the difference will now work:
```{r}
my_datetime_converted - current_datetime
```
Since R knows this is a difference between two date/time objects, it prints them in a nicely readable way.
Furthermore, the result has its own type; it is a "difftime".
```{r}
my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff %>% class()
```
This is useful if we want to apply this time difference on another date, e.g.:
```{r}
ymd_hm("2021-01-02 12:00") + my_datesdiff
```
But what if we want to use the number of days in a normal calculation - say, a measurement increased by 560 arbitrary units during this time period.
We might want to calculate the increase per day like this:
```{r eval=FALSE, include=TRUE}
560/my_datesdiff
```
```{r echo=FALSE}
print("Error in `/.difftime`(560, my_datesdiff)")
```
Doesn't work, does it.
We need to convert `my_datesdiff` (which is a difftime value) into a numeric value by using the `as.numeric()` function:
```{r}
560/as.numeric(my_datesdiff)
```
The __lubridate__ package comes with several convenient functions for parsing dates, e.g., `ymd()`, `mdy()`, `ymd_hm()`, etc. - for a full list see [lubridate.tidyverse.org](https://lubridate.tidyverse.org).
However, if your date/time variable comes in an extra special format, then use the `parse_date_time()` function where the second argument specifies the format using the specifiers given in Table \@ref(tab:chap2-tab-timehelpers).
```{r chap2-tab-timehelpers, echo = FALSE}
tribble(
  ~Notation, ~Meaning,              ~Example,
  "%d",      "day as number",       "01-31",
  "%m",      "month as number",     "01-12",
  "%B",      "month name",          "January-December",
  "%b",      "abbreviated month",   "Jan-Dec",
  "%Y",      "4-digit year",        "2019",
  "%y",      "2-digit year",        "19",
  "%H",      "hours",               "12",
  "%M",      "minutes",             "01",
  "%S",      "seconds",             "59",
  "%A",      "weekday",             "Monday-Sunday",
  "%a",      "abbreviated weekday", "Mon-Sun") %>% 
  mykable(caption = "Date/time format specifiers.") %>% 
  kableExtra::kable_styling(font_size = 9, latex_options = "hold_position")
```
For example:
```{r}
parse_date_time("12:34 07/Jan'20", "%H:%M %d/%b'%y")
```
Furthermore, the same date/time specifiers can be used to rearrange your date and time for printing:
```{r}
Sys.time()
Sys.time() %>% format("%H:%M on %B-%d (%Y)")
```
You can even add plain text into the `format()` function, R will know to put the right date/time values where the `%` are:
```{r}
Sys.time() %>% format("Happy days, the current time is %H:%M %B-%d (%Y)!")
```
## Objects and functions {#chap02-objects-functions}
\index{objects}
\index{functions@\textbf{functions}}
There are two fundamental concepts in statistical programming that are important to get straight - objects and functions.
The most common object you will be working with is a dataset.
This is usually something with rows and columns much like the example in Table \@ref(tab:chap2-tab-examp1).
```{r chap2-tab-examp1, echo = FALSE}
# TIBBLE hardcoded again in the next chunk, make sure to change in both places!
mydata <- tibble(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3),
  var2 = c(NA, 4, 5, NA),
  var3 = c(2, 1, NA, NA)
)

mydata %>% 
  knitr::kable(booktabs = TRUE, caption = "Example of data in columns and rows, including missing values denoted NA (Not applicable/Not available). Once this dataset has been read into R it gets called a data frame/tibble.") %>% 
  kableExtra::kable_styling(font_size = 9)
```
\FloatBarrier
To get the small and made-up "dataset" into your Environment, copy and run this code^[`c()` stands for combine and will be introduced in more detail later in this chapter]:
```{r}
library(tidyverse)
mydata <- tibble(
  id   = 1:4,
  sex  = c("Male", "Female", "Female", "Male"),
  var1 = c(4, 1, 2, 3),
  var2 = c(NA, 4, 5, NA),
  var3 = c(2, 1, NA, NA)
)
```
Data can live anywhere: on paper, in a spreadsheet, in an SQL database, or in your R Environment.
We usually initiate and interface with R using RStudio, but everything we talk about here (objects, functions, environment) also works when RStudio is not available, but R is.
This can be the case if you are working on a supercomputer that can only serve the R Console and not RStudio.
### `data frame/tibble`
So, regularly shaped data in rows and columns is called a table when it lives outside R, but once you read/import it into R it gets called a tibble.
If you've used R before, or get given a piece of code that uses `read.csv()` instead of `read_csv()`, you'll have come across the term `data frame`.^[`read.csv()` comes with base R, whereas `read_csv()` comes from the `readr` package within the `tidyverse`. We recommend using `read_csv()`.]
A `tibble` is the modern/__tidyverse__ version of a data frame in R.
In most cases, `data frames` and `tibbles` work interchangeably, but `tibbles` often work better.
Another great alternative to base R `data frames` is the `data table`.
In this book, and for most of our day-to-day work these days, we will use `tibbles`.
### Naming objects
When you read data into R, you want it to show up in the Environment tab.
Everything in your Environment needs to have a name.
You will likely have many objects such as tibbles going on at the same time.
Note that tibble is what the thing is, rather than its name.
This is the 'class' of an object.
To keep our code examples easy to follow, we call our example tibble `mydata`.
In a real analysis, you should give your tibbles meaningful names, e.g., `patient_data`, `lab_results`, `annual_totals`, etc.
Object names can't have spaces in them, which is why we use the underscore (`_`) to separate words.
Object names can include numbers, but they can't start with a number: so `labdata2019` works, `2019labdata` does not.
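A quick demonstration of these naming rules (the second line is expected to fail, so we don't run this chunk):
```{r, eval = FALSE}
labdata2019 <- 103  # works
2019labdata <- 103  # Error: unexpected symbol
```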
So, the tibble named `mydata` is an example of an object that can be in the Environment of your R Session:
```{r}
mydata
```
### Function and its arguments
A function is a procedure which takes some information (input), does something to it, and passes back the modified information (output).
A simple function that can be applied to numeric data is `mean()`.
R functions always have round brackets after their name.
This is for two reasons.
First, it easily differentiates them as functions - you will get used to reading them like this.
Second, and more importantly, we can put *arguments* in these brackets.
Arguments can also be thought of as input.
In data analysis, the most common input for a function is data.
For instance, we need to give `mean()` some data to average over.
It does not make sense (nor will it work) to feed `mean()` the whole tibble with multiple columns, including patient IDs and a categorical variable (`sex`).
To quickly extract a single column, we use the `$` symbol like this:
\index{symbols@\textbf{symbols}!select column \texttt{\$}}
```{r}
mydata$var1
```
You can ignore the `## [1]` at the beginning of the extracted values - this is something that becomes more useful when printing multiple lines of data, as the number in the square brackets keeps count of how many values we have seen so far.
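For example, printing a longer vector wraps over several lines, and the bracketed number indexes the first value of each printed line (the exact wrapping depends on your console width):
```{r}
# the numbers in square brackets count along the 50 values:
1:50
```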
We can then use `mydata$var1` as the first argument of `mean()` by putting it inside its brackets:
```{r}
mean(mydata$var1)
```
which tells us that the mean of `var1` (`r mydata$var1`) is `r mean(mydata$var1)`.
In this example, `mydata$var1` is the first and only argument to `mean()`.
But what happens if we try to calculate the average value of `var2` (`r mydata$var2`) (remember, `NA` stands for Not Applicable/Available and is used to denote missing data):
```{r}
mean(mydata$var2)
```
So why does `mean(mydata$var2)` return `NA` ("not available") rather than the mean of the values included in this column?
That is because the column includes missing values (`NAs`), and R does not want to average over `NAs` implicitly.
It is being cautious - what if you didn't know there were missing values for some patients?
If you wanted to compare the means of `var1` and `var2` without any further filtering, you would be comparing samples of different sizes.
We might expect to see an `NA` if we tried to, for example, calculate the average of `sex`.
And this is indeed the case:
```{r, error=TRUE}
mean(mydata$sex)
```
Furthermore, R also gives us a pretty clear Warning suggesting it can't compute the mean of an argument that is not numeric or logical.
The sentence actually reads pretty fun, as if R was saying it was not logical to calculate the mean of something that is not numeric.
But, R is actually saying that it is happy to calculate the mean of two types of variables: numerics or logicals, but what you have passed is neither.
If you decide to ignore the NAs and want to calculate the mean anyway, you can do so by adding this argument to `mean()`:
\index{missing values remove \texttt{na.rm}}
```{r}
mean(mydata$var2, na.rm = TRUE)
```
Adding `na.rm = TRUE` tells R that you are happy for it to calculate the mean of any existing values (but to remove - `rm` - the `NA` values).
This 'removal' excludes the NAs from the calculation; it does not affect the actual tibble (`mydata`) holding the dataset.
R is case sensitive, so it is `na.rm`, not `NA.rm`, etc.
There is, however, no need to memorize how the arguments of functions are exactly spelled - this is what the Help tab is for (press `F1` when the cursor is on the name of the function).
Help pages are built into R, so an internet connection is not required for this.
> Make sure to separate multiple arguments with commas or R will give you an error of `Error: unexpected symbol`.
Finally, some functions do not need any arguments to work.
A good example is the `Sys.time()` which returns the current time and date.
This is useful when using R to generate and update reports automatically.
Including this means you can always be clear on when the results were last updated.
\index{functions@\textbf{functions}!Sys.time}
\index{system time}
```{r}
Sys.time()
```
### Working with objects
To save an object in our Environment we use the assignment arrow:
\index{symbols@\textbf{symbols}!assignment \texttt{<-}}
```{r}
a <- 103
```
This reads: the object `a` is assigned value `r a`.
`<-` is called "the arrow assignment operator", or "assignment arrow" for short.
> Keyboard shortcuts to insert `<-`:
> Windows: Alt+- (Alt and the minus sign)
> macOS: Option+- (Option and the minus sign)
You know that the assignment worked when it shows up in the Environment tab.
If we now run `a` just on its own, it gets printed back to us:
```{r}
a
```
Similarly, if we run a function without assignment to an object, it gets printed but not saved in your Environment:
\index{functions@\textbf{functions}!seq}
```{r}
seq(15, 30)
```
`seq()` is a function that creates a sequence of numbers (+1 by default) between the two arguments you pass to it in its brackets.
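The default step of +1 can be changed with `seq()`'s `by` argument, e.g.:
```{r}
# count from 15 to 30 in steps of 5:
seq(15, 30, by = 5)
```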
We can assign the result of `seq(15, 30)` into an object, let's call it `example_sequence`:
```{r}
example_sequence <- seq(15, 30)
```
Doing this creates `example_sequence` in our Environment, but it does not print it.
To get it printed, run it on a separate line like this:
```{r}
example_sequence
```
> If you save the results of an R function in an object, it does not get printed.
> If you run a function without the assignment (`<-`), its results get printed, but not saved as an object.
Finally, R doesn't mind overwriting an existing object, for example:
```{r}
example_sequence <- example_sequence/2
example_sequence
```
Notice how we then include the variable on a new line to get it printed as well as overwritten.
### `<-` and `=`
Note that many people use `=` instead of `<-`.
Both `<-` and `=` can save what is on the right into an object named on the left.
Although `<-` and `=` are interchangeable when saving an object into your Environment, they are not interchangeable when used to set a function argument.
For example, remember how we used the `na.rm` argument in the `mean()` function, and the result got printed immediately?
If we want to save the result into an object, we'll do this, where `mean_result` could be any name you choose:
```{r}
mean_result <- mean(mydata$var2, na.rm = TRUE)
```
Note how the example above uses both operators: the assignment arrow for saving the result to the Environment, the `=` equals operator for setting an argument in the `mean()` function (`na.rm = TRUE`).
### Recap: object, function, input, argument
* To summarise, objects and functions work hand in hand.
Objects are both an input as well as the output of a function (what the function returns).
* When passing data to a function, it is usually the first argument, with further arguments used to specify behaviour.
* When we say "the function returns", we are referring to its output (or an Error if it's one of those days).
* The returned object can be different to its input object.
In our `mean()` examples above, the input object was a column (`mydata$var1`: `r mydata$var1`), whereas the output was a single value: `r mean(mydata$var1)`.
* If you've written a line of code that doesn't include the assignment arrow (`<-`), its results would get printed.
If you use the assignment arrow, an object holding the results will get saved into the Environment.
## Pipe - `%>%`
\index{symbols@\textbf{symbols}!pipe \texttt{\%>\%}}
\index{pipe@\textbf{pipe}}
The pipe - denoted `%>%` - is probably the oddest looking thing you'll see in this book.
But please bear with us; it is not as scary as it looks!
Furthermore, it is super useful.
We use the pipe to send objects into functions.
In the above examples, we calculated the mean of column `var1` from `mydata` with `mean(mydata$var1)`.
With the pipe, we can rewrite this as:
```{r}
library(tidyverse)
mydata$var1 %>% mean()
```
Which reads: "Working with `mydata`, we select a single column called `var1` (with the `$`) **and then** calculate the `mean()`."
The pipe becomes especially useful once the analysis includes multiple steps applied one after another.
A good way to read and think of the pipe is "and then".
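To see why the pipe shines once analyses grow, compare a nested call (read from the inside out) with the piped version of the same steps - a small sketch reusing `typesdata` from earlier:
```{r}
# nested: read from the inside out
round(mean(typesdata$measurement), digits = 1)

# piped: read from left to right, "and then"
typesdata$measurement %>% 
  mean() %>% 
  round(digits = 1)
```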
This piping business is not standard R functionality and before using it in a script, you need to tell R this is what you will be doing.
The pipe comes from the `magrittr` package (Figure \@ref(fig:chap2-fig-pipe)), but loading the __tidyverse__ will also load the pipe.
So `library(tidyverse)` initialises everything you need.
>To insert a pipe `%>%`, use the keyboard shortcut `Ctrl+Shift+M`.
With or without the pipe, the general rule "if the result gets printed it doesn't get saved" still applies.
To save the result of the function into a new object (so it shows up in the Environment), you need to add the name of the new object with the assignment arrow (`<-`):
```{r}
mean_result <- mydata$var1 %>% mean()
```
```{r chap2-fig-pipe, out.width="70%", echo = FALSE, fig.cap="This is not a pipe. René Magritte inspired artwork, by Stefan Milton Bache."}
knitr::include_graphics("images/chapter02/magrittr.png")
```
### Using . to direct the pipe
By default, the pipe sends data to the beginning of the function brackets (as most of the functions we use expect data as the first argument).
So `mydata %>% lm(dependent~explanatory)` is equivalent to `lm(mydata, dependent~explanatory)`.
`lm()` - linear model - will be introduced in detail in Chapter \@ref(chap07-h1).
However, the `lm()` function does not expect data as its first argument.
`lm()` wants us to specify the variables first (`dependent~explanatory`), and then wants the tibble these columns are in.
So we have to use the `.` to tell the pipe to send the data to the second argument of `lm()`, not the first, e.g.,
```{r, eval = FALSE}
mydata %>%
lm(var1~var2, data = .)
```
## Operators for filtering data
\index{operators}
\index{functions@\textbf{functions}!filter}
\index{symbols@\textbf{symbols}!less than \texttt{<}}
\index{symbols@\textbf{symbols}!greater than \texttt{>}}
\index{symbols@\textbf{symbols}!less or equal \texttt{<=}}
\index{symbols@\textbf{symbols}!greater or equal \texttt{>=}}
\index{symbols@\textbf{symbols}!equal \texttt{=}}
\index{symbols@\textbf{symbols}!not \texttt{"!}}
\index{symbols@\textbf{symbols}!AND \texttt{\&}}
\index{symbols@\textbf{symbols}!OR \texttt{"|}}
Operators are symbols that tell R how to handle different pieces of data or objects.
We have already introduced three: `$` (selects a column), `<-` (assigns values or results to a variable), and the pipe - `%>%` (sends data into a function).
Other common operators are the ones we use for filtering data - these are arithmetic comparison and logical operators.
This may be for creating subgroups, or for excluding outliers or incomplete cases.
The comparison operators that work with numeric data are relatively straightforward: `>, <, >=, <=`.
The first two check whether your values are greater or less than another value, the last two check for "greater than or equal to" and "less than or equal to".
These operators are most commonly spotted inside the `filter()` function:
```{r}
gbd_short %>%
filter(year < 1995)
```
Here we send the data (`gbd_short`) to the `filter()` and ask it to retain all years that are less than 1995.
The resulting tibble only includes the year 1990.
Now, if we use the `<=` (less than or equal to) operator, both 1990 and 1995 pass the filter:
```{r}
gbd_short %>%
filter(year <= 1995)
```
Furthermore, the values either side of the operator could both be variables, e.g., `mydata %>% filter(var2 > var1)`.
To filter for values that are equal to something, we use the `==` operator.
```{r}
gbd_short %>%
filter(year == 1995)
```
This reads, take the GBD dataset, send it to the filter and keep rows where year is equal to 1995.
Accidentally using the single equals `=` when double equals is necessary `==` is a common mistake and still happens to the best of us.
It happens so often that the error the `filter()` function gives when using the wrong one also reminds us what the correct one was:
```{r, error = TRUE}
gbd_short %>%
filter(year = 1995)
```
> The answer to "do you need ==?" is almost always, "Yes R, I do, thank you".
But that's just because `filter()` is a clever cookie and is used to this common mistake.
There are other useful functions we use these operators in, but they don't always know to tell us that we've just confused `=` for `==`.
So if you get an error when checking for an equality between variables, always check your `==` operators first.
R also has two operators for combining multiple comparisons: `&` and `|`, which stand for AND and OR, respectively.
For example, we can filter to only keep the earliest and latest years in the dataset:
```{r}
gbd_short %>%
filter(year == 1995 | year == 2017)
```
This reads: take the GBD dataset, send it to the filter and keep rows where year is equal to 1995 OR year is equal to 2017.
Using specific values like we've done here (1995/2017) is called "hard-coding", which is fine if we know for sure that we will not want to use the same script on an updated dataset.
But a cleverer way of achieving the same thing is to use the `min()` and `max()` functions:
```{r}
gbd_short %>%
filter(year == max(year) | year == min(year))
```
```{r chap2-tab-filtering-operators, echo = FALSE}
Operators <- c("==", "!=", "<", ">", "<=", ">=", "&", "|")
Meaning   <- c("Equal to", "Not equal to", "Less than", "Greater than",
               "Less than or equal to", "Greater than or equal to", "AND", "OR")
testdata <- tibble(Operators, Meaning)
testdata %>% 
  knitr::kable(booktabs = TRUE,
               linesep = c(rep("", 6), "\\addlinespace"),
               align = c("l", "c", "r"),
               caption = "Filtering operators.") %>% 
  kableExtra::kable_styling(font_size = 8)
```
### Worked examples
Filter the dataset to only include the year 2000.
Save this in a new variable using the assignment operator.
```{r, echo = TRUE}
mydata_year2000 <- gbd_short %>%
filter(year == 2000)
```
Let's practice combining multiple selections together.
Reminder: '|' means OR and '&' means AND.
From `gbd_short`, select the lines where year is either 1990 or 2017 and cause is "Communicable diseases":
```{r}
new_data_selection <- gbd_short %>% 
  filter((year == 1990 | year == 2017) & cause == "Communicable diseases")

# Or we can get rid of the extra brackets around the years
# by moving cause into a new filter on a new line:
new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2017) %>% 
  filter(cause == "Communicable diseases")

# Or even better, we can include both in one filter() call, as all
# separate conditions are by default joined with "&":
new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2017,
         cause == "Communicable diseases")
```
\index{symbols@\textbf{symbols}!comment \texttt{\#}}
The hash symbol (`#`) is used to add free text comments to R code.
R will not try to run these lines, they will be ignored.
Comments are an essential part of any programming code and these are "Dear Diary" notes to your future self.
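For example:
```{r}
# R ignores this whole line
1 + 1  # a comment can also follow code on the same line
```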
## The combine function: `c()`
\index{functions@\textbf{functions}!c() combine}
The combine function, as its name implies, is used to combine several values.
It is especially useful when used with the `%in%` operator to filter for multiple values.
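On its own, `c()` simply combines its arguments into a vector:
```{r}
c(1990, 1995, 2000)
c("Communicable diseases", "Injuries")
```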
Remember how the `gbd_short` cause column had three different causes in it:
```{r}
gbd_short$cause %>% unique()
```
Say we wanted to filter for communicable and non-communicable diseases.
^[In this example, it would just be easier to use the "not equal" operator, `filter(cause != "Injuries")`, but imagine your column had more than just three different values in it.] We could use the OR operator `|` like this:
```{r}
gbd_short %>%
# also filtering for a single year to keep the result concise
filter(year == 1990) %>%
filter(cause == "Communicable diseases" | cause == "Non-communicable diseases")
```
But that means we have to type in `cause` twice (and more if we had other values we wanted to include).
This is where the `%in%` operator together with the `c()` function come in handy:
```{r}
gbd_short %>%
filter(year == 1990) %>%
filter(cause %in% c("Communicable diseases", "Non-communicable diseases"))
```
## Missing values (NAs) and filters
\index{missing values}
Filtering for missing values (NAs) needs special attention and care.
Remember the small example tibble from Table \@ref(tab:chap2-tab-examp1) - it has some NAs in columns `var2` and `var3`:
```{r}
mydata
```
If we now want to filter for rows where `var2` is missing, `filter(var2 == NA)` is not the way to do it - it will not work.
Since R is a programming language, it can be a bit stubborn with things like these.
When you ask R to do a comparison using `==` (or `<`, `>`, etc.) it expects a value on each side, but NA is not a value, it is the lack thereof.
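You can see this for yourself: a comparison against `NA` returns `NA` rather than `TRUE` or `FALSE`, so a filter based on it has nothing to keep:
```{r}
# comparisons involving NA return NA, even NA == NA:
NA == 5
NA == NA
```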
The way to filter for missing values is using the `is.na()` function:
```{r}
mydata %>%
filter(is.na(var2))
```
We send `mydata` to the filter and keep rows where `var2` is `NA`.
Note the double brackets at the end: that's because the inner one belongs to `is.na()`, and the outer one to `filter()`.
Missing out a closing bracket is also a common source of errors, and still happens to the best of us.
If filtering for rows where `var2` is not missing, we do this^[In this simple example, `mydata %>% filter(!is.na(var2))` could be replaced by a shorthand: `mydata %>% drop_na(var2)`, but it is important to understand how the `!` and `is.na()` work as there will be more complex situations where using these is necessary.]:
```{r}
mydata %>%
filter(!is.na(var2))
```
In R, the exclamation mark (!) means "not".
Sometimes you want to drop a specific value (e.g., an outlier) from the dataset like this.
The small example tibble `mydata` has 4 rows, with the values for `var2` as follows: `r mydata$var2`.
We can exclude the row where `var2` is equal to 5 by using the "not equals" operator (`!=`)^[`filter(var2 != 5)` is equivalent to `filter(!(var2 == 5))`.]:
```{r}
mydata %>%
filter(var2 != 5)
```
However, you'll see that by doing this, R drops the rows where `var2` is NA as well, as it can't be sure these missing values were not equal to 5.
If you want to keep the missing values, you need to make use of the OR (`|`) operator and the `is.na()` function: