-
Notifications
You must be signed in to change notification settings - Fork 90
/
01_introduction.Rmd
290 lines (208 loc) · 15.5 KB
/
01_introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
# (PART) Data wrangling and visualisation {-}
# Why we love R
Thank you for choosing this book on using R for health data analysis.
Even if you're already familiar with the R language, we hope you will find some new approaches here as we make the most of the latest R tools including some we've developed ourselves.
Those already familiar with R are encouraged to still skim through the first few chapters to familiarise yourself with the style of R we recommend.
R can be used for all the health data science applications we can think of.
From bioinformatics and computational biology, to administrative data analysis and natural language processing, through internet-of-things and wearable data, to machine learning and artificial intelligence, and even public health and epidemiology.
R has it all.
Here are the main reasons we love R:
* R is versatile and powerful - use it for
- graphics;
- all the statistical tests you can dream of;
- machine learning and deep learning;
- automated reports;
- websites;
- and even books (yes, this book was written entirely in R).
* R scripts can be reused - gives you efficiency and reproducibility.
* It is free to use by anyone, anywhere.
<!-- ![R logo, © 2016 The R Foundation.](images/chapter01/Rlogo.png){width=150px} -->
![](images/chapter01/Rlogo.png){width=150px}
## Help, what's a script? {#chap01-what-script}
\index{RStudio@\textbf{RStudio}!script}
A script is a list of instructions.
It is just a text file and no special software is required to view one.
An example R script is shown in Figure \@ref(fig:chap01-fig-rscript).
**Don't panic!**
The only thing you need to understand at this point is that what you're looking at is a list of instructions written in the R language.
\index{comments}
You should also notice that some parts of the script look like normal English.
These are the lines that start with a # and they are called "comments".
We can (and should) include these comments in everything we do.
These are notes of what we were doing, both for colleagues as well as our future selves.
```{r chap01-fig-rscript, echo = FALSE, fig.cap="An example R script from RStudio.", out.width="90%"}
knitr::include_graphics("images/chapter01/example_script.png")
```
Lines that do not start with # are R code.
This is where the number crunching really happens.
We will cover the details of this R code in the next few chapters.
The purpose of this chapter is to describe some of the terminology as well as the interface and tools we use.
For the impatient:
* We interface R using RStudio
* We use the **tidyverse** packages that are a substantial extension to base R functionality (we repeat: extension, not replacement)
Even though R is a language, don't think that after reading this book you should be able to open a blank file and just start typing in R code like an evil computer genius from a movie.
This is not what real-world programming looks like.
Firstly, you should be copy-pasting and adapting existing R code examples - whether from this book, the internet, or later from your existing work.
Re-writing everything from scratch is not efficient.
Yes, you will understand and eventually remember a lot of it, but to spend time memorising specific functions that can easily be looked up and copied is simply not necessary.
Secondly, R is an interactive language.
Meaning that we "run" R code line by line and get immediate feedback.
We do not write a whole script without trying each part out as we go along.
Thirdly, do not worry about making mistakes.
Celebrate them!
The whole point of R and reproducibility is that manipulations are not applied directly on a dataset, but a copy of it.
Everything is in a script, so you can't do anything wrong.
If you make a mistake like accidentally overwriting your data, we can just reload it, rerun the steps that worked well and continue figuring out what went wrong at the end.
And since all of these steps are written down in a script, R will redo everything with a single push of a button.
You do not have to repeat a set of mouse clicks from dropdown menus as in other statistical packages, which quickly becomes a blessing.
## What is RStudio?
\index{RStudio}
RStudio is a free program that makes working with R easier.
An example screenshot of RStudio is shown in Figure \@ref(fig:chap01-fig-rstudio).
We have already introduced what is in the top-left pane - the **Script**.
```{r chap01-fig-rstudio, echo = FALSE, fig.cap = "We use RStudio to work with R.", out.width="100%"}
knitr::include_graphics("images/chapter01/rstudio_interface.png")
```
Now, look at the little **Run** and **Source** buttons at the top-right corner of the script pane.
Clicking **Run** executes a line of R code.
Clicking **Source** executes all lines of R code in the script (it is essentially 'Run all lines').
When you run R code, it gets sent to the **Console** which is the bottom-left panel.
This is where R really lives.
> Keyboard Shortcuts!
> Run line: Control+Enter
> Run all lines (Source): Control+Shift+Enter
> (On a Mac, both Control or Command work)
The Console is where R speaks to us.
When we're lucky, we get results in there - in this example the results of a *t*-test (last line of the script).
When we're less lucky, this is also where Errors or Warnings appear.
R Errors are a lot less scary than they seem!
Yes, if you're using a regular computer program where all you do is click on some buttons, then getting a proper red error that stops everything is quite unusual.
But in programming, Errors are just a way for R to communicate with us.
We see Errors in our own work every single day, they are very normal and do not mean that everything is wrong or that you should give up.
Try to re-frame the word Error to mean "feedback", as in "Hello, this is R. I can't continue, this is the feedback I am giving you."
The most common Errors you'll see are along the lines of "Error: something not found".
This almost always means there's a typo or you've misspelled something.
Furthermore, R is case sensitive so capitalisation matters (variable name `lifeExp` is not the same as `lifeexp`).
The Console can only print text, so any plots you create in your script appear in the **Plots** pane (bottom-right).
Similarly, datasets that you've loaded or created appear in the **Environment** tab.
When you click on a dataset, it pops up in a nice viewer that is fast even when there is a lot of data.
This means you can have a look and scroll through your rows and columns, the same way you would with a spreadsheet.
## Getting started
\index{installation@\textbf{installation}!tidyverse}
\index{installation@\textbf{installation}!packages}
\index{installation@\textbf{installation}!RStudio}
To start using R, you should do these two things:
* Install R (from https://www.r-project.org/)
* Install RStudio Desktop (from https://www.rstudio.com/)
When you first open up RStudio, you'll also want to install some extra packages to extend the base R functionality.
You can do this in the **Packages** tab (next to the Plots tab in the bottom-right in Figure \@ref(fig:chap01-fig-rstudio)).
A Package is just a collection of functions (commands) that are not included in the standard R installation, called base-R.
A lot of the functionality introduced in this book comes from the __tidyverse__ family of R packages (http://tidyverse.org @tidyverse2019).
So when you go to Packages, click **Install**, type in __tidyverse__, and a whole collection of useful and modern packages will be installed.
Even though you've installed the __tidyverse__ packages, you'll still need to tell R when you're about to use them.
We include `library(tidyverse)` at the top of every script we write:
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```
```{r, echo = FALSE, out.width="70%"}
knitr::include_graphics("images/chapter01/tidyverse_loading_messages.png")
```
We can see that it has loaded 8 packages (__ggplot2__, __tibble__, __tidyr__, __readr__, __purrr__, __dplyr__, __stringr__, __forcats__), the number behind a package name is its version.
The "Conflicts" message is expected and can safely be ignored.^[It just means that when we use `filter` or `lag`, they will come from the __dplyr__ package, rather than the __stats__ package.
We've never needed to use `filter` and `lag` from __stats__, but if you do, then use the double colon, i.e., `stats::filter()` or `stats::lag()`, as just `filter()` will use the __dplyr__ one.]
There are a few other R packages that we use and are not part of the tidyverse, but we will introduce them as we go along.
If you're incredibly curious, head to the Resources section of the HealthyR website which is the best place to find up-to-date links and installation instructions. Our R and package versions are also listed in the Appendix.
## Getting help
\index{help}
\index{errors}
The best way to troubleshoot R errors is to copy-paste them into a search engine (e.g., Google).
Searching online is also a great way to learn how to do new specific things or to find code examples.
You should copy-paste solutions into your R script to then modify to match what you're trying to do.
We are constantly copying code from online forums and our own existing scripts.
However, there are many different ways to achieve the same thing in R.
Sometimes you'll search for help and come across R code that looks nothing like what you've seen in this book.
The __tidyverse__ packages are relatively new and use the pipe (` %>%`), something we'll come on to.
But search engines will often prioritise older results that use a more traditional approach.
So older solutions may come up at the top.
Don't get discouraged if you see R code that looks completely different to what you were expecting.
Just keep scrolling down or clicking through different answers until you find something that looks a little bit more familiar.
If you're working offline, then RStudio's built in **Help** tab is useful.
To use the Help tab, click your cursor on something in your code (e.g., `read_csv()`) and press F1.
This will show you the definition and some examples.
F1 can be hard to find on some keyboards, an alternative is to type, e.g., `?read_csv`.
This will also open the Help tab for this function.
However, the Help tab is only useful if you already know what you are looking for but can't remember exactly how it works.
For finding help on things you have not used before, it is best to Google it.
R has about 2 million users so someone somewhere has probably had the same question or problem.
RStudio also has a Help drop-down menu at the very top (same row where you find "File", "Edit", ...).
The most notable things in the Help drop-down menu are the Cheatsheets.
These tightly packed two-pagers include many of the most useful functions from `tidyverse` packages.
They are not particularly easy to learn from, but invaluable as an *aide-mémoire*.
## Work in a Project
The files on your computer are organised into folders.
RStudio Projects live in your computer's normal folders - they placemark the working directory of each analysis project.
These project folders can be viewed or moved around the same way you normally work with files and folders on your computer.
The top-right corner of your RStudio should never say **"Project: (None)"**.
If it does, click on it and create a New Project.
After clicking on New Project, you can decide whether to let RStudio create a New Directory (folder) on your computer. Alternatively, if your data files are already organised into an "Existing folder", use the latter option.
Every set of analysis you are working on must have its own folder and RStudio project.
This enables you to switch between different projects without getting the data, scripts, or output files all mixed up.
Everything gets read in or saved to the correct place.
No more exporting a plot and then going through the various Documents, etc., folders on your computer trying to figure out where your plot might have been saved to.
It got saved to the project folder.
## Restart R regularly
Have you tried turning it off and on again? It is vital to restart R regularly.
Restarting R helps to avoid accidentally using the wrong data or functions stored in the environment.
Restarting R only takes a second and we do it several times per day!
Once you get used to saving everything in a script, you'll always be happy to restart R.
This will help you develop robust and reproducible data analysis skills.
> You can restart R by clicking on `Session -> Restart R` (top menu).
```{r chap01-fig-settings, echo = FALSE, fig.cap = "Configuring your RStudio Tools -> Global Options: Untick ``Restore .RData into Workspace on Exit\" and Set ``Save .RData on exit\" to Never.", out.width="100%"}
knitr::include_graphics("images/chapter01/rstudio_settings.png")
```
<!-- There's also a Keyboard shortcut for it: `Ctrl+Shift+F10`, but considering how hard it is to use function keys on laptops, that's hardly a shortcut. -->
<!-- > We recommend changing your "Restart R" keyboard shortcut to `Control+R `or `Cmd+R` (`Cmd+Shift+R `if using RStudio Server or Cloud). -->
<!-- This can be done in `Tools -> Modify Keyboard Shortcuts`. -->
Furthermore, RStudio has a default setting that is no longer considered best practice (Figure \@ref(fig:chap01-fig-settings)).
After installing RStudio, you should go change two small but important things in `Tools -> Global Options`:
1. **Uncheck** "Restore .RData into Workspace on startup"
2. Set "Save .RData on exit" to **Never**
This does not mean you can't or shouldn't save your work in `.RData/.rda` files.
But it is best to do it consciously and load exactly what you want to load.
Letting R silently save and load everything for you may also include broken data or objects.
## Notation throughout this book
When mentioned in the text, the names of R packages are in bold font, e.g., __ggplot2__, whereas functions, objects, and variables are printed with mono-spaced font, e.g `filter()`, `mean()`, `lifeExp`. Functions are always followed by brackets: `()`, whereas data objects or variables are not.
Otherwise, R code lives in the grey areas known as 'code chunks'.
Lines of R *output* start with a double ## - this will be the numbers or text that R gives us after executing code.
R also adds a counter at the beginning of every new line; look at the numbers in the square brackets [] below:
```{r}
# colon between two numbers creates a sequence
1001:1017
```
Remember, lines of R code that start with # are called comments.
We already introduced comments as notes about the R code earlier in this chapter (Section \@ref(chap01-what-script) "Help, what's a script?"), however, there is a second use case for comments.
When you make R code a comment, by adding a # in front of it, it gets 'commented out'.
For example, let's say your R script does two things, prints numbers from 1 to 4, and then numbers from 1001 to 1004:
```{r}
# Let's print small numbers:
1:4
# Now we're printing bigger numbers:
1001:1004
```
If you decide to 'comment out' the printing of big numbers, the code will look like this:
```{r, results='hold'}
# Let's print small numbers:
1:4
# Now we're printing bigger numbers:
# 1001:1004
```
You may even want to add another real comment to explain why the latter was commented out:
```{r}
# Now commented out as not required any more
# Now we're printing bigger numbers:
# 1001:1004
```
You could of course delete the line altogether, but commenting out is useful as you might want to include the lines later by removing the # from the beginning of the line.
> Keyboard Shortcut for commenting out/commenting in multiple lines at a time:
> Control+Shift+C
> (On a Mac, both Control or Command work)