-
Notifications
You must be signed in to change notification settings - Fork 90
/
12_notebooks.Rmd
419 lines (316 loc) · 17.3 KB
/
12_notebooks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
# Notebooks and Markdown{#chap12-h1}
\index{notebooks@\textbf{notebooks}}
\index{markdown@\textbf{markdown}}
> You ask me if I keep a notebook to record my great ideas. I've only ever had one.
> Albert Einstein
## What is a Notebook?
R is all-powerful for the manipulation, visualisation and analysis of data.
What is often under-appreciated is the flexibility with which analyses can be exported or reported.
For instance, a full scientific paper, industry report, or monthly update can be easily written to accommodate a varying underlying dataset, and all tables and plots will be updated automatically.
This idea can be extended to a workflow in which all analyses are performed primarily within a document which doubles as the final report.
Enter "Data Notebooks"!
Notebooks are documents which combine code and rich text elements, such as headings, paragraphs and links, in one document.
They combine analysis and reporting in one human-readable document to provide an intuitive interface between the researcher and their analysis (Figure \@ref(fig:chap12-fig-literate)).
This is sometimes called "literate programming", given the resulting logical structure of information can be easily read in the manner a human would read a book.
```{r chap12-fig-literate, echo = FALSE, fig.cap="Traditional versus literate programming using Notebooks."}
knitr::include_graphics("images/chapter12/1_literate_programming.png", auto_pdf = TRUE)
```
In our own work, we have now moved to doing most of our analyses in a Notebook file, rather than using a "script" file. You may not have guessed, but this whole book is written in this way.
Some of the advantages of the Notebook interface are:
* code and output are adjacent to each other, so you are not constantly switching between "panes";
* easier to work on smaller screen, e.g., laptop;
* documentation and reporting can be done beside the code, text elements can be fully formatted;
* the code itself can be outputted or hidden;
* the code is not limited to R - you can use Python, SQL etc.;
* facilitate collaboration by easily sharing human-readable analysis documents;
* can be outputted in a number of formats including HTML (web page), PDF, and Microsoft Word;
* output can be extended to other formats such as presentations;
* training/learning may be easier as course materials, examples, and student notes are all in the same document.
## What is Markdown?
Markdown is a lightweight language that can be used to write fully-formatted documents.
It is plain-text and uses a simple set of rules to produce rather sophisticated output - we love it!
It is easy to format headings, bold text, italics, etc.
Within RStudio there is a Quick Reference guide (Figure \@ref(fig:chap12-fig-help)) and links to the [RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets) can be found in the Help drop-down menu.
```{r chap12-fig-help, echo = FALSE, fig.cap="RStudio Markdown quick reference guide."}
knitr::include_graphics("images/chapter12/3_help.png", auto_pdf = TRUE)
```
Markdown exists independent of R and is used by a range of techies and alike.
A combination of Markdown (which is text with special characters to indicate desired formatting) and R code within it (usually to produce tables and plots) is called R Markdown.
R scripts have the file extension `.R`, Markdown documents have a file extension `.md`, therefore, R Markdown documents are `.Rmd`.
## What is the difference between a Notebook and an R Markdown file?
Most people use the terms R Notebook and R Markdown interchangeably and that is fine.
Technically, R Markdown is a file, whereas R Notebook is a way to work with R Markdown files.
R Notebooks do not have their own file format, they all use `.Rmd`.
All R Notebooks can be 'knitted' to R Markdown outputs, and all R Markdown documents can be interfaced as a Notebook.
An important difference is in the execution of code.
In R Markdown, when the file is `Knit`, all the elements (chunks) are also run.
Knit is to R Markdown what Source is to an R script (Source was introduced in Chapter 1, essentially it means 'Run all lines').
In a Notebook, when the file is rendered with the `Preview` button, no code is re-run, only that which has already been run and is present in the document is included in the output.
Also, in the Notebook behind-the-scenes file (`.nb`), all the code is always included.
Something to watch out for if your code contains sensitive information, such as a password (which it never should!).
## Notebook vs HTML vs PDF vs Word
\index{Microsoft Word}
\index{PDF}
\index{HTML}
In RStudio, a Notebook can be created by going to:
File -> New File -> R Notebook
Alternatively, you can create a Markdown file using:
File -> New File -> R Markdown...
Don't worry which you choose.
As mentioned above, they are essentially the same thing but just come with different options.
It is easy to switch from a Notebook to a Markdown file if you wish to create a PDF or Word document for instance.
If you are primarily doing analysis in the Notebook environment, choose Notebook.
If you are primarily creating a PDF or Word document, choose R Markdown file.
## The anatomy of a Notebook / R Markdown file
When you create a file, a helpful template is provided to get you started.
Figure \@ref(fig:chap12-fig-anatomy) shows the essential elements of a Notebook file and how these translate to the `HTML` preview.
### YAML header
\index{YAML header}
Every Notebook and Markdown file requires a "YAML header".
Where do they get these terms you ask?
Originally YAML was said to mean Yet Another Markup Language, referencing its purpose as a markup language.
It was later repurposed as YAML Ain't Markup Language, a recursive acronym, to distinguish its purpose as data-oriented rather than document markup (thank you Wikipedia).
This is simply where many of the settings/options for file creation are placed.
In RStudio, these often update automatically as different settings are invoked in the Options menu.
```{r chap12-fig-anatomy, echo = FALSE, fig.cap="The Anatomy of a Notebook/Markdown file. Input (left) and output (right)."}
# rotated_png is actuyally upright so that is shows up correct in the HTML
# needs to have the same file name though so that auto_pdf picks up the right file
knitr::include_graphics("images/chapter12/2_anatomy_rotated.png", auto_pdf = TRUE)
```
### R code chunks
\index{chunks}
R code within a Notebook or Markdown file can be included in two ways:
* in-line: e.g., the total number of oranges was `` `r `` `` sum(fruit$oranges)` ``;
* as a "chunk".
R chunks are flexible, come with lots of options, and you will soon get into the way of using them.
Figure \@ref(fig:chap12-fig-anatomy) shows how a chunk fits into the document.
```` markdown
`r ''````{r}
# This is basic chunk.
# It always starts with ```{r}
# And ends with ```
# Code goes here
sum(fruit$oranges)
```
````
This may look off-putting, but just go with it for now.
You can type the four back-ticks in manually, or use the `Insert` button and choose `R`.
You will also notice that chunks are not limited to R code.
It is particularly helpful that Python can also be run in this way.
When doing an analysis in a Notebook you will almost always want to see the code and the output.
When you are creating a final document you may wish to hide code.
Chunk behaviour can be controlled via the `Chunk Cog` on the right of the chunk (Figure \@ref(fig:chap12-fig-anatomy)).
Table \@ref(tab:chap12-tab-chunk-output) shows the various permutations of code and output options that are available.
The code is placed in the chunk header but the options fly-out now does this automatically, e.g.,
```` markdown
`r ''````{r, echo=FALSE}
```
````
```{r chap12-tab-chunk-output, echo=FALSE, message=FALSE}
library(dplyr)
data.frame(
Option = c(
"Show output only",
"Show code and output",
"Show code (don't run code)",
"Show nothing (run code)",
"Show nothing (don't run code)",
"",
"Hide warnings",
"Hide messages"
),
Code = c(
"echo=FALSE",
"echo=TRUE",
"eval=FALSE",
"include=FALSE",
"include=FALSE, eval=FALSE",
"",
"warnings=FALSE",
"messages=FALSE"
)
) %>%
knitr::kable(caption = "Chunk output options when knitting an R Markdown file. When using the Chunk Cog, RStudio will add these options appropriately; there is no need to memorise them.",
booktabs = TRUE)
```
### Setting default chunk options
We can set default options for all our chunks at the top of our document by adding and editing `knitr::opts_chunk$set(echo = TRUE)` at the top of the document.
```` markdown
`r ''````{r}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE)
```
````
### Setting default figure options
It is possible to set different default sizes for different output types by including these in the YAML header (or using the document cog):
```` markdown
---
title: "R Notebook"
output:
pdf_document:
fig_height: 3
fig_width: 4
html_document:
fig_height: 6
fig_width: 9
---
````
The YAML header is very sensitive to the spaces/tabs, so make sure these are correct.
### Markdown elements
Markdown text can be included as you wish around your chunks.
Figure \@ref(fig:chap12-fig-anatomy) shows an example of how this can be done.
This is a great way of getting into the habit of explicitly documenting your analysis.
When you come back to a file in 6 months' time, all of your thinking is there in front of you, rather than having to work out what on Earth you were on about from a collection of random code!
## Interface and outputting
### Running code and chunks, knitting
\index{knitr}
Figure \@ref(fig:chap12-fig-options) shows the various controls for running chunks and producing an output document.
Code can be run line-by-line using `Ctrl+Enter` as you are used to.
There are options for running all the chunks above the current chunk you are working on.
This is useful as a chunk you are working on will often rely on objects created in preceding chunks.
```{r chap12-fig-options, echo = FALSE, fig.cap="Chunk and document options in Notebook/Markdown files."}
# rotated_png is actuyally upright so that is shows up correct in the HTML
# needs to have the same file name though so that auto_pdf picks up the right file
knitr::include_graphics("images/chapter12/4_notebook_options_rotated.png", auto_pdf = TRUE)
```
It is good practice to use the `Restart R and Run All Chunks` option in the `Run` menu every so often.
This ensures that all the code in your document is self-contained and is not relying on an object in the environment which you have created elsewhere.
If this was the case, it will fail when rendering a Markdown document.
Probably the most important engine behind the RStudio Notebooks functionality is the **knitr** package by Yihui Xie.
Not knitting like your granny does, but rendering a Markdown document into an output file, such as HTML, PDF or Word.
There are many options which can be applied in order to achieve the desired output.
Some of these have been specifically coded into RStudio (Figure \@ref(fig:chap12-fig-options)).
PDF document creation requires a `LaTeX` distribution to be installed on your computer.
Depending on what system you are using, this may be setup already.
An easy way to do this is using the **tinytex** package.
```` markdown
`r ''````{r}
install.packages("tinytex")
# Restart R, then run
tinytex::install_tinytex()
```
````
In the next chapter we will focus on the details of producing a polished final document.
## File structure and workflow
\index{file structure@\textbf{file structure}}
\index{workflow@\textbf{workflow}}
As projects get bigger, it is important that they are well organised.
This will avoid errors and make collaboration easier.
What is absolutely compulsory is that your analysis must reside within an RStudio Project and have a meaningful name (not MyProject! or Analysis1).
Creating a New Project on RStudio will automatically create a new folder for itself (unless you choose "Existing Folder").
Never work within a generic Home or Documents directory.
Furthermore, do not change the working directory using `setwd()` - there is no reason to do this, and it usually makes your analysis less reproducible.
Once you're starting to get the hang of R, you should initiate all Projects with a Git repository for version control (see Chapter \@ref(chap13-h1)).
For smaller projects with 1-2 data files, a couple of scripts and an R Markdown document, it is fine to keep them all in the Project folder (but we repeat, each Project must have its own folder).
Once the number of files grows beyond that, you should add separate folders for different types of files.
Here is our suggested approach.
Based on the nature of your analyses, the number of folders may be smaller or greater than this, and they may be called something different.
```
proj/
- scripts/
- data_raw/
- data_processed/
- figures/
- 00_analysis.Rmd
```
`scripts/` contains all the `.R` script files used for data cleaning/preparation. If you only have a few scripts, it's fine to not have this one and just keep the `.R` files in the project folder (where `00_analysis.Rmd` is in the above example).
`data_raw/` contains all raw data, such as `.csv` files, `data_processed/` contains data you've taken from raw, cleaned, modified, joined or otherwise changed using R scripts.
`figures/` may contain plots (e.g., `.png`, `.jpg`, `.pdf`)
`00_analysis.Rmd` or `00_analysis.R` is the actual main working file, and we keep this in the main project directory.
Your R scripts should be numbered using double digits, and they should have meaningful names, for example:
```
scripts/00_source_all.R
scripts/01_read_data.R
scripts/02_make_factors.R
scripts/03_duplicate_records.R
```
For instance, `01_read_data.R` may look like this.
``` r
# Melanoma project
## Data pull
# Get data
library(readr)
melanoma <- read_csv(
here::here("data_raw", "melanoma.csv")
)
# Other basic reccoding or renaming functions here
# Save
save(melanoma, file =
here::here("data_processed", "melanoma_working.rda")
)
```
Note the use of `here::here()`.
RStudio projects manage working directories in a better way than `setwd()`.
`here::here()` is useful when sharing projects between Linux, Mac and Windows machines, which have different conventions for file paths.
For instance, on a Mac you would otherwise do `read_csv("data/melanoma.csv")` and on Windows you would have to do `read_csv("data\melanoma.csv")`.
Having to include either `/` (GNU/Linux, macOS) or `\` (Windows) in your script means it will have to be changed by hand when running on a different system.
What `here::here("data_raw", "melanoma.csv")`, however, works on any system, as it will use an appropriate one 'behind the scenes' without you having to change anything.
`02_make_factors.R` is our example second file, but it could be anything you want.
It could look something like this.
``` r
# Melanoma project
## Create factors
library(tidyverse)
load(
here::here("data_processed", "melanoma_working.rda")
)
## Recode variables
melanoma <- melanoma %>%
mutate(
sex = factor(sex) %>%
fct_recode("Male" = "1",
"Female" = "0")
)
# Save
save(melanoma, file =
here::here("data", "melanoma_working.rda")
)
```
All these files can then be brought together in a single file to `source()`.
This function is used to run code from a file.
`00_source_all.R` might look like this:
``` r
# Melanoma project
## Source all
source( here::here("scripts", "01_data_upload.R") )
source( here::here("scripts", "02_make_factors.R") )
source( here::here("scripts", "03_duplicate_records.R") )
# Save
save(melanoma, file =
here::here("data_processed", "melanoma_final.rda")
)
```
You can now bring your robustly prepared data into your analysis file, which can be `.R` or `.Rmd` if you are working in a Notebook.
We call this `00_analysis.Rmd` and it always sits in the project root director.
You have two options in bringing in the data.
1. `source("00_source_all.R")` to re-load and process the data again
+ this is useful if the data is changing
+ may take a long time if it is a large dataset with lots of manipulations
2. `load("melanoma_final.rda")` from the `data_processed/` folder
+ usually quicker, but loads the dataset which was created the last time you ran `00_source_all.R`
> Remember: For `.R` files use `source()`, for `.rda` files use `load()`.
The two options look like this:
```` markdown
---
title: "Melanoma analysis"
output: html_notebook
---
`r ''````{r get-data-option-1, echo=FALSE}
load(
here:here("data", "melanoma_all.rda")
)
```
`r ''````{r get-data-option-2, echo=FALSE}
source(
here:here("R", "00_source_all.R")
)
````
### Why go to all this bother?
It comes from many years of finding errors due to badly organised projects.
It is not needed for a small quick project, but is essential for any major work.
At the very start of an analysis (as in the first day), we will start working in a single file.
We will quickly move chunks of data cleaning / preparation code into separate files as we go.
Compartmentalising the data cleaning helps with finding and dealing with errors ('debugging').
Sourced files can be 'commented out' (adding a # to a line in the `00_source_all.R` file) if you wish to exclude the manipulations in that particular file.
Most important, it helps with collaboration.
When multiple people are working on a project, it is essential that communication is good and everybody is working to the same overall plan.