-
Notifications
You must be signed in to change notification settings - Fork 225
/
02-processing-data.Rmd
executable file
·334 lines (250 loc) · 9.97 KB
/
02-processing-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "MY472 - Week 2: Data processing in R with tidyr and dplyr"
author: "Ryan Hübert (AT 2024)"
date: "October 2024"
output: html_document
---
**Attribution statement:** _The following teaching materials have been
iteratively developed by current and former instructors, including: Friedrich
Geiecke and Ryan Hübert._
### Loading packages
```{r}
# If the package has not been installed, uncomment the line below and run it
# only once on your computer
# install.packages("tidyverse")
# Loading the package
library(tidyverse)
```
### Data
We will analyse some macroeconomic time series from the US and Europe during the
first wave of the COVID-19 crisis. The data has been downloaded in early October
2020 from FRED which is hosted by the Federal Reserve Bank of St. Louis. It
contains data for 2019 (as a reference year) and 2020 up to varying points,
because at the time of the download different series had been published up to
different months. Full series can be found at <https://fred.stlouisfed.org/>
under the following names:
__Industrial production__
- US: INDPRO
- UK: GBRPROINDMISMEI
- Germany: DEUPROINDMISMEI
- France: FRAPROINDMISMEI
- Spain: ESPPROINDMISMEI
- Italy: ITAPROINDMISMEI
__Unemployment rate__
- US: LRHUTTTTUSM156S
- UK: LRHUTTTTGBM156S
- Germany: LRHUTTTTDEM156S
- France: LRHUTTTTFRM156S
- Spain: LRHUTTTTESM156S
- Italy: LRHUTTTTITM156S
If you download the data again for newer time periods, note that past
macroeconomic data are continuously revised. Thus, also for 2019 and 2020 the
values might be slightly different now.
### Loading data
Make sure to have the data files in the same folder as this script. The data is
loaded via the `read.csv` function, this creates a data frame object:
```{r}
# Industrial production data in format one
ip_df <- read.csv("ip.csv")
# Cleaning column names
colnames(ip_df) <- str_replace(colnames(ip_df), "X", "")
head(ip_df)
```
```{r}
# Industrial production and unemployment data in format two
ip_and_unemployment_df <- read.csv("ip_and_unemployment.csv")
head(ip_and_unemployment_df)
```
### Tibbles
In the the tidyverse, data frames are often processed as so called tibbles. This
allows to depict them in a more convenient form, e.g. in the console and the
knitted R markdowns.
```{r}
# Transforming standard data frames into tibbles
ip <- as_tibble(ip_df)
ip_and_unemployment <- as_tibble(ip_and_unemployment_df)
print(ip)
print(ip_and_unemployment)
```
For further features of tibbles, see <https://r4ds.had.co.nz/tibbles.html>.
You can also import a `csv` file directly into a tibble by using the `read_csv`
function, which is available from the `readr` package in `tidyverse`.
Another useful way to view the full dataset can be the function `View()` in
RStudio:
```{r, eval=FALSE}
View(ip)
View(ip_and_unemployment)
```
### The pipe operator
A very frequently used operator for data science in R today is the pipe operator
`%>%`. It pipes an object as the first argument into a subsequent function. This
is an alternative notation to writing the object as the first input in the
function directly. Let us briefly illustrate this with a simple example, we use
the maximum function to determine the largest element in a vector:
```{r}
exemplary_vector <- c(-42, -5, 17, 24, -34, 93, 18)
# Two ways to determine the largest element
max(exemplary_vector)
exemplary_vector %>% max()
```
The pipe operator can be used in this example and yields the same outcome as the
max function applied in the traditional way. The operator becomes particularly
useful, however, if multiple operations are done in a sequence. To get a first
idea of this, we define a function which returns the negative elements in a
vector, and then compute the mean of only these values:
```{r}
# Create a function which only returns negative elements in a vector
return_negative_elements <- function(x) {
return(x[x<0])
}
# Mean of only negative values
exemplary_vector %>%
return_negative_elements() %>%
mean()
```
We could have achieved the same with
`mean(exemplary_vector[exemplary_vector<0])`
or with nested calls of functions such as
`mean(return_negative_elements(exemplary_vector))`,
but using chains of pipe operators is very readable. This becomes particularly
helpful when analysing datasets later.
### Pivoting into a longer form
In the industrial production data frame, column names are months which should be
a separate variable:
```{r}
print(ip)
```
To bring the data into the "tidy" format, these columns are transformed into a
new pair of variables:
```{r}
ip_long <- pivot_longer(ip, cols = colnames(ip)[2:length(colnames(ip))],
names_to = "date", values_to = "ip")
ip_long
```
In the lecture, we solved this problem with the pipe operator `%>%`. It yields
the same outcome:
```{r}
ip_long <- ip %>% pivot_longer(cols = colnames(ip)[2:length(colnames(ip))],
names_to = "date", values_to = "ip")
ip_long
```
### Pivoting into a wider form
Next, let us consider the data frame which contains both industrial production
and unemployment rates. In this data frame, a unit of observation is a
country-month. Values contained in the series column are actually separate
variables.
```{r}
print(ip_and_unemployment)
```
Just like in the lecture example, we use `pivot_wider()` to generate these
variables:
```{r}
ip_and_unemployment_wide <- ip_and_unemployment %>%
pivot_wider(names_from = series, values_from = value,
id_cols = c("country","date"))
ip_and_unemployment_wide
```
### Some useful functions
Next, we will discuss some commonly used functions in the tidyverse (most of
these functions are from the `dyplr` package).
#### Selecting and dropping columns
A common task is to select or drop columns. This can be done very conveniently
with the function `select`. Selecting only the date and industrial production
columns:
```{r}
ip_and_unemployment_wide %>% select(date, ip)
```
We can also easily drop specific columns, e.g. only the date column:
```{r}
ip_and_unemployment_wide %>% select(-date)
```
#### Selecting rows
Similarly, we might want to select only specific rows that meet certain
conditions. This can be achieved with the filter function, e.g. we can select
data from only the UK or from both the UK and France:
```{r}
ip_and_unemployment_wide %>% filter(country == "uk")
```
```{r}
ip_and_unemployment_wide %>% filter(country %in% c("uk", "france"))
```
Next, let us get an idea why repeated pipe operations are so useful for data
analysis and readability. Say our goal is to determine the largest month to
month contraction in industrial production for the UK during the Corona crisis
so far. This can e.g. be achieved by combining filter and select functions:
```{r}
ip_and_unemployment_wide %>%
filter(country == "uk") %>%
filter(ip == min(ip)) %>%
select(date, ip)
```
The strongest contraction in UK industrial production was in April 2020 that saw
around a 20% decline in industrial production from the previous month.
Similarly, we can determine the timing and magnitude of the largest US
unemployment rate during the sample:
```{r}
ip_and_unemployment_wide %>%
filter(country == "us") %>%
filter(unemployment == max(unemployment)) %>%
select(date, unemployment)
```
In April 2020, the US had an unemployment rate of 14.7. When we select the value
from the month before, we see that this was an extremely sharp month to month
rise:
```{r}
ip_and_unemployment_wide %>%
filter(country == "us") %>%
filter(date == "01.03.2020") %>%
select(date, unemployment)
```
#### Summary statistics
Another frequent goal is to compute summary statistics. This can be done with
the `summarise` function, the following e.g. depicts the mean and standard
deviation of UK industrial production percentage changes since January 2019:
```{r}
ip_and_unemployment_wide %>%
filter(country == "uk") %>%
select(ip) %>%
drop_na() %>%
summarise(uk_ip_mean = mean(ip), uk_ip_sd = sd(ip), uk_observations = n())
```
#### Grouping
Next, a very useful function is `group_by()` which creates groups within the
data frame:
```{r}
group_by(ip_and_unemployment_wide, country)
```
This can then be used to e.g. obtain summary statistics for each of these groups.
```{r}
ip_and_unemployment_wide %>%
group_by(country) %>%
select(country, unemployment) %>%
drop_na() %>%
summarise(unemployment_mean = mean(unemployment),
unemployment_sd = sd(unemployment), observations = n())
```
#### Creating new variables
In some cases we might also want to add transformations of variables or features
to the data frame. This can be done with the command `mutate()`. For example, we
might be interested in the percentage change of US unemployment, not just its level.
```{r}
ip_and_unemployment_wide <- ip_and_unemployment_wide %>% group_by(country) %>%
mutate(unemployment_percentage_change = (unemployment/lag(unemployment) - 1) * 100) %>%
ungroup()
ip_and_unemployment_wide
```
__Question:__ Why did we need `group_by` here?
When using such an approach with `lag()` or `lead()`, it is important that the
observations in the dataset are sorted chronologically. In the datasets here,
this should already be given, however, you might frequently encounter datasets
where it is not the case. In such datasets (and in fact generally) it is key to
transform date variables from characters into proper date formats which can be
used in operations such as sorting. This is left as an exercise here, for a
discussion see <https://r4ds.had.co.nz/dates-and-times.html>. Dates can then
e.g. be sorted like other values with the function `arrange()`.
### References
- R for Data Science by Grolemund and Wickham (<https://r4ds.had.co.nz/>)
- For a more in-depth discussion than in this file, also see Garrett Grolemund's
great video series of key tidyverse functions to process data (note that the pivot
commands are called gather and spread as these videos discuss a slightly older
version of the package) (<https://www.youtube.com/watch?v=jOd65mR1zfw&list=PL9HYL-VRX0oQOWAFoKHFQAsWAI3ImbNPk>).