-
Notifications
You must be signed in to change notification settings - Fork 225
/
03-exercises-tabular-data-solution.Rmd
executable file
·252 lines (197 loc) · 8.62 KB
/
03-exercises-tabular-data-solution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "MY472 - Week 2: Seminar exercises in tabular data"
author: "Ryan Hübert (AT 2024)"
date: "October 2024"
output: html_document
---
**Attribution statement:** _The following teaching materials have been
iteratively developed by current and former instructors, including: Martin
Lukac, Patrick Gildersleve, and Ryan Hübert._
In these exercises, you will use the `tidyverse` package to work with the
dataset called `ip_and_unemployment.csv` that we used in lectures.
Start by creating a new GitHub repository called `seminar02`, which you should
make private and include a README and a `.gitignore` file with the `R` template.
Using the command line (Terminal on a Mac or Git Bash on a Windows PC), clone
your repository to your computer, putting it inside the folder where you would
like to store your local copies of your repositories for this course.
Then, navigate into your newly cloned repository using the `cd` command.
Note: In the below, you should replace `<path>` with the path to whichever
folder you would like to store your local copy of your repo. You should replace
`<url>` with the HTTPS link to your GitHub repo.
```{bash, eval = FALSE}
cd <path>
git clone <url>
cd seminar02
```
Now, save this Rmd file, close it, and move it into your local copy of the
GitHub repo you just created and cloned. After you have done this, reopen this
file in RStudio. Also store copies of the two `csv` datasets in your repo folder.
Next, stage all the files you just added to your local repo. Note: the dot tells
git to stage all the files in the folder, not just a single file. The dot can be
dangerous because you may unintentionally stage files you do not want to have on
GitHub. Before staging all the files in the repo folder with the dot, check
what's in the folder by running the command `ls` in the terminal and check you
are only staging files you want to save to GitHub. Once you're satisfied stage
as follows.
```{bash, eval = FALSE}
git add .
```
Commit and push your changes to your repo, using "added seminar files" as your
commit message.
```{bash, eval = FALSE}
git commit -m "added seminar files"
git push
```
Check on the GitHub website that this file now appears in your repository.
Now, let's get to work with R. Start with setting up your workspace:
```{r setup}
# load in libraries:
suppressMessages(library(tidyverse))
# or just library(tidyverse)
# read ip_and_unemployment.csv
ip_and_unemployment <- read.csv("ip_and_unemployment.csv")
head(ip_and_unemployment)
```
What are the highest unemployment rates for France and Spain during the time of
the sample? What are the lowest values for industrial production for the two
countries? Make sure to delete NA values in only the time series of interest.
(_Optional_: can you create a function that would do this for any country?)
```{r q1}
# Q1 --------------------------------------------------------------------------
# data processing
ipu_clean <- ip_and_unemployment %>%
pivot_wider(id_cols = c("country", "date"),
names_from = "series", values_from = "value") # long to wide
# France
ipu_clean %>%
filter(country == "france") %>%
filter(unemployment == max(unemployment, na.rm = TRUE) |
ip == min(ip, na.rm = TRUE))
# need to add na.rm = TRUE, because ip contains NA and min() will return NA
# if there is one NA, unless na.rm = TRUE
# Spain
ipu_clean %>%
filter(country == "spain") %>%
filter(unemployment == max(unemployment, na.rm = TRUE) |
ip == min(ip, na.rm = TRUE))
# Optional --------------------------------------------------------------------
filter_worst_months <- function(x) {
filtered <- ipu_clean %>%
filter(country == x) %>%
filter(unemployment == max(unemployment, na.rm = TRUE) |
ip == min(ip, na.rm = TRUE))
return(filtered)
}
filter_worst_months("germany")
```
-----
How many non-NA monthly observations of industrial production exist for the
countries here. Can you determine this with the group_by and summarise
functions? (_Optional_: can you calculate the % of values that are non-NA?)
```{r q2}
# Q2 --------------------------------------------------------------------------
# Non-NA group and summarise
ipu_clean %>%
group_by(country) %>%
summarise(nonNA_ip = sum(!is.na(ip)),
nonNA_ue = sum(!is.na(unemployment)))
# Optional --------------------------------------------------------------------
ipu_clean %>%
group_by(country) %>%
summarise(nonNA_ip = sum(!is.na(ip)),
nonNA_ue = sum(!is.na(unemployment)),
nonNA_ip_pct = nonNA_ip / length(ip),
nonNA_ue_pct = nonNA_ue / length(unemployment))
```
-----
In data science and machine learning, it can sometimes increase the predictive
power of models to add transformations of existing variables. This is usually
done in the modelling step, but to practice using the `mutate` function, let's do
it here.
However, before you do this, save your changes so far and make sure both your
local git repository and the remote copy on GitHub reflects your work so far.
When you commit your changes, use "finished first half" as your commit message.
```{bash, eval = FALSE}
git add 03-exercises-tabular-data.Rmd
git commit -m "finished first half"
git push
```
Before you do the final analysis, create a branch of your git repository that
you will use to complete the final analysis. Call this branch `final` to capture
the idea that you are using it for your final analysis. Switch to the final
branch and double check you are on the final branch.
```{bash, eval = FALSE}
git branch final
git checkout final
git branch
```
(After you finish your final analysis below, then at the very end, you will
merge this new branch back into the `main` branch.)
Back to the R code. Add three new columns to the dataframe:
1. the square of the industrial production percentage change,
2. the natural logarithm of the unemployment rate, and
3. the interaction (i.e. the product) of industrial production percentage
change and unemployment rate.
(_Optional_: Calculate the year-to-year difference in each of unemployment rate
and industrial production for those months with available data for both time
series in 2019 and 2020. Note that there are several ways to do this
computation, some can be simpler than the solution sketch here)
```{r q3}
# Q3 --------------------------------------------------------------------------
# Data transformations with mutate
ipu_clean %>%
mutate(ip_sq = ip ^ 2,
unemployment_ln = log(unemployment),
ip_unemployment = ip * unemployment) %>%
head()
# Optional --------------------------------------------------------------------
library(lubridate)
(yeartoyear <- ipu_clean %>%
mutate(yr = year(dmy(date)),
mth = month(dmy(date))) %>%
select(-date) %>%
pivot_wider(id_cols = c("country", "mth"),
names_from = yr, names_prefix = "yr",
values_from = c("ip", "unemployment")) %>%
mutate(ip_yty = ip_yr2020 - ip_yr2019,
unemployment_yty = unemployment_yr2020 - unemployment_yr2019) %>%
select(country, mth, ip_yty, unemployment_yty) %>%
drop_na())
head(yeartoyear)
```
Now, save your final analysis make sure both your local git repository and the
remote copy on GitHub reflects your work so far. When you commit your changes,
use "finished second half" as your commit message.
```{bash, eval = FALSE}
git add 03-exercises-tabular-data.Rmd
git commit -m "finished second half"
git push --set-upstream origin final
git push
```
(Pop quiz: what is the third line above doing?)
Double check that all the work you did for the final analysis is not saved on
the `main` branch. **First, save and close this file.** Then switch to the
`main` branch as follows.
```{bash, eval = FALSE}
git checkout main
```
After you switch to the `main` branch, reopen this file and check that it no
longer contains the work you did while you were on the `final` branch.
Finally, merge the `final` branch back into the `main` branch by closing this
file then running the following command. After you run the command, reopen this
file (while you are on the `main` branch) and observe that the changes you made
on the `final` branch are now reflected on the `main` branch.
```{bash, eval = FALSE}
git merge final
```
Make sure that the changes that occurred from the merge are also reflected on
remote (on GitHub), by pushing while you are on `main`.
```{bash, eval = FALSE}
git push
```
Now delete the `final` branch now that you're done working with it. You can
delete the local copy of `final` using the following. Then you can navigate to
the branch on GitHub and delete it on the web.
```{bash, eval = FALSE}
git branch --delete final
```