week02/03-exercises-tabular-data-solution.Rmd

---
title: "MY472 - Week 2: Seminar exercises in tabular data"
author: "Ryan Hübert (AT 2024)"
date: "October 2024"
output: html_document
---

**Attribution statement:** _The following teaching materials have been 
iteratively developed by current and former instructors, including: Martin 
Lukac, Patrick Gildersleve, and Ryan Hübert._

In these exercises, you will use the `tidyverse` package to work with the 
dataset called `ip_and_unemployment.csv` that we used in lectures. 

Start by creating a new GitHub repository called `seminar02`, which you should
make private and include a README and a `.gitignore` file with the `R` template.

Using the command line (Terminal on a Mac or Git Bash on a Windows PC), clone 
your repository to your computer, putting it inside the folder where you would 
like to store your local copies of your repositories for this course. 
Then, navigate into your newly cloned repository using the `cd` command.

Note: In the below, you should replace `<path>` with the path to whichever 
folder you would like to store your local copy of your repo. You should replace 
`<url>` with the HTTPS link to your GitHub repo.

```{bash, eval = FALSE}
cd <path>
git clone <url>
cd seminar02
```

Now, save this Rmd file, close it, and move it into your local copy of the 
GitHub repo you just created and cloned. After you have done this, reopen this
file in RStudio. Also store copies of the two `csv` datasets in your repo folder.

Next, stage all the files you just added to your local repo. Note: the dot tells 
git to stage all the files in the folder, not just a single file. The dot can be 
dangerous because you may unintentionally stage files you do not want to have on 
GitHub. Before staging all the files in the repo folder with the dot, check 
what's in the folder by running the command `ls` in the terminal and check you 
are only staging files you want to save to GitHub. Once you're satisfied stage 
as follows.

```{bash, eval = FALSE}
git add .
```

Commit and push your changes to your repo, using "added seminar files" as your 
commit message.

```{bash, eval = FALSE}
git commit -m "added seminar files"
git push
```

Check on the GitHub website that this file now appears in your repository.

Now, let's get to work with R. Start with setting up your workspace:

```{r setup}
# load in libraries:
suppressMessages(library(tidyverse))
# or just library(tidyverse)

# read ip_and_unemployment.csv
ip_and_unemployment <- read.csv("ip_and_unemployment.csv")
head(ip_and_unemployment)

```
    
What are the highest unemployment rates for France and Spain during the time of 
the sample?  What are the lowest values for industrial production for the two 
countries? Make sure to delete NA values in only the time series of interest. 
(_Optional_: can you create a function that would do this for any country?)

```{r q1}
# Q1 --------------------------------------------------------------------------
# data processing
ipu_clean <- ip_and_unemployment %>%
  pivot_wider(id_cols = c("country", "date"), 
              names_from = "series", values_from = "value")  # long to wide

# France
ipu_clean %>%
  filter(country == "france") %>%
  filter(unemployment == max(unemployment, na.rm = TRUE) | 
           ip == min(ip, na.rm = TRUE)) 
    # need to add na.rm = TRUE, because ip contains NA and min() will return NA
    # if there is one NA, unless na.rm = TRUE

# Spain
ipu_clean %>%
  filter(country == "spain") %>%
  filter(unemployment == max(unemployment, na.rm = TRUE) | 
           ip == min(ip, na.rm = TRUE))

# Optional --------------------------------------------------------------------
filter_worst_months <- function(x) {
  filtered <- ipu_clean %>%
    filter(country == x) %>%
    filter(unemployment == max(unemployment, na.rm = TRUE) | 
             ip == min(ip, na.rm = TRUE))
  return(filtered)
}

filter_worst_months("germany")


```

-----

How many non-NA monthly observations of industrial production exist for the 
countries here. Can you  determine this with the group_by and summarise 
functions? (_Optional_: can you calculate the % of values that are non-NA?)

```{r q2} 
# Q2 --------------------------------------------------------------------------
# Non-NA group and summarise
ipu_clean %>%
  group_by(country) %>%
  summarise(nonNA_ip = sum(!is.na(ip)),
            nonNA_ue = sum(!is.na(unemployment)))

# Optional --------------------------------------------------------------------
ipu_clean %>%
  group_by(country) %>%
  summarise(nonNA_ip = sum(!is.na(ip)),
            nonNA_ue = sum(!is.na(unemployment)),
            nonNA_ip_pct = nonNA_ip / length(ip),
            nonNA_ue_pct = nonNA_ue / length(unemployment))
```

-----

In data science and machine learning, it can sometimes increase the predictive 
power of models to add transformations of existing variables. This is usually 
done in the modelling step, but to practice using the `mutate` function, let's do 
it here. 

However, before you do this, save your changes so far and make sure both your 
local git repository and the remote copy on GitHub reflects your work so far. 
When you commit your changes, use "finished first half" as your commit message.

```{bash, eval = FALSE}
git add 03-exercises-tabular-data.Rmd
git commit -m "finished first half"
git push
```

Before you do the final analysis, create a branch of your git repository that
you will use to complete the final analysis. Call this branch `final` to capture
the idea that you are using it for your final analysis. Switch to the final 
branch and double check you are on the final branch.

```{bash, eval = FALSE}
git branch final 
git checkout final
git branch
```

(After you finish your final analysis below, then at the very end, you will 
merge this new branch back into the `main` branch.)

Back to the R code. Add three new columns to the dataframe: 

  1. the square of the industrial production percentage change, 
  2. the natural logarithm of the unemployment rate, and 
  3. the interaction (i.e. the product) of industrial production percentage 
  change and unemployment rate.
  
(_Optional_: Calculate the year-to-year difference in each of unemployment rate 
and industrial production for those months with available data for both time 
series in 2019 and 2020. Note that there are several ways to do this 
computation, some can be simpler than the solution sketch here)

```{r q3}
# Q3 --------------------------------------------------------------------------
# Data transformations with mutate
ipu_clean %>%
  mutate(ip_sq = ip ^ 2,
         unemployment_ln = log(unemployment),
         ip_unemployment = ip * unemployment) %>%
  head()

# Optional --------------------------------------------------------------------
library(lubridate)
(yeartoyear <- ipu_clean %>%
  mutate(yr = year(dmy(date)),
         mth = month(dmy(date))) %>%
  select(-date) %>%
  pivot_wider(id_cols = c("country", "mth"),
              names_from = yr, names_prefix = "yr",
              values_from = c("ip", "unemployment")) %>%
  mutate(ip_yty = ip_yr2020 - ip_yr2019,
         unemployment_yty = unemployment_yr2020 - unemployment_yr2019) %>%
  select(country, mth, ip_yty, unemployment_yty) %>%
  drop_na())

head(yeartoyear)

```

Now, save your final analysis make sure both your local git repository and the 
remote copy on GitHub reflects your work so far. When you commit your changes, 
use "finished second half" as your commit message.

```{bash, eval = FALSE}
git add 03-exercises-tabular-data.Rmd
git commit -m "finished second half"
git push --set-upstream origin final
git push 
```

(Pop quiz: what is the third line above doing?)

Double check that all the work you did for the final analysis is not saved on 
the `main` branch. **First, save and close this file.** Then switch to the 
`main` branch as follows.

```{bash, eval = FALSE}
git checkout main 
```

After you switch to the `main` branch, reopen this file and check that it no 
longer contains the work you did while you were on the `final` branch. 

Finally, merge the `final` branch back into the `main` branch by closing this 
file then running the following command. After you run the command, reopen this 
file (while you are on the `main` branch) and observe that the changes you made
on the `final` branch are now reflected on the `main` branch.

```{bash, eval = FALSE}
git merge final
```

Make sure that the changes that occurred from the merge are also reflected on 
remote (on GitHub), by pushing while you are on `main`.

```{bash, eval = FALSE}
git push
```

Now delete the `final` branch now that you're done working with it. You can 
delete the local copy of `final` using the following. Then you can navigate to 
the branch on GitHub and delete it on the web.

```{bash, eval = FALSE}
git branch --delete final
```