R4DS_7.Rmd

---
title: "R4DS 7-Exploratory Data Analysis"
output:
  ioslides_presentation:
    incremental: yes
  beamer_presentation:
    incremental: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Exploratory Data Analysis (EDA)

## Exploratory Data Analysis (EDA)

1. Generate questions about your data.

2. Search for answers by visualising, transforming, and modelling your data.

3. Use what you learn to refine your questions and/or generate new questions.


## 7.1.1 Prerequisites

```{r}
library(tidyverse)
```

## 7.2 Questions

- EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.

- It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. 

- On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.

# 7.3 Variation

# 7.3.1 Visualising distributions
## Categorical var: 
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

## Categorical var:  
```{r}
diamonds %>% 
  count(cut)
```

## Continuous var:
```{r}
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
```

## Continuous var:  
```{r}
diamonds %>% 
  count(cut_width(carat, 0.5))
```

## Continuous var:

Zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.

```{r, eval=FALSE}
smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```

## Continuous var:

```{r, echo=FALSE}
smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```


## Continuous var: {.smaller}

Overlay multiple histograms in the same plot

```{r}
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
```

## 7.3.2 Typical values

Look for anything unexpected:

- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?

## 7.3.2 Typical values

As an example, the histogram (next slide) suggests several interesting questions:

- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
- Why are there no diamonds bigger than 3 carats?

## 7.3.2 Typical values {.smaller}

```{r}
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
```

## 7.3.2 Typical values

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

- How are the observations within each cluster similar to each other?

- How are the observations in separate clusters different from each other?

- How can you explain or describe the clusters?

- Why might the appearance of clusters be misleading?

## 7.3.2 Typical values {.smaller}

The histogram below shows the length (in mins) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 mins) and long eruptions (4-5 mins), but little in between.

```{r, echo=FALSE}
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)
```

# 7.3.3 Unusual values

## 7.3.3 Unusual values

- Outliers are observations that are unusual; data points that don’t seem to fit the pattern. 

- Sometimes outliers are data entry errors; other times outliers suggest important new science. 

- When you have a lot of data, outliers are sometimes difficult to see in a histogram. 

- For example, take the distribution of the y variable from the diamonds dataset. The only  evidence of outliers is the unusually wide limits on the x-axis.

## 7.3.3 Unusual values {.smaller}

```{r}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
```

## 7.3.3 Unusual values {.smaller}

```{r}
ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))  # zoom in
```

## 7.3.3 Unusual values {.smaller}

```{r}
unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y)
unusual
```

x,y,z variables measure three dimensions of diamonds. Impossible to have these values!

# 7.4 Missing values

## 7.4 Missing values

1.Drop the entire row with the strange values:

```{r}
diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))
```

2.Replace the unusual values with missing values. (a bit better)

```{r}
diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```

## 7.4 Missing values {.smaller}

```{r, out.width='80%'}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()            # show warning
```

## 7.4 Missing values {.smaller}

```{r, out.width='80%'}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)  # suppress the warnings
```

## 7.4 Missing values

- In `nycflights13::flights`, missing values in the dep_time variable indicate that the flight was cancelled. 

- So you might want to compare the scheduled departure times for cancelled and non-cancelled times. 

- You can do this by making a new variable with `is.na()`.

## 7.4 Missing values

```{r, eval = FALSE}
nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```

## 7.4 Missing values {.smaller}

```{r, echo=FALSE}
nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```

However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.

# 7.5 Covariation

## 7.5 Covariation

- If variation describes the behavior within a variable, covariation describes the behavior between variables. 

- Covariation is the tendency for the values of two or more variables to vary together in a related way. 

- The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

## 7.5.1 Categorical-continuous variable {.smaller}

```{r}
ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```

## 7.5.1 Categorical-continuous variable {.smaller}

It’s hard to see the difference in distribution because the overall counts differ so much:

```{r, out.width='90%'}
ggplot(diamonds) + geom_bar(mapping = aes(x = cut))
```

## 7.5.1 Categorical-continuous variable {.smaller}

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying **count**, we’ll display **density**:

```{r, out.width='60%'}
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```

There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!

## Boxplot {.smaller}

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. Each boxplot consists of:

- A box that stretches from the 25th percentile - 75th percentile of the distribution, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.

- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.

- A line (or whisker) that extends from each end of the box and goes to the
farthest non-outlier point in the distribution.

## Boxplot 

```{r, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/153b9af53b33918353fda9b691ded68cd7f62f51/5b616/images/eda-boxplot.png')
```

## Boxplot

```{r}
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```

## Boxplot

`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.

## Boxplot {.smaller}

For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:

```{r, out.width='80%'}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
```

## Boxplot {.smaller}

To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:

```{r, out.width='80%'}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```

## Boxplot {.smaller}

If you have long variable names, `geom_boxplot()` will work better if you flip it 90°.

```{r, out.width='80%'}
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```

## 7.5.2 Two categorical variables {.smaller}

```{r, out.width='90%'}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```

## 7.5.2 Two categorical variables {.smaller}

Another approach is to compute the count with `dplyr`:

```{r}
diamonds %>% 
  count(color, cut)
```

## 7.5.2 Two categorical variables {.smaller}

Then visualise with `geom_tile()` and the fill aesthetic:

```{r, out.width='70%'}
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))
```

## 7.5.3 Two continuous variables {.smaller}

```{r, out.width='90%'}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
```

## 7.5.3 Two continuous variables {.smaller}

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot.

```{r, out.width='80%'}
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
```

## 7.5.3 Two continuous variables {.smaller}

```{r, out.width='50%', fig.show='hold'}
ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))
```

## 7.5.3 Two continuous variables {.smaller}

Another option is to bin one continuous variable so it acts like a categorical variable.

```{r, out.width='80%'}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
```

## 7.5.3 Two continuous variables {.smaller}

Another approach is to display approximately the same number of points in each bin. 

```{r, out.width='80%'}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
```

## 7.6 Patterns and models

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

- Could this pattern be due to coincidence (i.e. random chance)?

- How can you describe the relationship implied by the pattern?

- How strong is the relationship implied by the pattern?

- What other variables might affect the relationship?

- Does the relationship change if you look at individual subgroups of the data?

## 7.5.3 Two continuous variables {.smaller}

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters.

```{r, out.width='70%'}
ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))
```

## Patterns

- Patterns provide one of the most useful tools for data scientists because they reveal covariation. 

- If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. 

- If two variables covary, you can use the values of one variable to make better predictions about the values of the second. 

- If the covariation is due to a **causal** relationship (a special case), then you can use the value of one variable to control the value of the second.

## Models

- Models are a tool for extracting patterns out of data. 

- For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. 

- It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. 

## Models {.smaller}

The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

```{r, eval=FALSE}
library(modelr)

mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))
```

## Models 

```{r, echo = FALSE}
library(modelr)

mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))
```

## Models {.smaller}

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

```{r, out.width='70%'}
ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))
```

# End of Chapter