---
output: html_document
editor_options:
  chunk_output_type: console
---
# Data transformation {#transform .r4ds-section}

## Introduction {#introduction-2 .r4ds-section}

```{r setup,message=FALSE,cache=FALSE}
library("nycflights13")
library("tidyverse")
```

## Filter rows with `filter()` {#filter-rows-with-filter .r4ds-section}

### Exercise 5.2.1 {.unnumbered .exercise data-number="5.2.1"}

<div class="question">
Find all flights that

1.  Had an arrival delay of two or more hours
1.  Flew to Houston (IAH or HOU)
1.  Were operated by United, American, or Delta
1.  Departed in summer (July, August, and September)
1.  Arrived more than two hours late, but didn’t leave late
1.  Were delayed by at least an hour, but made up over 30 minutes in flight
1.  Departed between midnight and 6 am (inclusive)

</div>

<div class="answer">

The answer to each part follows.

1.  Since the `arr_delay` variable is measured in minutes, find
    flights with an arrival delay of 120 or more minutes.

    ```{r ex-5.2.1-1, indent = 4}
    filter(flights, arr_delay >= 120)
    ```

1.  The flights that flew to Houston are those flights where the 
    destination (`dest`) is either "IAH" or "HOU".
    ```{r ex-5.2.1-2, indent=4}
    filter(flights, dest == "IAH" | dest == "HOU")
    ```
    However, using `%in%` is more compact and would scale to cases where 
    there were more than two airports we were interested in.
    ```{r ex-5.2.1-3, indent=4}
    filter(flights, dest %in% c("IAH", "HOU"))
    ```
    

1.  In the `flights` dataset, the column `carrier` indicates the airline, but it uses two-character carrier codes.
    We can find the carrier codes for the airlines in the `airlines` dataset.
    Since the carrier code dataset only has `r nrow(airlines)` rows, and the names
    of the airlines in that dataset are not exactly "United", "American", or "Delta",
    it is easiest to manually look up their carrier codes in that data.

    ```{r ex-5.2.1-4,indent=4}
    airlines
    ```

    The carrier code for Delta is `"DL"`, for American is `"AA"`, and for United is `"UA"`.
    Using these carrier codes, we check whether `carrier` is one of those.

    ```{r, indent=4}
    filter(flights, carrier %in% c("AA", "DL", "UA"))
    ```

1.  The variable `month` has the month, and it is numeric.
    So, the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).
    ```{r, indent=4}
    filter(flights, month >= 7, month <= 9)
    ```
    The `%in%` operator is an alternative. If the `:` operator is used to specify
    the integer range, the expression is readable and compact.
    ```{r, indent=4}
    filter(flights, month %in% 7:9)
    ```
    We could also use the `|` operator. However, the `|` does not scale to 
    many choices. 
    Even with only three choices, it is quite verbose.
    ```{r, indent=4}
    filter(flights, month == 7 | month == 8 | month == 9)
    ```
    We can also use the `between()` function as shown in [Exercise 5.2.2](#exercise-5.2.2).

1.  Flights that arrived more than two hours late, but didn’t leave late will 
    have an arrival delay of more than 120 minutes (`arr_delay > 120`) and 
    a non-positive departure delay (`dep_delay <= 0`).
    ```{r, indent=4}
    filter(flights, arr_delay > 120, dep_delay <= 0)
    ```

1.  If a flight was delayed by at least an hour, then `dep_delay >= 60`.
    If the flight didn't make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning `dep_delay == arr_delay`, or alternatively, `dep_delay - arr_delay == 0`.
    If it made up over 30 minutes in the air, then the arrival delay must be more than 30 minutes less than the departure delay, which is stated as `dep_delay - arr_delay > 30`.
    ```{r}
    filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
    ```

1.  Finding flights that departed between midnight and 6 a.m. is complicated by 
    the way in which times are represented in the data.  
    In `dep_time`, midnight is represented by `2400`, not `0`.
    You can verify this by checking the minimum and maximum of `dep_time`.
    ```{r}
    summary(flights$dep_time)
    ```
    This is an example of why it is always good to check the summary statistics of your data.
    Unfortunately, this means we cannot simply check that `dep_time <= 600`, because we also have
    to consider the special case of midnight.
    
    ```{r}
    filter(flights, dep_time <= 600 | dep_time == 2400)
    ```

    Alternatively, we could use the [modulo operator](https://en.wikipedia.org/wiki/Modulo_operation), `%%`. 
    The modulo operator returns the remainder of division.
    Let's see how this affects our times.
    ```{r}
    c(600, 1200, 2400) %% 2400
    ```

    Since `2400 %% 2400 == 0` and all other times are left unchanged,
    we can compare the result of the modulo operation to `600`.

    ```{r}
    filter(flights, dep_time %% 2400 <= 600)
    ```

    This filter expression is more compact, but its readability depends on the 
    familiarity of the reader with modular arithmetic.
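
As a quick sanity check on the last part, the two conditions can be compared on a small hand-made vector of times. The values in `toy_times` are illustrative and not taken from `flights`, so this is only a sketch, but it suggests that the modulo version selects the same elements as the explicit midnight check.

```{r}
# Hypothetical departure times in the HHMM format used by `flights`
toy_times <- c(1, 559, 600, 601, 1200, 2359, 2400)
# Explicit special case for midnight
explicit <- toy_times <= 600 | toy_times == 2400
# Modulo maps 2400 to 0 and leaves the other times unchanged
modulo <- toy_times %% 2400 <= 600
identical(explicit, modulo)
```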

</div>

### Exercise 5.2.2 {.unnumbered .exercise data-number="5.2.2"}

<div class="question">
Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
</div>

<div class="answer">

The expression `between(x, left, right)` is equivalent to `x >= left & x <= right`.

Of the answers in the previous question, we could simplify the statement of *departed in summer* (`month >= 7 & month <= 9`) using the `between()` function.
```{r}
filter(flights, between(month, 7, 9))
```
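
The equivalence can be checked on a made-up vector rather than on `flights`. This is a minimal sketch that assumes **dplyr** is installed; `toy_month` is a name introduced here for illustration. Both forms return `NA` for a missing input, and `filter()` drops rows where the condition is `NA`.

```{r}
library("dplyr")
# Toy month values, including a missing one
toy_month <- c(6, 7, 8, 9, 10, NA)
# between() is inclusive on both ends
via_between <- between(toy_month, 7, 9)
via_compare <- toy_month >= 7 & toy_month <= 9
identical(via_between, via_compare)
```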

</div>

### Exercise 5.2.3 {.unnumbered .exercise data-number="5.2.3"}

<div class="question">
How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
</div>

<div class="answer">

Find the rows of flights with a missing departure time (`dep_time`) using the `is.na()` function.
```{r}
filter(flights, is.na(dep_time))
```

Notably, the arrival time (`arr_time`) is also missing for these rows. These seem to be cancelled flights.

The output of the function `summary()` includes the number of missing values for all non-character variables.
```{r}
summary(flights)
```
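
To answer the "how many" part directly, one option is to count the missing values rather than read them off the printed output. The sketch below assumes **nycflights13** is installed; `missing_counts` is a name introduced here for illustration.

```{r}
library("nycflights13")
# Number of rows with a missing departure time
sum(is.na(flights$dep_time))
# Missing values per column, keeping only the columns that have any
missing_counts <- colSums(is.na(flights))
missing_counts[missing_counts > 0]
```

Unlike `summary()`, this approach also counts missing values in character columns.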

</div>

### Exercise 5.2.4 {.unnumbered .exercise data-number="5.2.4"}

<div class="question">
Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing?
Why is `FALSE & NA` not missing? Can you figure out the general rule?
(`NA * 0` is a tricky counterexample!)
</div>

<div class="answer">

```{r}
NA ^ 0
```

`NA ^ 0 == 1` since for all numeric values $x ^ 0 = 1$.

```{r}
NA | TRUE
```

`NA | TRUE` is `TRUE` because anything **or** `TRUE` is `TRUE`. 
If the missing value were `TRUE`, then `TRUE | TRUE == TRUE`,
and if the missing value were `FALSE`, then `FALSE | TRUE == TRUE`.

```{r}
NA & FALSE
```

The value of `NA & FALSE` is `FALSE` because anything **and** `FALSE` is always `FALSE`.
If the missing value were `TRUE`, then `TRUE & FALSE == FALSE`,
and if the missing value were `FALSE`, then `FALSE & FALSE == FALSE`.

```{r}
NA | FALSE
```

For `NA | FALSE`, the value is unknown since `TRUE | FALSE == TRUE`, but `FALSE | FALSE == FALSE`.

```{r}
NA & TRUE
```

For `NA & TRUE`, the value is unknown since `FALSE & TRUE == FALSE`, but `TRUE & TRUE == TRUE`.

```{r}
NA * 0
```

Since $x * 0 = 0$ for all finite numbers we might expect `NA * 0 == 0`, but that's not the case.
The reason that `NA * 0` is `NA` rather than `0` is that $0 \times \infty$ and $0 \times -\infty$ are undefined.
R represents undefined results as `NaN`, which is an abbreviation of "[not a number](https://en.wikipedia.org/wiki/NaN)".

```{r}
Inf * 0
-Inf * 0
```
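
One way to state the general rule: an operation on `NA` returns a non-missing value only when every possible value of the missing element would give the same result. The following is a rough illustration of that rule using a handful of candidate values. The vector `candidates` is an assumption made for this example, not an exhaustive list.

```{r}
# Candidate values a missing number could take, including the infinities
candidates <- c(-Inf, -1, 0, 1, Inf)
# Every candidate gives 1, so NA ^ 0 can safely be 1
candidates^0
# The results differ (0 for finite values, NaN for the infinities),
# so NA * 0 stays NA
candidates * 0
# Both possible logical values give TRUE, so NA | TRUE is TRUE
c(TRUE, FALSE) | TRUE
# The two possible values give different answers, so NA | FALSE is NA
c(TRUE, FALSE) | FALSE
```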

</div>

## Arrange rows with `arrange()` {#arrange-rows-with-arrange .r4ds-section}

### Exercise 5.3.1 {.unnumbered .exercise data-number="5.3.1"}

<div class="question">
How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).
</div>

<div class="answer">

The `arrange()` function puts `NA` values last.
```{r}
arrange(flights, dep_time) %>%
  tail()
```
Using `desc()` does not change that.
```{r}
arrange(flights, desc(dep_time))
```

To put `NA` values first, we can add an indicator of whether the column has a missing value.
Then we sort by the missing indicator column and the column of interest. 
For example, to sort the data frame by departure time (`dep_time`) in ascending order but `NA` values first, run the following.
```{r}
arrange(flights, desc(is.na(dep_time)), dep_time)
```
The `flights` data frame will first be sorted by `desc(is.na(dep_time))`.
Since `is.na(dep_time)` is `TRUE` when `dep_time` is missing and `FALSE` when it is not, and `TRUE > FALSE`, sorting it in descending order puts the rows with missing values of `dep_time` first.
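
The same idea on a toy tibble makes the ordering easy to inspect. This is a minimal sketch assuming **dplyr** is installed; `toy_na` and its values are made up for illustration.

```{r}
library("dplyr")
toy_na <- tibble(x = c(3, NA, 1, NA, 2))
# is.na(x) is TRUE for missing rows; desc() sorts TRUE before FALSE,
# and the second key sorts the remaining rows in ascending order
toy_sorted <- arrange(toy_na, desc(is.na(x)), x)
toy_sorted$x
```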

</div>

### Exercise 5.3.2 {.unnumbered .exercise data-number="5.3.2"}

<div class="question">
Sort flights to find the most delayed flights. Find the flights that left earliest.
</div>

<div class="answer">

```{r include=FALSE,purl=FALSE}
most_delayed <- filter(flights, dep_delay == max(dep_delay, na.rm = TRUE)) %>%
  mutate(date = lubridate::make_datetime(
    year, month, day,
    sched_dep_time %/% 100,
    sched_dep_time %% 100
  ))
left_earliest <- filter(flights, dep_delay == min(dep_delay, na.rm = TRUE)) %>%
  mutate(date = lubridate::make_datetime(
    year, month, day,
    sched_dep_time %/% 100,
    sched_dep_time %% 100
  ))
```

Find the most delayed flights by sorting the table by departure delay, `dep_delay`, in descending order.
```{r}
arrange(flights, desc(dep_delay))
```
The most delayed flight was `r most_delayed$carrier` `r most_delayed$flight`, `r most_delayed$origin` to `r most_delayed$dest`, which was scheduled to leave on `r format(most_delayed$date, "%B %d, %Y %H:%M")`.
Note that the departure time is given as `r most_delayed$dep_time`, which seems to be less than the scheduled departure time.
But the departure was delayed `r comma_int(most_delayed$dep_delay)` minutes, which is `r most_delayed$dep_delay %/% 60` hours, `r most_delayed$dep_delay %% 60` minutes.
The departure time is the day after the scheduled departure time.
Be happy that you weren't on that flight, and if you happened to have been on that flight and are reading this, I'm sorry for you.

Similarly, the earliest departing flight can be found by sorting `dep_delay` in ascending order.
```{r}
arrange(flights, dep_delay)
```
Flight `r left_earliest$carrier` `r left_earliest$flight` (`r left_earliest$origin` to `r left_earliest$dest`) scheduled to depart on `r format(left_earliest$date, "%B %d, %Y at %H:%M")`
departed `r comma_int(abs(left_earliest$dep_delay))` minutes early.

</div>

### Exercise 5.3.3 {.unnumbered .exercise data-number="5.3.3"}

<div class="question">
Sort flights to find the fastest flights.
</div>

<div class="answer">

There are actually two ways to interpret this question: one that can be solved by using `arrange()`, and a more complex interpretation that requires creation of a new variable using `mutate()`, which we haven't seen demonstrated before. 

The colloquial interpretation of "fastest" flight can be understood to mean "the flight with the shortest flight time". We can use `arrange()` to sort our data by the `air_time` variable to find the shortest flights:

```{r}
head(arrange(flights, air_time))
```

Another definition of the "fastest flight" is the flight with the highest average [ground speed](https://en.wikipedia.org/wiki/Ground_speed).
The ground speed is not included in the data, but it can be calculated from the `distance` and `air_time` of the flight.

```{r}
head(arrange(flights, desc(distance / air_time)))
```

<!-- note cannot use select() or mutate() in these answers since they are not introduced yet -->

</div>

### Exercise 5.3.4 {.unnumbered .exercise data-number="5.3.4"}

<div class="question">

Which flights traveled the longest?
Which traveled the shortest?

</div>

<div class="answer">

```{r include=FALSE,purl=FALSE}
longest <- filter(flights, distance == max(distance)) %>%
  select(carrier, flight, origin, dest) %>%
  distinct() %>%
  slice(1)
shortest <- filter(flights, distance == min(distance)) %>%
  select(carrier, flight, origin, dest) %>%
  distinct() %>%
  slice(1)
```
To find the longest flight, sort the flights by the `distance` column in descending order.
```{r}
arrange(flights, desc(distance))
```
The longest flight is `r longest$carrier` `r longest$flight`, `r longest$origin` to `r longest$dest`, which is `r comma_int(max(flights$distance))` miles.

To find the shortest flight, sort the flights by the `distance` in ascending order, which is the default sort order.
```{r}
arrange(flights, distance)
```
The shortest flight is `r shortest$carrier` `r shortest$flight`, `r shortest$origin` to `r shortest$dest`, which is only `r comma_int(min(flights$distance))` miles.
This is a flight between two of the New York area airports.
However, this flight is missing a departure time, so either it did not actually fly or there is a problem with the data.

The terms "longest" and "shortest" could also refer to the time of the flight instead of the distance.
Now the longest and shortest flights can be found by sorting by the `air_time` column.
The longest flights by airtime are the following.
```{r}
arrange(flights, desc(air_time))
```
The shortest flights by airtime are the following.
```{r}
arrange(flights, air_time)
```

</div>

## Select columns with `select()` {#select .r4ds-section}

### Exercise 5.4.1 {.unnumbered .exercise data-number="5.4.1"}

<div class="question">
Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.
</div>

<div class="answer">

These are a few ways to select columns.

-   Specify column names as unquoted variable names.
    ```{r}
    select(flights, dep_time, dep_delay, arr_time, arr_delay)
    ```

-   Specify column names as strings.
    ```{r}
    select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
    ```

-   Specify the column numbers of the variables.
    ```{r}
    select(flights, 4, 6, 7, 9)
    ```
    This works, but is not good practice for two reasons.
    First, the column location of variables may change, resulting in code that
    may continue to run without error, but produce the wrong answer.
    Second, the code is obfuscated, since it is not clear from the code which
    variables are being selected. What variable does column 6 correspond to?
    I just wrote the code, and I've already forgotten.

-   Specify the names of the variables with a character vector and `any_of()` or `all_of()`.
    ```{r}
    select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
    ```
    ```{r}
    select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
    ```
    This is useful because the names of the variables can be stored in a 
    variable and passed to `all_of()` or `any_of()`.
    ```{r}
    variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
    select(flights, all_of(variables))
    ```
    These two functions replace the deprecated function `one_of()`.

-   Selecting the variables by matching the start of their names using `starts_with()`.
    ```{r}
    select(flights, starts_with("dep_"), starts_with("arr_"))
    ```

-   Selecting the variables using regular expressions with `matches()`.
    Regular expressions provide a flexible way to match string patterns
    and are discussed in the [Strings](https://r4ds.had.co.nz/strings.html) chapter.
    ```{r}
    select(flights, matches("^(dep|arr)_(time|delay)$"))
    ```

-   Specify the names of the variables with a character vector and use the bang-bang operator (`!!`). 
    ```{r}
    variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
    select(flights, !!variables)
    ```
    This and the following answers use the features of **tidy evaluation** not covered in R4DS but covered in the [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette.

-   Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.
    ```{r}
    variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
    select(flights, !!!variables)
    ```  

-   Specify the unquoted names of the variables in a list using `syms()` and use the bang-bang-bang operator.
    ```{r}
    variables <- syms(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
    select(flights, !!!variables)
    ```        

Some things that **don't** work are:

-   Matching the ends of their names using `ends_with()` since this will incorrectly
    include other variables. For example,
    ```{r}
    select(flights, ends_with("arr_time"), ends_with("dep_time"))
    ```

-   Matching the names using `contains()` since there is not a pattern that can
    include all these variables without incorrectly including others.
    ```{r}
    select(flights, contains("_time"), contains("arr_"))
    ```

</div>

### Exercise 5.4.2 {.unnumbered .exercise data-number="5.4.2"}

<div class="question">
What happens if you include the name of a variable multiple times in a `select()` call?
</div>

<div class="answer">

The `select()` call ignores the duplication. Any duplicated variables are only included once, in the first location they appear. The `select()` function does not raise an error or warning or print any message if there are duplicated variables.
```{r}
select(flights, year, month, day, year, year)
```

This behavior is useful because it means that we can use `select()` with `everything()` 
in order to easily change the order of columns without having to specify the names 
of all the columns.
```{r}
select(flights, arr_delay, everything())
```

</div>

### Exercise 5.4.3 {.unnumbered .exercise data-number="5.4.3"}

<div class="question">
What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
</div>

<div class="answer">

The `one_of()` function selects variables with a character vector rather than unquoted variable name arguments.
This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
```

In the most recent versions of **dplyr**, `one_of()` has been deprecated in favor of two functions: `all_of()` and `any_of()`.
These functions behave similarly if all variables are present in the data frame.
```{r}
select(flights, any_of(vars))
```
```{r}
select(flights, all_of(vars))
```

These functions differ in their strictness. 
The function `all_of()` will raise an error if one of the variable names is not present, while `any_of()` will ignore it.
```{r error=TRUE}
vars2 <- c("year", "month", "day", "variable_not_in_the_dataframe")
select(flights, all_of(vars2))
```
```{r}
select(flights, any_of(vars2))
```
The deprecated function `one_of()` will raise a warning if an unknown column is encountered.
```{r}
select(flights, one_of(vars2))
```

In the most recent versions of **dplyr**, the `one_of()` function is less necessary due to new behavior in the selection functions.
The `select()` function can now accept the name of a vector containing the variable names you wish to select:
```{r}
select(flights, vars)
```
However, there is a problem with the previous code.
The name `vars` could refer to a column named `vars` in `flights` or a different variable named `vars`.
What the code does will depend on whether or not `vars` is a column in `flights`.
If `vars` were a column in `flights`, then that code would only select the `vars` column.
For example:
```{r}
flights <- mutate(flights, vars = 1)
select(flights, vars)
```
```{r include=FALSE}
flights <- select(flights, -vars)
```

However, if `vars` is not a column in `flights`, as is the case here, then `select()` will use the value of the variable `vars` and select those columns.
If the variable has the same name as a column, or to ensure that it will not conflict with the names of the columns in the data frame, use the `!!!` (bang-bang-bang) operator.
```{r}
select(flights, !!!vars)
```
This behavior, which is used by many **tidyverse** functions, is an example of what is called non-standard evaluation (NSE) in R. See the **dplyr** vignette, [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html), for more information on this topic.
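
Wrapping the external character vector in `all_of()` (or `any_of()`), as in the earlier examples, is another way to avoid this ambiguity. The sketch below uses a toy tibble, `toy_df`, that deliberately has a column named `cols_wanted`. Both names are assumptions for illustration, and the code assumes **dplyr** is installed.

```{r}
library("dplyr")
# A toy tibble with a column that shares its name with an external vector
toy_df <- tibble(year = 2013, month = 1, cols_wanted = "a column")
cols_wanted <- c("year", "month")
# A bare name that matches a column selects that column
names(select(toy_df, cols_wanted))
# all_of() looks the vector up outside the data frame
names(select(toy_df, all_of(cols_wanted)))
```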

</div>

### Exercise 5.4.4 {.unnumbered .exercise data-number="5.4.4"}

<div class="question">
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
</div>

<div class="answer">

```{r}
select(flights, contains("TIME"))
```

The default behavior for `contains()` is to ignore case.
This may or may not surprise you.
If this behavior does not surprise you, that could be why it is the default.
Users searching for variable names probably have a better sense of the letters
in the variable than their capitalization.
A second, technical, reason is that dplyr works with more than R data frames.
It can also work with a variety of [databases](https://db.rstudio.com/dplyr/).
Some of these database engines have case insensitive column names, so making functions that match variable names
case insensitive by default will make the behavior of
`select()` consistent regardless of whether the table is
stored as an R data frame or in a database.

To change the behavior add the argument `ignore.case = FALSE`.

```{r}
select(flights, contains("TIME", ignore.case = FALSE))
```

</div>

## Add new variables with `mutate()` {#add-new-variables-with-mutate .r4ds-section}

### Exercise 5.5.1 {.unnumbered .exercise data-number="5.5.1"}

<div class="question">
Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
</div>

<div class="answer">

To convert `dep_time` to minutes since midnight, take the integer part of `dep_time` divided by 100 to get the hours since midnight, multiply that by 60, and add the remainder of `dep_time` divided by 100.
For example, `1504` represents 15:04 (or 3:04 PM), which is `r 15 * 60 + 4` minutes after midnight.
To generalize this approach, we need a way to split out the hour-digits from the minute-digits.
Dividing by 100 and discarding the remainder using the integer division operator, `%/%`, gives us the following.
```{r}
1504 %/% 100
```
Instead of `%/%`, we could also use `/` along with `trunc()` or `floor()`, but `round()` would not work because it rounds times with 50 or more minutes up to the next hour.
To get the minutes, instead of discarding the remainder of the division by `100`,
we only want the remainder.
So we use the modulo operator, `%%`, discussed in the [Other Useful Functions](https://r4ds.had.co.nz/transform.html#select) section.
```{r}
1504 %% 100
```
Now, we can combine the hours (multiplied by 60 to convert them to minutes) and
minutes to get the number of minutes after midnight.
```{r}
1504 %/% 100 * 60 + 1504 %% 100
```

There is one remaining issue. Midnight is represented by `2400`, which would 
correspond to `1440` minutes since midnight, but it should correspond to `0`.
After converting all the times to minutes after midnight, `x %% 1440` will convert
`1440` to zero while keeping all the other times the same.
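
A quick check of that adjustment on a few hand-picked times (the vector `toy_hhmm` is made up for this sketch):

```{r}
# Times in the HHMM format used by `flights`, including midnight as 2400
toy_hhmm <- c(1, 1504, 2359, 2400)
# Hours * 60 + minutes gives 1440 for midnight
raw_mins <- toy_hhmm %/% 100 * 60 + toy_hhmm %% 100
raw_mins
# Taking the result modulo 1440 maps only that value to 0
raw_mins %% 1440
```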

Now we will put it all together.
The following code creates a new data frame `flights_times` with columns `dep_time_mins` and `sched_dep_time_mins`.
These columns convert `dep_time` and `sched_dep_time`, respectively, to minutes since midnight.
```{r}
flights_times <- mutate(flights,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 +
    sched_dep_time %% 100) %% 1440
)
# view only relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)
```

Looking ahead to the [Functions](https://r4ds.had.co.nz/functions.html) chapter,
this is precisely the sort of situation in which it would make sense to write 
a function to avoid copying and pasting code.
We could define a function `time2mins()`, which converts a vector of times
from the format used in `flights` to minutes since midnight.
```{r}
time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}
```
Using `time2mins`, the previous code simplifies to the following.
```{r}
flights_times <- mutate(flights,
  dep_time_mins = time2mins(dep_time),
  sched_dep_time_mins = time2mins(sched_dep_time)
)
# show only the relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)
```

</div>

### Exercise 5.5.2 {.unnumbered .exercise data-number="5.5.2"}

<div class="question">

Compare `air_time` with `arr_time - dep_time`. 
What do you expect to see? 
What do you see? 
What do you need to do to fix it?

</div>

<div class="answer">

I expect that `air_time` is the difference between the arrival (`arr_time`) and departure times (`dep_time`).
In other words, `air_time = arr_time - dep_time`.

To check this relationship, I'll first need to convert the times to a form more amenable to arithmetic operations using the same calculations as the [previous exercise](#exercise-5.5.1).
```{r}
flights_airtime <-
  mutate(flights,
    dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
    air_time_diff = air_time - arr_time + dep_time
  )
```

So, does `air_time = arr_time - dep_time`?
If so, there should be no flights with non-zero values of `air_time_diff`.
```{r}
nrow(filter(flights_airtime, air_time_diff != 0))
```

It turns out that there are many flights for which `air_time != arr_time - dep_time`.
Other than data errors, I can think of two reasons why `air_time` would not equal `arr_time - dep_time`.

1.  The flight passes midnight, so `arr_time < dep_time`.
    In these cases, the difference in airtime should be off by 24 hours (1,440 minutes).

1.  The flight crosses time zones, and the total air time will be off by hours (multiples of 60).
    All flights in `flights` departed from New York City and are domestic flights in the US.
    This means that flights will all be to the same or more westerly time zones.
    Given the time zones in the US, the differences due to time zone should be 60 minutes (Central),
    120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii).
    
Both of these explanations have clear patterns that I would expect to see if they 
were true. 
In particular, in both cases, since time zones and crossing midnight only affect the hour part of the time, all values of `air_time_diff` should be divisible by 60.
I'll visually check this hypothesis by plotting the distribution of `air_time_diff`.
If those two explanations are correct, the distribution of `air_time_diff` should comprise only spikes at multiples of 60.
```{r}
ggplot(flights_airtime, aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)
```
This is not the case.
While the distribution of `air_time_diff` has modes at multiples of 60 as hypothesized,
it shows that there are many flights in which the difference between air time and local arrival and departure times is not divisible by 60.

Let's also look at flights with Los Angeles as a destination.
The discrepancy should be 180 minutes.
```{r}
ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)
```

To fix these time-zone issues, I would want to convert all the times to a date-time to handle overnight flights, and from local time to a common time zone, most likely [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), to handle flights crossing time-zones.
The `tzone` column of `nycflights13::airports` gives the time-zone of each airport.
See the ["Dates and Times"](https://r4ds.had.co.nz/dates-and-times.html) chapter for an introduction to working with date and time data.
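
For the midnight-crossing case alone, a lighter alternative to full date-times is to take the elapsed minutes modulo 1,440. The sketch below uses made-up times in a single time zone. It is not the approach used in this answer, and it does nothing about time zones.

```{r}
# Hypothetical flight departing 23:30 and arriving 00:45,
# both expressed as minutes since midnight
toy_dep <- 23 * 60 + 30
toy_arr <- 0 * 60 + 45
# The naive difference is negative because the arrival is on the next day
toy_arr - toy_dep
# In R, %% with a positive divisor returns a non-negative remainder
(toy_arr - toy_dep) %% 1440
```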

But that still leaves the other differences unexplained. 
So what else might be going on? 
There seem to be too many problems for this to be data entry problems, so I'm probably missing something. 
So, I'll reread the documentation to make sure that I understand the definitions of `arr_time`, `dep_time`, and
`air_time`. 
The documentation contains a link to the source of the `flights` data, <https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236>.
This documentation shows that the `flights` data does not contain the variables `TaxiIn`, `TaxiOut`, `WheelsOn`, and `WheelsOff`.
It appears that the `air_time` variable refers to flight time, which is defined as the time between wheels-off (take-off) and wheels-on (landing).
But the flight time does not include time spent on the runway taxiing to and from gates.
With this new understanding of the data, I now know that the relationship between `air_time`, `arr_time`, and `dep_time` is `air_time <= arr_time - dep_time`, supposing that `arr_time` and `dep_time` are in the same time zone.

</div>

### Exercise 5.5.3 {.unnumbered .exercise data-number="5.5.3"}

<div class="question">
Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
</div>

<div class="answer">

I would expect the departure delay (`dep_delay`) to be equal to the difference between the actual departure time (`dep_time`) and the scheduled departure time (`sched_dep_time`):
`dep_time - sched_dep_time = dep_delay`.

As with the previous question, the first step is to convert all times to the 
number of minutes since midnight.
The column, `dep_delay_diff`, is the difference between the column, `dep_delay`, and 
departure delay calculated directly from the scheduled and actual departure times.
```{r}
flights_deptime <-
  mutate(flights,
    dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    sched_dep_time_min = (sched_dep_time %/% 100 * 60 +
      sched_dep_time %% 100) %% 1440,
    dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min
  )
```
Does `dep_delay_diff` equal zero for all rows? 
```{r}
filter(flights_deptime, dep_delay_diff != 0)
```
No. Unlike the last question, time zones are not an issue since we are only 
considering departure times.[^daylight]
However, the discrepancies could be because a flight was scheduled to depart 
before midnight, but was delayed after midnight.
All of these discrepancies are exactly equal to 1440 (24 hours), and the flights with these discrepancies were scheduled to depart later in the day.
```{r}
ggplot(
  filter(flights_deptime, dep_delay_diff > 0),
  aes(y = sched_dep_time_min, x = dep_delay_diff)
) +
  geom_point()
```
Thus the only cases in which the departure delay is not equal to the difference
between the scheduled and actual departure times are due to a quirk in how these
columns were stored.
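
To see where the 1,440 comes from, consider a made-up flight scheduled for 23:59 that actually left at 00:30 the next day. These times are illustrative and not taken from the data.

```{r}
# Minutes since midnight for the scheduled and actual departure
toy_sched <- 23 * 60 + 59
toy_actual <- 0 * 60 + 30
# The true delay is 31 minutes
toy_delay <- 31
# Same formula as dep_delay_diff: the missing day shows up as 1440
toy_delay - toy_actual + toy_sched
```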

</div>

### Exercise 5.5.4 {.unnumbered .exercise data-number="5.5.4"}

<div class="question">

Find the 10 most delayed flights using a ranking function.
How do you want to handle ties? 
Carefully read the documentation for `min_rank()`.

</div>

<div class="answer">

The **dplyr** package provides multiple functions for ranking, which differ in how they handle tied values: `row_number()`, `min_rank()`, `dense_rank()`.
To see how they work, let's create a data frame with duplicate values in a vector and see how ranking functions handle ties.
```{r}
rankme <- tibble(
  x = c(10, 5, 1, 5, 5)
)
```
<!-- don't use 1-3 in order to avoid confusion with the rank function itself,
     don't have them in order -->

```{r}
rankme <- mutate(rankme,
  x_row_number = row_number(x),
  x_min_rank = min_rank(x),
  x_dense_rank = dense_rank(x)
)
arrange(rankme, x)
```

The function `row_number()` assigns each element a unique value.
The result is equivalent to the index (or row) number of each element after sorting the vector, hence its name.

The `min_rank()` and `dense_rank()` functions assign tied values the same rank, but differ in how they assign values to the next rank.
For each set of tied values the `min_rank()` function assigns a rank equal to the number of values less than that tied value plus one.
In contrast, the `dense_rank()` function assigns a rank equal to the number of distinct values less than that tied value plus one.
To see the difference between `dense_rank()` and `min_rank()` compare the value of `rankme$x_min_rank` and `rankme$x_dense_rank` for `x = 10`.

If I had to choose one for presenting rankings to someone else, I would use `min_rank()` since its results correspond to the most common usage of rankings in sports or other competitions.
In the code below, I use all three functions, but since there are no ties in the top 10 flights, the results don't differ.

```{r}
flights_delayed <- mutate(flights, 
                          dep_delay_min_rank = min_rank(desc(dep_delay)),
                          dep_delay_row_number = row_number(desc(dep_delay)),
                          dep_delay_dense_rank = dense_rank(desc(dep_delay))
                          )
flights_delayed <- filter(flights_delayed, 
                          !(dep_delay_min_rank > 10 | dep_delay_row_number > 10 |
                              dep_delay_dense_rank > 10))
flights_delayed <- arrange(flights_delayed, dep_delay_min_rank)
print(select(flights_delayed, month, day, carrier, flight, dep_delay, 
             dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank), 
      n = Inf)
```

In addition to the functions covered here, the `rank()` function provides several more ways of ranking elements.

There are other ways to solve this problem that do not use ranking functions.
To select the top 10, sort values with `arrange()` and select the top values with `slice()`:
```{r}
flights_delayed2 <- arrange(flights, desc(dep_delay))
flights_delayed2 <- slice(flights_delayed2, 1:10)
select(flights_delayed2,  month, day, carrier, flight, dep_delay)
```
Alternatively, we could use `top_n()`.
```{r}
flights_delayed3 <- top_n(flights, 10, dep_delay)
flights_delayed3 <- arrange(flights_delayed3, desc(dep_delay))
select(flights_delayed3, month, day, carrier, flight, dep_delay)
```
 
The `arrange()` and `slice()` approach always selects exactly 10 rows, even if there are tied values, whereas `top_n()` includes additional rows when values are tied at the cutoff.
Ranking functions provide more control over how tied values are handled.
Slicing provides the 10 rows with the largest values of `dep_delay`, while ranking functions can provide all rows with the 10 largest values of `dep_delay`.
If there are no ties, these approaches are equivalent.
If there are ties, then which is more appropriate depends on the use.
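
To make the effect of ties concrete, here is a small sketch on a made-up data frame (the names `toy_ties`, `top2_slice`, and `top2_rank` are only for illustration).
With two values tied for second place, slicing returns two rows while filtering on `min_rank()` returns three.
```{r}
library(dplyr)

# Made-up delays with a tie for second place.
toy_ties <- tibble(id = 1:5, delay = c(30, 20, 20, 10, 5))

# Slicing keeps exactly two rows and breaks the tie arbitrarily.
top2_slice <- toy_ties %>%
  arrange(desc(delay)) %>%
  slice(1:2)

# Filtering on min_rank() keeps every row tied at rank 2.
top2_rank <- toy_ties %>%
  filter(min_rank(desc(delay)) <= 2)

c(slice = nrow(top2_slice), rank = nrow(top2_rank))
```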

</div>

### Exercise 5.5.5 {.unnumbered .exercise data-number="5.5.5"}

<div class="question">
What does `1:3 + 1:10` return? Why?
</div>

<div class="answer">

The code given in the question returns the following.
```{r warning=TRUE}
1:3 + 1:10
```
This is equivalent to the following.
```{r}
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10)
```
When adding two vectors, R recycles the shorter vector's values to create a vector of the same length as the longer vector.
The code also raises a warning that the length of the longer vector is not a multiple of the length of the shorter vector.
R raises this warning because recycling a vector a non-whole number of times is often unintended and may be a bug.
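
One way to see the recycling explicitly is to build the recycled vector with `rep_len()` and add it to `1:10`.
This sketch only illustrates the rule; the name `recycled` is mine.
```{r}
# Repeat 1:3 until it has 10 elements, truncating the last repetition.
recycled <- rep_len(1:3, length.out = 10)
recycled
# Adding the explicitly recycled vector gives the same result without a warning.
identical(recycled + 1:10, suppressWarnings(1:3 + 1:10))
```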

</div>

### Exercise 5.5.6 {.unnumbered .exercise data-number="5.5.6"}

<div class="question">
What trigonometric functions does R provide?
</div>

<div class="answer">

All trigonometric functions are described in a single help page, named `Trig`.
You can open the documentation for these functions with `?Trig` or by using `?` with any of the following functions, for example: `?sin`.

R provides functions for the three primary trigonometric functions: sine (`sin()`), cosine (`cos()`), and tangent (`tan()`).
The input angles to all these functions are in [radians](https://en.wikipedia.org/wiki/Radian).
```{r}
x <- seq(-3, 7, by = 1 / 2)
sin(pi * x)
cos(pi * x)
tan(pi * x)
```

In the previous code, I used the variable `pi`.
R provides the variable `pi`, which is set to the value of the mathematical constant $\pi$.[^pi]
```{r}
pi
```
Although R provides the `pi` variable, there is nothing preventing a user from changing its value.
For example, I could redefine `pi` to [3.14](https://en.wikipedia.org/wiki/Indiana_Pi_Bill) or 
any other value.
```{r}
pi <- 3.14
pi
pi <- "Apple"
pi
```
For that reason, if you are using the built-in `pi` variable in computations and are paranoid, you may want to always reference it as `base::pi`.
```{r}
base::pi
```
```{r include=FALSE}
# reset value of pi
rm(pi)
```

In the previous code block, since the angles were in radians, I wrote them as $\pi$ times some number.
Since it is often easier to write radians as multiples of $\pi$, R provides some convenience functions that do that.
The function `sinpi(x)` is equivalent to `sin(pi * x)`.
The functions `cospi()` and `tanpi()` are similarly defined for the cos and tan functions, respectively.
```{r}
sinpi(x)
cospi(x)
tanpi(x)
```
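
Beyond convenience, these functions avoid the rounding error that comes from representing $\pi$ as a floating-point number.
A short sketch of the difference:
```{r}
# sin(pi) is a tiny nonzero number because pi is stored only approximately.
sin(base::pi)
# sinpi() and cospi() return exact zeros at these arguments.
sinpi(1)
cospi(0.5)
```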

R provides the inverse trigonometric functions arc-cosine (`acos()`), arc-sine (`asin()`), and arc-tangent (`atan()`).
```{r}
x <- seq(-1, 1, by = 1 / 4)
acos(x)
asin(x)
atan(x)
```

Finally, R provides the function `atan2()`.
Calling `atan2(y, x)` returns the angle between the x-axis and the vector from `(0,0)` to `(x, y)`.
```{r}
atan2(c(1, 0, -1, 0), c(0, 1, 0, -1))
```
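
The advantage of `atan2(y, x)` over `atan(y / x)` is that it uses the signs of both arguments to find the correct quadrant.
The sketch below uses a helper, `to_degrees()`, that I define only for this illustration.
```{r}
# Helper for this example only: convert radians to degrees.
to_degrees <- function(radians) radians * 180 / base::pi

# The point (x, y) = (-1, 1) lies in the second quadrant.
to_degrees(atan2(1, -1))
# atan() sees only the ratio y / x = -1, so it reports a fourth-quadrant angle.
to_degrees(atan(1 / -1))
```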

</div>

## Grouped summaries with `summarise()` {#grouped-summaries-with-summarise .r4ds-section}

### Exercise 5.6.1 {.unnumbered .exercise data-number="5.6.1"}

<div class="question">
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. 
Consider the following scenarios:

-   A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
-   A flight is always 10 minutes late.
-   A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
-   99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Which is more important: arrival delay or departure delay?

</div>

<div class="answer">

What this question gets at is a fundamental question of data analysis: the cost function.
As analysts, we are interested in flight delays because they are costly to passengers.
But it is worth thinking carefully about how they are costly and using that information in ranking and measuring these scenarios.

In many scenarios, arrival delay is more important.
In most cases, arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings.
If a departure is delayed without affecting the arrival time, the delay will not disrupt those plans, nor will it affect the total time spent traveling.
This delay could be beneficial, if less time is spent in the cramped confines of the airplane itself, or a negative, if that delayed time is still spent in the cramped confines of the airplane on the runway.

Variation in arrival time is worse than consistency.
If a flight is always 30 minutes late and that delay is known, then it is as if the arrival time is that delayed time.
The traveler could easily plan for this. 
But higher variation in flight times makes it harder to plan.
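
As a rough sketch, the four scenarios in the question can be simulated as vectors of 100 delays each and compared on several summaries.
The vectors and the choice of summaries are my own illustration, not part of the original exercise.
```{r}
# Each scenario is represented by 100 hypothetical delays in minutes.
scenarios <- list(
  early_late_15 = rep(c(-15, 15), times = 50),
  always_10_late = rep(10, times = 100),
  early_late_30 = rep(c(-30, 30), times = 50),
  rarely_2h_late = c(rep(0, times = 99), 120)
)

# Five different ways to summarize the delays of each scenario.
scenario_summary <- sapply(scenarios, function(delay) {
  c(
    mean = mean(delay),
    median = median(delay),
    sd = sd(delay),
    prop_late = mean(delay > 0),
    max = max(delay)
  )
})
scenario_summary
```
The first and third scenarios have the same mean and median but different standard deviations, which is why a measure of spread is needed alongside a measure of the typical delay.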

<!-- 
**TODO** (Add a better explanation and some examples)
-->

</div>

### Exercise 5.6.2 {.unnumbered .exercise data-number="5.6.2"}

<div class="question">

Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).
</div>

<div class="answer">

```{r not_cancelled}
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
```

The first expression is the following.

```{r}
not_cancelled %>% 
  count(dest)
```

The `count()` function counts the number of instances within each group of variables.
Instead of using the `count()` function, we can combine the `group_by()` and `summarise()` verbs.

```{r}
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))
```

An alternative method for getting the number of observations in a data frame is the function `n()`.

```{r}
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())
```

Another alternative to `count()` is to use `group_by()` followed by `tally()`.
In fact, `count()` is effectively a short-cut for `group_by()` followed by `tally()`.

```{r}
not_cancelled %>%
  group_by(dest) %>%
  tally()
```

The second expression also uses the `count()` function, but adds a `wt` argument.

```{r}
not_cancelled %>% 
  count(tailnum, wt = distance)
```

As before, we can replicate `count()` by combining the `group_by()` and `summarise()` verbs.
But this time instead of using `length()`, we will use `sum()` with the weighting variable.

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
```

Like the previous example, we can also use the combination `group_by()` and `tally()`.
The weighting variable passed to `tally()` is summed within each group.

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  tally(distance)
```
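
To check that these approaches agree, the following sketch compares them on a small made-up data frame (`toy_flights` is hypothetical, not a subset of `flights`).
```{r}
library(dplyr)

# Hypothetical planes and distances.
toy_flights <- tibble(
  tailnum = c("N1", "N1", "N2", "N2", "N2"),
  distance = c(100, 200, 50, 50, 50)
)

with_count <- count(toy_flights, tailnum, wt = distance)
with_summarise <- toy_flights %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
with_tally <- toy_flights %>%
  group_by(tailnum) %>%
  tally(distance)

# All three approaches give the same weighted counts.
identical(with_count$n, with_summarise$n)
identical(with_count$n, with_tally$n)
```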

</div>

### Exercise 5.6.3 {.unnumbered .exercise data-number="5.6.3"}

<div class="question">
Our definition of cancelled flights `(is.na(dep_delay) | is.na(arr_delay))` is slightly suboptimal. 
Why? 
Which is the most important column?
</div>

<div class="answer">

If a flight never departs, then it won't arrive.
A flight could also depart and not arrive if it crashes, or if it is redirected and lands in an airport other than its intended destination.
So the most important column is `arr_delay`, which indicates the amount of delay in arrival.
```{r}
filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
  select(dep_time, arr_time, sched_arr_time, dep_delay, arr_delay)
```

In this data, `dep_time` can be non-missing while `arr_delay` is missing, even though `arr_time` is not missing.
Some further [research](https://hyp.is/TsdRpofJEeqzs6-vUOfVBg/jrnold.github.io/r4ds-exercise-solutions/transform.html) found that these rows correspond to diverted flights.
The [BTS](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236) database that is the source for the `flights` table contains additional information for diverted flights that is not included in the nycflights13 data. 
The source contains a column `DivArrDelay` with the description:

> Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. 
> The `ArrDelay` column remains `NULL` for all diverted flights.

</div>

### Exercise 5.6.4 {.unnumbered .exercise data-number="5.6.4"}

<div class="question">
Look at the number of cancelled flights per day. 
Is there a pattern?
Is the proportion of cancelled flights related to the average delay?
</div>

<div class="answer">

One pattern in cancelled flights per day is that the number of cancelled flights increases with the total number of flights per day.
The proportion of cancelled flights increases with the average delay of flights.

To answer these questions, use the definition of cancelled flights from the
chapter, [Section 5.6.3](https://r4ds.had.co.nz/transform.html#counts), and the
fact that `!(!is.na(arr_delay) & !is.na(dep_delay))` is equal to
`is.na(arr_delay) | is.na(dep_delay)` by [De Morgan's law](https://en.wikipedia.org/wiki/De_Morgan%27s_laws).
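
De Morgan's laws can be checked exhaustively on logical vectors covering all four combinations of missingness.
The vectors `arr_missing` and `dep_missing` below are stand-ins for `is.na(arr_delay)` and `is.na(dep_delay)`.
```{r}
# All four combinations of (arrival delay missing, departure delay missing).
arr_missing <- c(TRUE, TRUE, FALSE, FALSE)
dep_missing <- c(TRUE, FALSE, TRUE, FALSE)

# "Not (both present)" is the same as "either missing".
identical(!(!arr_missing & !dep_missing), arr_missing | dep_missing)
# "Not (both missing)" is the same as "either present".
identical(!(arr_missing & dep_missing), !arr_missing | !dep_missing)
```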

The first part of the question asks for any pattern in the number of cancelled flights per day.
I'll look at the relationship between the number of cancelled flights per day and the total number of flights in a day.
There should be an increasing relationship for two reasons.
First, if all flights are equally likely to be cancelled, then days with more flights should have a higher number of cancellations.
Second, it is likely that days with more flights would have a higher probability of cancellations because congestion itself can cause delays and any delay would affect more flights, and large delays can lead to cancellations.
```{r}
cancelled_per_day <- 
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n(),
  )
```
Plotting `flights_num` against `cancelled_num` shows that the number of flights
cancelled increases with the total number of flights.
```{r}
ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num)) 

```

The second part of the question asks whether there is a relationship between the proportion of flights cancelled and the average departure delay.
I implied this in my answer to the first part of the question, when I noted that increasing delays could result in increased cancellations.
The question does not specify which delay, so I will show the relationship for both.
```{r}
cancelled_and_delays <- 
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_prop = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ungroup()
```

There is a strong increasing relationship between the proportion of cancelled flights and both
the average departure delay and the average arrival delay.
```{r}
ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_dep_delay, y = cancelled_prop))

```

```{r}
ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_arr_delay, y = cancelled_prop))

```

</div>

### Exercise 5.6.5 {.unnumbered .exercise data-number="5.6.5"}

<div class="question">
Which carrier has the worst delays? 
Challenge: can you disentangle the effects of bad airports vs. bad carriers? 
Why/why not? 
(Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
</div>

<div class="answer">

```{r}
flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
```

What airline corresponds to the `"F9"` carrier code?
```{r}
filter(airlines, carrier == "F9")
```

You can get part of the way to disentangling the effects of bad airports versus bad carriers by comparing the average delay of each carrier to the average delay of flights within a route (flights from the same origin to the same destination).
Comparing delays between carriers and within each route disentangles the effect of carriers and airports.
A better analysis would compare the average delay of a carrier's flights to the average delay of *all other* carriers' flights within a route.

```{r}
flights %>%
  filter(!is.na(arr_delay)) %>%
  # Total delay by carrier within each origin, dest
  group_by(origin, dest, carrier) %>%
  summarise(
    arr_delay = sum(arr_delay),
    flights = n()
  ) %>%
  # Total delay within each origin dest
  group_by(origin, dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    flights_total = sum(flights)
  ) %>%
  # average delay of each carrier - average delay of other carriers
  ungroup() %>%
  mutate(
    arr_delay_others = (arr_delay_total - arr_delay) /
      (flights_total - flights),
    arr_delay_mean = arr_delay / flights,
    arr_delay_diff = arr_delay_mean - arr_delay_others
  ) %>%
  # remove NaN values (when there is only one carrier)
  filter(is.finite(arr_delay_diff)) %>%
  # average over all airports it flies to
  group_by(carrier) %>%
  summarise(arr_delay_diff = mean(arr_delay_diff)) %>%
  arrange(desc(arr_delay_diff))
```
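
The "all other carriers" average is easier to follow on a single made-up route.
In this sketch, the carrier codes and totals in `route_toy` are hypothetical; each carrier's mean delay is compared with the pooled mean of the remaining carriers on the route.
```{r}
library(dplyr)

# One hypothetical route served by three carriers.
route_toy <- tibble(
  carrier = c("AA", "BB", "CC"),
  arr_delay = c(100, 30, 20), # total minutes of delay
  flights = c(10, 5, 5)       # number of flights
) %>%
  mutate(
    arr_delay_mean = arr_delay / flights,
    # Subtract the carrier's own totals to get the totals of all other carriers.
    arr_delay_others = (sum(arr_delay) - arr_delay) / (sum(flights) - flights),
    arr_delay_diff = arr_delay_mean - arr_delay_others
  )
route_toy
```
A positive `arr_delay_diff` means the carrier is slower than its competitors on that route, which holds the airports fixed.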

There are more sophisticated ways to do this analysis, but comparing the delay of flights within each route goes a long way toward disentangling airport and carrier effects.
To see a more complete example of this analysis, see this FiveThirtyEight [piece](https://fivethirtyeight.com/features/the-best-and-worst-airlines-airports-and-flights-summer-2015-update/).

</div>

### Exercise 5.6.6 {.unnumbered .exercise data-number="5.6.6"}

<div class="question">

What does the sort argument to `count()` do?
When might you use it?

</div>

<div class="answer">

The `sort` argument to `count()` sorts the results in descending order of `n`.
You could use this anytime you would run `count()` followed by `arrange()`.

For example, the following expression counts the number of flights to a destination and sorts the returned data from highest to lowest.
```{r}
flights %>%
  count(dest, sort = TRUE)
```
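
The equivalence with `count()` followed by `arrange()` can be checked on a small made-up data frame (`toy_dest` is hypothetical).
```{r}
library(dplyr)

# Hypothetical destinations: ORD appears 3 times, ATL twice, LAX once.
toy_dest <- tibble(dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL"))

sorted_count <- count(toy_dest, dest, sort = TRUE)
manual_sort <- toy_dest %>%
  count(dest) %>%
  arrange(desc(n))

# Both give destinations ordered from most to least frequent.
identical(sorted_count$dest, manual_sort$dest)
```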

</div>

## Grouped mutates (and filters) {#grouped-mutates-and-filters .r4ds-section}

### Exercise 5.7.1 {.unnumbered .exercise data-number="5.7.1"}

<div class="question">

Refer back to the lists of useful mutate and filtering functions.
Describe how each operation changes when you combine it with grouping.

</div>

<div class="answer">

Summary functions (`mean()`), offset functions (`lead()`, `lag()`), ranking functions (`min_rank()`, `row_number()`), operate within each group when used with `group_by()` in 
`mutate()` or `filter()`.
Arithmetic operators (`+`, `-`), logical operators (`<`, `==`), modular arithmetic operators (`%%`, `%/%`), logarithmic functions (`log()`) are not affected by `group_by()`.

Summary functions like `mean()`, `median()`, `sum()`, `sd()` and others covered
in the section [Useful Summary Functions](https://r4ds.had.co.nz/transform.html#summarise-funs) 
calculate their values within each group when used with `mutate()` or `filter()` and `group_by()`.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(x_mean = mean(x)) %>%
  group_by(group) %>%
  mutate(x_mean_2 = mean(x))
```

Arithmetic operators `+`, `-`, `*`, `/`, `^` are not affected by `group_by()`.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(y = x + 2) %>%
  group_by(group) %>%
  mutate(z = x + 2)
```

The modular arithmetic operators `%/%` and `%%` are not affected by `group_by()`.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(y = x %% 2) %>%
  group_by(group) %>%
  mutate(z = x %% 2)
```

The logarithmic functions `log()`, `log2()`, and `log10()` are not affected by
`group_by()`.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(y = log(x)) %>%
  group_by(group) %>%
  mutate(z = log(x))
```

The offset functions `lead()` and `lag()` respect the groupings in `group_by()`.
The functions `lag()` and `lead()` will only return values within each group.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  group_by(group) %>%
  mutate(lag_x = lag(x),
         lead_x = lead(x))
```

The cumulative and rolling aggregate functions `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, and `cummean()` calculate values within each group.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(x_cumsum = cumsum(x)) %>%
  group_by(group) %>%
  mutate(x_cumsum_2 = cumsum(x))

```

Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==` are not affected by `group_by()`.
```{r}
tibble(x = 1:9,
       y = 9:1,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(x_lte_y = x <= y) %>%
  group_by(group) %>%
  mutate(x_lte_y_2 = x <= y)
```

Ranking functions like `min_rank()` work within each group when used with `group_by()`.
```{r}
tibble(x = 1:9,
       group = rep(c("a", "b", "c"), each = 3)) %>%
  mutate(rnk = min_rank(x)) %>%
  group_by(group) %>%
  mutate(rnk2 = min_rank(x))
```

Though not asked in the question, note that `arrange()` ignores groups when sorting values.
```{r}
tibble(x = runif(9),
       group = rep(c("a", "b", "c"), each = 3)) %>%
  group_by(group) %>%
  arrange(x)

```
However, the order of values from `arrange()` can interact with groups when 
used with functions that rely on the ordering of elements, such as `lead()`, `lag()`,
or `cumsum()`.
```{r}
tibble(group = rep(c("a", "b", "c"), each = 3), 
       x = runif(9)) %>%
  group_by(group) %>%
  arrange(x) %>%
  mutate(lag_x = lag(x))

```
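
If sorting within groups is what you want, `arrange()` has a `.by_group` argument that sorts by the grouping variables first.
A brief sketch with a made-up grouped data frame (`grouped_toy`):
```{r}
library(dplyr)

grouped_toy <- tibble(
  group = c("b", "a", "b", "a"),
  x = c(1, 4, 3, 2)
) %>%
  group_by(group)

# By default the groups are ignored and rows are sorted by x alone.
sorted_default <- arrange(grouped_toy, x)
# With .by_group = TRUE rows are sorted by group first, then by x.
sorted_by_group <- arrange(grouped_toy, x, .by_group = TRUE)

sorted_default$x
sorted_by_group$x
```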

</div>

### Exercise 5.7.2 {.unnumbered .exercise data-number="5.7.2"}

<div class="question">
Which plane (`tailnum`) has the worst on-time record?
</div>

<div class="answer">

The question does not define a way to measure on-time record, so I will consider two metrics:

1.  proportion of flights not delayed or cancelled, and
1.  mean arrival delay.

The first metric is the proportion of not-cancelled and on-time flights.
I use the presence of an arrival time as an indicator that a flight was not cancelled.
However, there are many planes that have never flown an on-time flight.
Additionally, many of the planes that have the lowest proportion of on-time flights have only flown a small number of flights.
```{r}
flights %>%
  filter(!is.na(tailnum)) %>%
  mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(min_rank(on_time) == 1)

```

So, I will remove planes that flew fewer than 20 flights.
I chose 20 because it is a round number near the first quartile of the number of flights by plane.[^delay][^count]
```{r}
quantile(count(flights, tailnum)$n)
```

The plane with the worst on-time record that flew at least 20 flights is:
```{r}
flights %>%
  filter(!is.na(tailnum), is.na(arr_time) | !is.na(arr_delay)) %>%
  mutate(on_time = !is.na(arr_time) & (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(on_time = mean(on_time), n = n()) %>%
  filter(n >= 20) %>%
  filter(min_rank(on_time) == 1)

```

There are cases where `arr_delay` is missing but `arr_time` is not missing.
I have not debugged the cause of this bad data, so these rows are dropped for
the purposes of this exercise.

The second metric is the mean minutes delayed.
As with the previous metric, I will only consider planes which flew at least 20 flights.
A different plane has the worst on-time record when measured as average minutes delayed.
```{r}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(arr_delay = mean(arr_delay), n = n()) %>%
  filter(n >= 20) %>%
  filter(min_rank(desc(arr_delay)) == 1)
```

</div>

### Exercise 5.7.3 {.unnumbered .exercise data-number="5.7.3"}

<div class="question">
What time of day should you fly if you want to avoid delays as much as possible?
</div>

<div class="answer">

Let's group by the hour of the flight.
The earlier the flight is scheduled, the lower its expected delay.
This is intuitive as delays will affect later flights. 
Morning flights have fewer (if any) previous flights that can delay them.

```{r}
flights %>%
  group_by(hour) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(arr_delay)
```

</div>

### Exercise 5.7.4 {.unnumbered .exercise data-number="5.7.4"}

<div class="question">

For each destination, compute the total minutes of delay. 
For each flight, compute the proportion of the total delay for its destination.

</div>

<div class="answer">

The key to answering this question is to only include delayed flights when calculating the total delay and proportion of delay.

```{r}
flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    arr_delay_prop = arr_delay / arr_delay_total
  ) %>%
  select(dest, month, day, dep_time, carrier, flight,
         arr_delay, arr_delay_prop) %>%
  arrange(dest, desc(arr_delay_prop))
```
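
A quick sanity check on a grouped proportion is that it sums to one within each group.
The sketch below applies the same pattern to a made-up data frame (`toy_delays` and `toy_check` are names for this illustration only).
```{r}
library(dplyr)

# Hypothetical delayed flights to two destinations.
toy_delays <- tibble(
  dest = c("ATL", "ATL", "BOS", "BOS", "BOS"),
  arr_delay = c(10, 30, 5, 5, 10)
) %>%
  group_by(dest) %>%
  mutate(arr_delay_prop = arr_delay / sum(arr_delay)) %>%
  ungroup()

# The proportions for each destination should add up to 1.
toy_check <- toy_delays %>%
  group_by(dest) %>%
  summarise(total_prop = sum(arr_delay_prop))
toy_check
```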

There is some ambiguity in the meaning of the term *flights* in the question.
The first example defined a flight as a row in the `flights` table, which is a trip by an aircraft from an airport at a particular date and time. 
However, *flight* could also refer to the [flight number](https://en.wikipedia.org/wiki/Flight_number), which is the code a carrier uses for an airline service of a route.
For example, `AA1` is the flight number of the 09:00 American Airlines flight between JFK and LAX.
The flight number is contained in the `flights$flight` column, though what is called a "flight" is a combination of the `flights$carrier` and `flights$flight` columns.

```{r}
flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest, origin, carrier, flight) %>%
  summarise(arr_delay = sum(arr_delay)) %>%
  group_by(dest) %>%
  mutate(
    arr_delay_prop = arr_delay / sum(arr_delay)
  ) %>%
  arrange(dest, desc(arr_delay_prop)) %>%
  select(carrier, flight, origin, dest, arr_delay_prop)
```

</div>

### Exercise 5.7.5 {.unnumbered .exercise data-number="5.7.5"}

<div class="question">
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using `lag()` explore how the delay of a flight is related to the delay of the immediately preceding flight.
</div>

<div class="answer">

This calculates the departure delay of the preceding flight from the same airport.
```{r}
lagged_delays <- flights %>%
  arrange(origin, month, day, dep_time) %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag))
```

This plots the mean delay of a flight against each value of the delay of the previous flight.
For delays less than two hours, the relationship between the delay of the preceding flight and the current flight is nearly a line.
After that the relationship becomes more variable, as long-delayed flights are interspersed with flights leaving on-time.
After about 8 hours, a delayed flight is likely to be followed by a flight leaving on time.

```{r message=FALSE}
lagged_delays %>%
  group_by(dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 1500, by = 120)) +
  labs(y = "Departure Delay", x = "Previous Departure Delay")
```

The overall relationship looks similar in all three origin airports.
```{r}
lagged_delays %>%
  group_by(origin, dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +
  facet_wrap(~ origin, ncol=1) +
  labs(y = "Departure Delay", x = "Previous Departure Delay")
```

</div>

### Exercise 5.7.6 {.unnumbered .exercise data-number="5.7.6"}

<div class="question">
Look at each destination. Can you find flights that are suspiciously fast? 
(i.e. flights that represent a potential data entry error).
Compute the air time of a flight relative to the shortest flight to that destination.
Which flights were most delayed in the air?
</div>

<div class="answer">

When calculating this answer we should only compare flights within the same (origin, destination) pair.

To find unusual observations, we need to first put them on the same scale.
I will [standardize](https://en.wikipedia.org/wiki/Standard_score)
values by subtracting the mean from each and then dividing each by the standard deviation.
$$
\mathsf{standardized}(x) = \frac{x - \mathsf{mean}(x)}{\mathsf{sd}(x)} .
$$
A standardized variable is often called a $z$-score.
The units of the standardized variable are standard deviations from the mean.
This will put the flight times from different routes on the same scale.
The larger the magnitude of the standardized variable for an observation, the more unusual the observation is.
Flights with negative values of the standardized variable are faster than the 
mean flight for that route, while those with positive values are slower than
the mean flight for that route.
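
Before applying this to the flights data, here is the formula on a made-up vector of air times for a single route (`toy_air_time` is hypothetical).
By construction, the standardized values have mean zero and standard deviation one.
```{r}
# Hypothetical air times in minutes for one route.
toy_air_time <- c(80, 90, 100, 110, 120)

# Subtract the mean and divide by the standard deviation.
toy_z <- (toy_air_time - mean(toy_air_time)) / sd(toy_air_time)
round(toy_z, 2)

# Check the mean and standard deviation of the standardized values.
c(mean = mean(toy_z), sd = sd(toy_z))
```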

```{r}
standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(
    air_time_mean = mean(air_time),
    air_time_sd = sd(air_time),
    n = n()
  ) %>%
  ungroup() %>%
  mutate(air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1))
```
I add 1 to the denominator to avoid dividing by zero.
Note that the `ungroup()` here is not necessary. However, I will be using 
this data frame later. Through experience, I have found that I have fewer bugs
when I keep a data frame grouped for only those verbs that need it.
If I did not `ungroup()` this data frame, verbs used later that respect grouping, such as `slice()`, would
operate within each group rather than on the whole data frame. It is better to err on the side of using `ungroup()`
even when it is unnecessary.

The distribution of the standardized air times has a long right tail.
```{r}
ggplot(standardized_flights, aes(x = air_time_standard)) +
  geom_density()
```

Unusually fast flights are those flights with the smallest standardized values.
```{r}
standardized_flights %>%
  arrange(air_time_standard) %>%
  select(
    carrier, flight, origin, dest, month, day,
    air_time, air_time_mean, air_time_standard
  ) %>%
  head(10) %>%
  print(width = Inf)
```

I used `width = Inf` to ensure that all columns will be printed.

```{r include=FALSE,purl=FALSE}
format_ymd <- function(y, m, d) {
  format(lubridate::make_date(y, m, d), "%Y-%m-%d")
}
format_time <- function(x) {
  format(
    lubridate::make_datetime(hour = x %/% 100, min = x %% 100),
    "%H:%M"
  )
}
fastest_flight <- standardized_flights %>%
  arrange(air_time_standard) %>%
  slice(1L) %>%
  mutate(
    date = format_ymd(year, month, day),
    time = format_time(dep_time),
    flightnum = str_c(carrier, flight)
  )
```
The fastest flight is `r fastest_flight$flightnum` from `r fastest_flight$origin` to
`r fastest_flight$dest` which departed on 
`r fastest_flight$date` at `r fastest_flight$time`.
It has an air time of `r round(fastest_flight$air_time)` minutes, compared to an average
flight time of `r round(fastest_flight$air_time_mean)` minutes for its route.
This is `r round(abs(fastest_flight$air_time_standard), 1)` standard deviations below
the average flight on its route.

It is important to note that this does not necessarily imply that there was a data entry error. 
We should check these flights to see whether there was some reason for the difference.
It may be that we are missing some piece of information that explains these unusual times.

A potential issue with the way that we standardized the flights is that the mean and standard deviation used in the calculation are sensitive to outliers, and outliers are what we are looking for.
Instead of standardizing variables with the mean and standard deviation, we could use the median
as a measure of central tendency and the interquartile range (IQR) as a measure of spread.
The median and IQR are more [resistant to outliers](https://en.wikipedia.org/wiki/Robust_statistics) than the mean and standard deviation.
The following method standardizes air times using the median and IQR.

```{r}
standardized_flights2 <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(
    air_time_median = median(air_time),
    air_time_iqr = IQR(air_time),
    n = n(),
    air_time_standard = (air_time - air_time_median) / air_time_iqr)
```
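
To see why the median and IQR are more resistant, consider a made-up route with six ordinary air times and one suspiciously short one (`toy_times` is hypothetical).
The outlier inflates the standard deviation enough to partly mask itself, while the median and IQR are nearly unaffected by it.
```{r}
# Hypothetical air times in minutes; the last value is the outlier.
toy_times <- c(98, 99, 100, 100, 101, 102, 40)

# Standardize with the mean and standard deviation.
toy_z_score <- (toy_times - mean(toy_times)) / sd(toy_times)
# Standardize with the median and interquartile range.
toy_robust <- (toy_times - median(toy_times)) / IQR(toy_times)

# Compare the two scores for the outlier.
c(z_score = toy_z_score[7], robust = toy_robust[7])
```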

The distribution of the standardized air times using this new definition
also has a long right tail of slow flights.
```{r}
ggplot(standardized_flights2, aes(x = air_time_standard)) +
  geom_density()
```

Unusually fast flights are those flights with the smallest standardized values.
```{r}
standardized_flights2 %>%
  arrange(air_time_standard) %>%
  select(
    carrier, flight, origin, dest, month, day, air_time,
    air_time_median, air_time_standard
  ) %>%
  head(10) %>%
  print(width = Inf)
```

All of these answers have relied only on using a distribution of comparable observations to find unusual observations. 
In this case, the comparable observations were flights from the same origin to the same destination.
Apart from our knowledge that flights from the same origin to the same destination should have similar air times, we have not used any other domain-specific knowledge.
But we know much more about this problem.
The most obvious piece of knowledge we have is that we know that flights cannot travel back in time, so there should never be a flight with a negative airtime.
But we also know that aircraft have maximum speeds.
While different aircraft have different [cruising speeds](https://en.wikipedia.org/wiki/Cruise_(aeronautics)), commercial airliners
typically cruise at air speeds around 547–575 mph.
Calculating the ground speed of aircraft is complicated by the influence of wind, especially jet streams.
A strong tailwind can increase the ground speed of an aircraft by [200 mph](https://www.wired.com/story/norwegian-air-transatlantic-speed-record/).
For example, in 2018, [a transatlantic flight](https://www.wired.com/story/norwegian-air-transatlantic-speed-record/) 
traveled at 770 mph due to a strong jet stream tailwind.
This means that, apart from the retired [Concorde](https://en.wikipedia.org/wiki/Concorde), any commercial flight traveling at a ground speed greater than 800 mph is implausible, 
and it may be worth checking flights traveling at greater than 600 or 700 mph.
Ground speed could also be used to identify aircraft flying implausibly slow.
Joining flights data with the aircraft type in the `planes` table and getting
information about typical or top speeds of those aircraft could provide a more 
detailed way to identify implausibly fast or slow flights.
Additional data on high altitude wind speeds at the time of the flight would further help.
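
As a rough sketch of how these thresholds could be applied, the code below labels flights by ground speed.
The data frame `toy_speed` and the column `speed_check` are hypothetical names with made-up values, and the 600 and 800 mph cutoffs are simply the rules of thumb from the text, not official limits.

```{r}
library("dplyr")

# Hypothetical flights: distance in miles, air time in minutes
toy_speed <- data.frame(
  distance = c(760, 2475, 500, 1000),
  air_time = c(65, 300, 30, 120)
)

toy_speed_checked <- toy_speed %>%
  mutate(
    # ground speed in miles per hour
    mph = distance / (air_time / 60),
    # label each flight using the rule-of-thumb cutoffs
    speed_check = case_when(
      mph > 800 ~ "implausible",
      mph > 600 ~ "worth checking",
      TRUE ~ "plausible"
    )
  )
toy_speed_checked
```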

Knowing the substance of the data analysis at hand is one of the most important 
tools of a data scientist. The tools of statistics are a complement, not a 
substitute, for that knowledge.

With that in mind, let's plot the distribution of the ground speed of flights. 
The modal flight in this data has a ground speed of between 400 and 500 mph.
The distribution of ground speeds has a long left tail of slower flights, with flights below
400 mph constituting the majority.
There are very few flights with a ground speed over 500 mph.

```{r}
flights %>%
  mutate(mph = distance / (air_time / 60)) %>%
  ggplot(aes(x = mph)) +
  geom_histogram(binwidth = 10)
```

```{r include=FALSE,purl=FALSE}
flights_mph <- flights %>%
  mutate(mph = distance / (air_time / 60))
fastest_flight_mph <- arrange(ungroup(flights_mph), desc(mph)) %>%
  slice(1L)
over_600mph <- filter(flights_mph, mph > 600) %>% nrow()
```

The fastest flight is the same one identified as the largest outlier earlier.
Its ground speed was `r round(fastest_flight_mph$mph)` mph. 
This is fast for a commercial jet, but not impossible. 

```{r}
flights %>%
  mutate(mph = distance / (air_time / 60)) %>%
  arrange(desc(mph)) %>%
  select(mph, flight, carrier, month, day, dep_time) %>%
  head(5)
```

One explanation for unusually fast flights is that they are "making up time" in the air by flying faster.
Commercial aircraft do not fly at their top speed since the airlines are also concerned about fuel consumption.
But, if a flight is delayed on the ground, it may fly faster than usual in order to avoid a late arrival.
So, I would expect that some of the unusually fast flights were delayed on departure.

```{r}
flights %>%
  mutate(mph = distance / (air_time / 60)) %>%
  arrange(desc(mph)) %>%
  select(
    origin, dest, mph, year, month, day, dep_time, flight, carrier,
    dep_delay, arr_delay
  ) %>%
  head(10)
```

Five of the top ten flights had departure delays, and three of those were
able to make up that time in the air and arrive ahead of schedule.

Overall, there were a few flights that seemed unusually fast, but they all 
fall into the realm of plausibility and likely are not data entry problems.
[Ed. Please correct me if I am missing something]

<!--
Similarly, the longest [regularly scheduled flight](https://en.wikipedia.org/wiki/Longest_flights#Record_flights) is Newark
to Singapore with a duration of 18 hours 30--45 minutes.
Thus, we should never observe any commercial flight longer than 19 hours in our
data.
-->

The second part of the question asks us to compare flights to the fastest flight
on a route to find the flights most delayed in the air. I will calculate the 
amount a flight is delayed in air in two ways. 
The first is the absolute delay, defined as the number of minutes longer than the fastest flight on that route, `air_time - min(air_time)`. 
The second is the relative delay, which is the percentage increase in air time relative to the time of the fastest flight
along that route, `(air_time - min(air_time)) / min(air_time) * 100`.

```{r}
air_time_delayed <-
  flights %>%
  group_by(origin, dest) %>%
  mutate(
    air_time_min = min(air_time, na.rm = TRUE),
    air_time_delay = air_time - air_time_min,
    air_time_delay_pct = air_time_delay / air_time_min * 100
  )
```
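
The two measures can rank flights differently, which is why both are worth computing.
The sketch below uses a made-up data frame (`toy_routes`, with hypothetical routes named "short" and "long") to illustrate: the flight on the long route has the larger absolute delay, while the flight on the short route has the larger relative delay.

```{r}
library("dplyr")

# Hypothetical air times (minutes) on two routes of very different length
toy_routes <- data.frame(
  route = c("short", "short", "long", "long"),
  air_time = c(30, 60, 300, 360)
)

toy_delays <- toy_routes %>%
  group_by(route) %>%
  mutate(
    air_time_min = min(air_time),
    # absolute delay: minutes longer than the fastest flight on the route
    air_time_delay = air_time - air_time_min,
    # relative delay: percentage increase over the fastest flight on the route
    air_time_delay_pct = air_time_delay / air_time_min * 100
  ) %>%
  ungroup()
toy_delays
```
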
```{r include=FALSE,purl=FALSE}
most_delayed <- arrange(
  air_time_delayed,
  desc(air_time_delay)
) %>%
  mutate(
    date = format_ymd(year, month, day),
    time = format_time(dep_time),
    carrier_flight = str_c(carrier, flight)
  ) %>%
  head(1L)

most_delayed_pct <- arrange(air_time_delayed, desc(air_time_delay_pct)) %>%
  mutate(
    date = format_ymd(year, month, day),
    time = format_time(dep_time),
    carrier_flight = str_c(carrier, flight)
  ) %>%
  head(1L)
```

The most delayed flight in air in minutes was `r most_delayed$carrier_flight` 
from `r most_delayed$origin` to `r most_delayed$dest` which departed on 
`r most_delayed$date` at `r most_delayed$time`. It took
`r most_delayed$air_time_delay` minutes longer than the flight with the shortest
air time on its route.

```{r}
air_time_delayed %>%
  arrange(desc(air_time_delay)) %>%
  select(
    air_time_delay, carrier, flight,
    origin, dest, year, month, day, dep_time,
    air_time, air_time_min
  ) %>%
  head() %>%
  print(width = Inf)
```

The most delayed flight in air as a percentage of the fastest flight along that
route was `r most_delayed_pct$carrier_flight` 
from `r most_delayed_pct$origin` to `r most_delayed_pct$dest` departing on `r most_delayed_pct$date` at `r most_delayed_pct$time`.
It took `r round(most_delayed_pct$air_time_delay_pct)`% longer than the 
flight with the shortest air time on its route.

```{r}
air_time_delayed %>%
  arrange(desc(air_time_delay_pct)) %>%
  select(
    air_time_delay_pct, carrier, flight,
    origin, dest, year, month, day, dep_time,
    air_time, air_time_min
  ) %>%
  head() %>%
  print(width = Inf)
```

</div>

### Exercise 5.7.7 {.unnumbered .exercise data-number="5.7.7"}

<div class="question">
Find all destinations that are flown by at least two carriers. 
Use that information to rank the carriers.
</div>

<div class="answer">

To restate this question, we are asked to rank airlines by the number of destinations that they fly to, considering only those airports that are flown to by two or more airlines.
There are two steps to calculating this ranking.
First, find all airports serviced by two or more carriers.
Then, rank carriers by the number of those destinations that they service.

```{r}
flights %>%
  # find all airports with > 1 carrier
  group_by(dest) %>%
  mutate(n_carriers = n_distinct(carrier)) %>%
  filter(n_carriers > 1) %>%
  # rank carriers by number of destinations
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))
```

The carrier `"EV"` flies to the most destinations, considering only airports flown to by two or more carriers. What airline does the `"EV"` carrier code correspond to?
```{r}
filter(airlines, carrier == "EV")
```
Unless you know the airline industry, it is likely that you don't recognize [ExpressJet](https://en.wikipedia.org/wiki/ExpressJet); I certainly didn't.
It is a regional airline that partners with major airlines to fly from hubs (larger airports) to smaller airports.
This means that many of the shorter flights of major carriers are operated by ExpressJet.
This business model explains why ExpressJet services the most destinations.

Among the airlines that fly to only one destination from New York are Alaska Airlines,
Frontier Airlines, and Hawaiian Airlines.
```{r}
filter(airlines, carrier %in% c("AS", "F9", "HA"))
```
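
To make the two-step ranking used in this answer easier to follow, here is a minimal sketch on a made-up table.
The data frame `toy_flights` and its carrier and destination codes are hypothetical.
Destination `"Y"` is served by a single carrier, so it is dropped before the carriers are ranked.

```{r}
library("dplyr")

# Hypothetical flights: one row per carrier-destination pair
toy_flights <- data.frame(
  carrier = c("AA", "AA", "BB", "BB", "CC"),
  dest = c("X", "Y", "X", "Z", "Z")
)

toy_ranking <- toy_flights %>%
  # step 1: keep only destinations served by more than one carrier
  group_by(dest) %>%
  filter(n_distinct(carrier) > 1) %>%
  # step 2: rank carriers by the number of those destinations they serve
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest), carrier)
toy_ranking
```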

</div>

### Exercise 5.7.8 {.unnumbered .exercise data-number="5.7.8"}

<div class="question">
For each plane, count the number of flights before the first delay of greater than 1 hour.
</div>

<div class="answer">

The question does not specify arrival or departure delay.
I consider `dep_delay` in this answer, though similar code could be used for `arr_delay`.

```{r}
flights %>%
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  # sort each plane's flights by date
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # count the flights before the first such delay (cumulative count still 0)
  summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
  arrange(total_flights)
  arrange(total_flights)
```
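
The `cumsum()` step may be easier to follow on a single plane.
In this sketch, `toy_dep_delay` is a made-up vector of departure delays in date order.
The running count stays at zero until the first delay of more than 60 minutes, so counting the zeros gives the number of flights before that delay.

```{r}
# Hypothetical departure delays (minutes) for one plane, in date order
toy_dep_delay <- c(5, 20, 75, 10, 90, 0)

# Running count of delays over one hour: 0 until the first such delay
toy_cumulative <- cumsum(toy_dep_delay > 60)
toy_cumulative

# Number of flights before the first delay of more than one hour
toy_flights_before <- sum(toy_cumulative < 1)
toy_flights_before
```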

</div>

[^daylight]: The exception is flights on the days on which daylight saving time started (March 10) or
    ended (November 3). Since in the US, daylight saving time goes into effect at 2 a.m.,
    and generally flights are not scheduled to depart between midnight and 2 a.m.,
    the only flights that would be scheduled to depart in Eastern Daylight Time but actually departed in Eastern Standard Time (or vice versa) would have been scheduled before midnight, meaning they were delayed across days.
    If time zones seem annoying, it is not your imagination. They are.  
    I recommend this video, [The Problem with Time & Timezones - Computerphile](https://www.youtube.com/watch?v=-5wpm-gesOY).

[^pi]: Yes, technically, `base::pi` is a double-precision approximation of $\pi$, which R prints to seven significant digits by default.
       Don't @ me.

[^delay]: We could address this issue using a statistical model, but that is outside
          the scope of this text.

[^count]: The `count()` function is introduced in [Chapter 5.6](https://r4ds.had.co.nz/transform.html#counts). It returns the count of 
          rows by group. In this case, the number of rows in `flights` for each
          `tailnum`. The data frame that `count()` returns has columns for the 
          groups, and a column `n`, which contains that count.