-
Notifications
You must be signed in to change notification settings - Fork 230
/
transform.Rmd
1952 lines (1580 loc) · 72.6 KB
/
transform.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output: html_document
editor_options:
chunk_output_type: console
---
# Data transformation {#transform .r4ds-section}
## Introduction {#introduction-2 .r4ds-section}
```{r setup,message=FALSE,cache=FALSE}
library("nycflights13")
library("tidyverse")
```
## Filter rows with `filter()` {#filter-rows-with-filter .r4ds-section}
### Exercise 5.2.1 {.unnumbered .exercise data-number="5.2.1"}
<div class="question">
Find all flights that
1. Had an arrival delay of two or more hours
1. Flew to Houston (IAH or HOU)
1. Were operated by United, American, or Delta
1. Departed in summer (July, August, and September)
1. Arrived more than two hours late, but didn’t leave late
1. Were delayed by at least an hour, but made up over 30 minutes in flight
1. Departed between midnight and 6 am (inclusive)
</div>
<div class="answer">
The answer to each part follows.
1. Since the `arr_delay` variable is measured in minutes, find
flights with an arrival delay of 120 or more minutes.
```{r ex-5.2.1-1, indent = 4}
filter(flights, arr_delay >= 120)
```
1. The flights that flew to Houston are those flights where the
destination (`dest`) is either "IAH" or "HOU".
```{r ex-5.2.1-2, indent=4}
filter(flights, dest == "IAH" | dest == "HOU")
```
However, using `%in%` is more compact and would scale to cases where
there were more than two airports we were interested in.
```{r ex-5.2.1-3, indent=4}
filter(flights, dest %in% c("IAH", "HOU"))
```
1. In the `flights` dataset, the column `carrier` indicates the airline, but it uses two-character carrier codes.
We can find the carrier codes for the airlines in the `airlines` dataset.
Since the carrier code dataset only has `r nrow(airlines)` rows, and the names
of the airlines in that dataset are not exactly "United", "American", or "Delta",
it is easiest to manually look up their carrier codes in that data.
```{r ex-5.2.1-4,indent=4}
airlines
```
The carrier code for Delta is `"DL"`, for American is `"AA"`, and for United is `"UA"`.
Using these carriers codes, we check whether `carrier` is one of those.
```{r, indent=4}
filter(flights, carrier %in% c("AA", "DL", "UA"))
```
1. The variable `month` has the month, and it is numeric.
So, the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).
```{r, indent=4}
filter(flights, month >= 7, month <= 9)
```
The `%in%` operator is an alternative. If the `:` operator is used to specify
the integer range, the expression is readable and compact.
```{r, indent=4}
filter(flights, month %in% 7:9)
```
We could also use the `|` operator. However, the `|` does not scale to
many choices.
Even with only three choices, it is quite verbose.
```{r, indent=4}
filter(flights, month == 7 | month == 8 | month == 9)
```
We can also use the `between()` function as shown in [Exercise 5.2.2](#exercise-5.2.2).
1. Flights that arrived more than two hours late, but didn’t leave late will
have an arrival delay of more than 120 minutes (`arr_delay > 120`) and
a non-positive departure delay (`dep_delay <= 0`).
```{r, indent=4}
filter(flights, arr_delay > 120, dep_delay <= 0)
```
1. Were delayed by at least an hour, but made up over 30 minutes in flight.
If a flight was delayed by at least an hour, then `dep_delay >= 60`.
If the flight didn't make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning `dep_delay == arr_delay`, or alternatively, `dep_delay - arr_delay == 0`.
If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as `dep_delay - arr_delay > 30`.
```{r}
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
```
1. Finding flights that departed between midnight and 6 a.m. is complicated by
the way in which times are represented in the data.
In `dep_time`, midnight is represented by `2400`, not `0`.
You can verify this by checking the minimum and maximum of `dep_time`.
```{r}
summary(flights$dep_time)
```
This is an example of why it is always good to check the summary statistics of your data.
Unfortunately, this means we cannot simply check that `dep_time < 600`, because we also have
to consider the special case of midnight.
```{r}
filter(flights, dep_time <= 600 | dep_time == 2400)
```
Alternatively, we could use the [modulo operator](https://en.wikipedia.org/wiki/Modulo_operation), `%%`.
The modulo operator returns the remainder of division.
Let's see how this affects our times.
```{r}
c(600, 1200, 2400) %% 2400
```
Since `2400 %% 2400 == 0` and all other times are left unchanged,
we can compare the result of the modulo operation to `600`,
```{r}
filter(flights, dep_time %% 2400 <= 600)
```
This filter expression is more compact, but its readability depends on the
familiarity of the reader with modular arithmetic.
</div>
### Exercise 5.2.2 {.unnumbered .exercise data-number="5.2.2"}
<div class="question">
Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
</div>
<div class="answer">
The expression `between(x, left, right)` is equivalent to `x >= left & x <= right`.
Of the answers in the previous question, we could simplify the statement of *departed in summer* (`month >= 7 & month <= 9`) using the `between()` function.
```{r}
filter(flights, between(month, 7, 9))
```
</div>
### Exercise 5.2.3 {.unnumbered .exercise data-number="5.2.3"}
<div class="question">
How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
</div>
<div class="answer">
Find the rows of flights with a missing departure time (`dep_time`) using the `is.na()` function.
```{r}
filter(flights, is.na(dep_time))
```
Notably, the arrival time (`arr_time`) is also missing for these rows. These seem to be cancelled flights.
The output of the function `summary()` includes the number of missing values for all non-character variables.
```{r}
summary(flights)
```
</div>
### Exercise 5.2.4 {.unnumbered .exercise data-number="5.2.4"}
<div class="question">
Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing?
Why is `FALSE & NA` not missing? Can you figure out the general rule?
(`NA * 0` is a tricky counterexample!)
</div>
<div class="answer">
```{r}
NA ^ 0
```
`NA ^ 0 == 1` since for all numeric values $x ^ 0 = 1$.
```{r}
NA | TRUE
```
`NA | TRUE` is `TRUE` because anything **or** `TRUE` is `TRUE`.
If the missing value were `TRUE`, then `TRUE | TRUE == TRUE`,
and if the missing value was `FALSE`, then `FALSE | TRUE == TRUE`.
```{r}
NA & FALSE
```
The value of `NA & FALSE` is `FALSE` because anything **and** `FALSE` is always `FALSE`.
If the missing value were `TRUE`, then `TRUE & FALSE == FALSE`,
and if the missing value was `FALSE`, then `FALSE & FALSE == FALSE`.
```{r}
NA | FALSE
```
For `NA | FALSE`, the value is unknown since `TRUE | FALSE == TRUE`, but `FALSE | FALSE == FALSE`.
```{r}
NA & TRUE
```
For `NA & TRUE`, the value is unknown since `FALSE & TRUE== FALSE`, but `TRUE & TRUE == TRUE`.
```{r}
NA * 0
```
Since $x * 0 = 0$ for all finite numbers we might expect `NA * 0 == 0`, but that's not the case.
The reason that `NA * 0 != 0` is that $0 \times \infty$ and $0 \times -\infty$ are undefined.
R represents undefined results as `NaN`, which is an abbreviation of "[not a number](https://en.wikipedia.org/wiki/NaN)".
```{r}
Inf * 0
-Inf * 0
```
</div>
## Arrange rows with `arrange()` {#arrange-rows-with-arrange .r4ds-section}
### Exercise 5.3.1 {.unnumbered .exercise data-number="5.3.1"}
<div class="question">
How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).
</div>
<div class="answer">
The `arrange()` function puts `NA` values last.
```{r}
arrange(flights, dep_time) %>%
tail()
```
Using `desc()` does not change that.
```{r}
arrange(flights, desc(dep_time))
```
To put `NA` values first, we can add an indicator of whether the column has a missing value.
Then we sort by the missing indicator column and the column of interest.
For example, to sort the data frame by departure time (`dep_time`) in ascending order but `NA` values first, run the following.
```{r}
arrange(flights, desc(is.na(dep_time)), dep_time)
```
The `flights` will first be sorted by `desc(is.na(dep_time))`.
Since `desc(is.na(dep_time))` is either `TRUE` when `dep_time` is missing, or `FALSE`, when it is not, the rows with missing values of `dep_time` will come first, since `TRUE > FALSE`.
</div>
### Exercise 5.3.2 {.unnumbered .exercise data-number="5.3.2"}
<div class="question">
Sort flights to find the most delayed flights. Find the flights that left earliest.
</div>
<div class="answer">
```{r include=FALSE,purl=FALSE}
most_delayed <- filter(flights, dep_delay == max(dep_delay, na.rm = TRUE)) %>%
mutate(date = lubridate::make_datetime(
year, month, day,
sched_dep_time %/% 100,
sched_dep_time %% 100
))
left_earliest <- filter(flights, dep_delay == min(dep_delay, na.rm = TRUE)) %>%
mutate(date = lubridate::make_datetime(
year, month, day,
sched_dep_time %/% 100,
sched_dep_time %% 100
))
```
Find the most delayed flights by sorting the table by departure delay, `dep_delay`, in descending order.
```{r}
arrange(flights, desc(dep_delay))
```
The most delayed flight was `r most_delayed$carrier` `r most_delayed$flight`, `r most_delayed$origin` to `r most_delayed$dest`, which was scheduled to leave on `r format(most_delayed$date, "%B %d, %Y %H:%M")`.
Note that the departure time is given as `r most_delayed$dep_time`, which seems to be less than the scheduled departure time.
But the departure was delayed `r comma_int(most_delayed$dep_delay)` minutes, which is `r most_delayed$dep_delay %/% 60` hours, `r most_delayed$dep_delay %% 60` minutes.
The departure time is the day after the scheduled departure time.
Be happy that you weren't on that flight, and if you happened to have been on that flight and are reading this, I'm sorry for you.
Similarly, the earliest departing flight can be found by sorting `dep_delay` in ascending order.
```{r}
arrange(flights, dep_delay)
```
Flight `r left_earliest$carrier` `r left_earliest$flight` (`r left_earliest$origin` to `r left_earliest$dest`) scheduled to depart on `r format(left_earliest$date, "%B %d, %Y at %H:%M")`
departed `r comma_int(abs(left_earliest$dep_delay))` minutes early.
</div>
### Exercise 5.3.3 {.unnumbered .exercise data-number="5.3.3"}
<div class="question">
Sort flights to find the fastest flights.
</div>
<div class="answer">
There are actually two ways to interpret this question: one that can be solved by using `arrange()`, and a more complex interpretation that requires creation of a new variable using `mutate()`, which we haven't seen demonstrated before.
The colloquial interpretation of "fastest" flight can be understood to mean "the flight with the shortest flight time". We can use arrange to sort our data by the `air_time` variable to find the shortest flights:
```{r}
head(arrange(flights, air_time))
```
Another definition of the "fastest flight" is the flight with the highest average [ground speed](https://en.wikipedia.org/wiki/Ground_speed).
The ground speed is not included in the data, but it can be calculated from the `distance` and `air_time` of the flight.
```{r}
head(arrange(flights, desc(distance / air_time)))
```
<!-- note cannot use select() or mutate() in these answers since they are not introduced yet -->
</div>
### Exercise 5.3.4 {.unnumbered .exercise data-number="5.3.4"}
<div class="question">
Which flights traveled the longest?
Which traveled the shortest?
</div>
<div class="answer">
```{r include=FALSE,purl=FALSE}
longest <- filter(flights, distance == max(distance)) %>%
select(carrier, flight, origin, dest) %>%
distinct() %>%
slice(1)
shortest <- filter(flights, distance == min(distance)) %>%
select(carrier, flight, origin, dest) %>%
distinct() %>%
slice(1)
```
To find the longest flight, sort the flights by the `distance` column in descending order.
```{r}
arrange(flights, desc(distance))
```
The longest flight is `r longest$carrier` `r longest$flight`, `r longest$origin` to `r longest$dest`, which is `r comma_int(max(flights$distance))` miles.
To find the shortest flight, sort the flights by the `distance` in ascending order, which is the default sort order.
```{r}
arrange(flights, distance)
```
The shortest flight is `r shortest$carrier` `r shortest$flight`, `r shortest$origin` to `r shortest$dest`, which is only `r comma_int(min(flights$distance))` miles.
This is a flight between two of the New York area airports.
However, since this flight is missing a departure time so it either did not actually fly or there is a problem with the data.
The terms "longest" and "shortest" could also refer to the time of the flight instead of the distance.
Now the longest and shortest flights by can be found by sorting by the `air_time` column.
The longest flights by airtime are the following.
```{r}
arrange(flights, desc(air_time))
```
The shortest flights by airtime are the following.
```{r}
arrange(flights, air_time)
```
</div>
## Select columns with `select()` {#select .r4ds-section}
### Exercise 5.4.1 {.unnumbered .exercise data-number="5.4.1"}
<div class="question">
Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.
</div>
<div class="answer">
These are a few ways to select columns.
- Specify columns names as unquoted variable names.
```{r}
select(flights, dep_time, dep_delay, arr_time, arr_delay)
```
- Specify column names as strings.
```{r}
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
```
- Specify the column numbers of the variables.
```{r}
select(flights, 4, 6, 7, 9)
```
This works, but is not good practice for two reasons.
First, the column location of variables may change, resulting in code that
may continue to run without error, but produce the wrong answer.
Second code is obfuscated, since it is not clear from the code which
variables are being selected. What variable does column 6 correspond to?
I just wrote the code, and I've already forgotten.
- Specify the names of the variables with character vector and `any_of()` or `all_of()`
```{r}
select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
```
```{r}
select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
```
This is useful because the names of the variables can be stored in a
variable and passed to `all_of()` or `any_of()`.
```{r}
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(variables))
```
These two functions replace the deprecated function `one_of()`.
- Selecting the variables by matching the start of their names using `starts_with()`.
```{r}
select(flights, starts_with("dep_"), starts_with("arr_"))
```
- Selecting the variables using regular expressions with `matches()`.
Regular expressions provide a flexible way to match string patterns
and are discussed in the [Strings](https://r4ds.had.co.nz/strings.html) chapter.
```{r}
select(flights, matches("^(dep|arr)_(time|delay)$"))
```
- Specify the names of the variables with a character vector and use the bang-bang operator (`!!`).
```{r}
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!variables)
```
This and the following answers use the features of **tidy evaluation** not covered in R4DS but covered in the [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette.
- Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.
```{r}
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!!variables)
```
- Specify the unquoted names of the variables in a list using `syms()` and use the bang-bang-bang operator.
```{r}
variables <- syms(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
select(flights, !!!variables)
```
Some things that **don't** work are:
- Matching the ends of their names using `ends_with()` since this will incorrectly
include other variables. For example,
```{r}
select(flights, ends_with("arr_time"), ends_with("dep_time"))
```
- Matching the names using `contains()` since there is not a pattern that can
include all these variables without incorrectly including others.
```{r}
select(flights, contains("_time"), contains("arr_"))
```
</div>
### Exercise 5.4.2 {.unnumbered .exercise data-number="5.4.2"}
<div class="question">
What happens if you include the name of a variable multiple times in a `select()` call?
</div>
<div class="answer">
The `select()` call ignores the duplication. Any duplicated variables are only included once, in the first location they appear. The `select()` function does not raise an error or warning or print any message if there are duplicated variables.
```{r}
select(flights, year, month, day, year, year)
```
This behavior is useful because it means that we can use `select()` with `everything()`
in order to easily change the order of columns without having to specify the names
of all the columns.
```{r}
select(flights, arr_delay, everything())
```
</div>
### Exercise 5.4.3 {.unnumbered .exercise data-number="5.4.3"}
<div class="question">
What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
</div>
<div class="answer">
The `one_of()` function selects variables with a character vector rather than unquoted variable name arguments.
This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.
```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
```
In the most recent versions of **dplyr**, `one_of` has been deprecated in favor of two functions: `all_of()` and `any_of()`.
These functions behave similarly if all variables are present in the data frame.
```{r}
select(flights, any_of(vars))
```
```{r}
select(flights, all_of(vars))
```
These functions differ in their strictness.
The function `all_of()` will raise an error if one of the variable names is not present, while `any_of()` will ignore it.
```{r error=TRUE}
vars2 <- c("year", "month", "day", "variable_not_in_the_dataframe")
select(flights, all_of(vars2))
```
```{r}
select(flights, any_of(vars2))
```
The deprecated function `one_of()` will raise a warning if an unknown column is encountered.
```{r}
select(flights, one_of(vars2))
```
In the most recent versions of **dplyr**, the `one_of()` function is less necessary due to new behavior in the selection functions.
The `select()` function can now accept the name of a vector containing the variable names you wish to select:
```{r}
select(flights, vars)
```
However there is a problem with the previous code.
The name `vars` could refer to a column named `vars` in `flights` or a different variable named `vars`.
What th code does will depend on whether or not `vars` is a column in `flights`.
If `vars` was a column in `flights`, then that code would only select the `vars` column.
For example:
```{r}
flights <- mutate(flights, vars = 1)
select(flights, vars)
```
```{r include=FALSE}
flights <- select(flights, -vars)
```
However, `vars` is not a column in `flights`, as is the case, then `select` will use the value the value of the , and select those columns.
If it has the same name or to ensure that it will not conflict with the names of the columns in the data frame, use the `!!!` (bang-bang-bang) operator.
```{r}
select(flights, !!!vars)
```
This behavior, which is used by many **tidyverse** functions, is an example of what is called non-standard evaluation (NSE) in R. See the **dplyr** vignette, [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html), for more information on this topic.
</div>
### Exercise 5.4.4 {.unnumbered .exercise data-number="5.4.4"}
<div class="question">
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
</div>
<div class="answer">
```{r}
select(flights, contains("TIME"))
```
The default behavior for `contains()` is to ignore case.
This may or may not surprise you.
If this behavior does not surprise you, that could be why it is the default.
Users searching for variable names probably have a better sense of the letters
in the variable than their capitalization.
A second, technical, reason is that dplyr works with more than R data frames.
It can also work with a variety of [databases](https://db.rstudio.com/dplyr/).
Some of these database engines have case insensitive column names, so making functions that match variable names
case insensitive by default will make the behavior of
`select()` consistent regardless of whether the table is
stored as an R data frame or in a database.
To change the behavior add the argument `ignore.case = FALSE`.
```{r}
select(flights, contains("TIME", ignore.case = FALSE))
```
</div>
## Add new variables with `mutate()` {#add-new-variables-with-mutate .r4ds-section}
### Exercise 5.5.1 {.unnumbered .exercise data-number="5.5.1"}
<div class="question">
Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
</div>
<div class="answer">
To get the departure times in the number of minutes, divide `dep_time` by 100 to get the hours since midnight and multiply by 60 and add the remainder of `dep_time` divided by 100.
For example, `1504` represents 15:04 (or 3:04 PM), which is `r 15 * 60 + 4` minutes after midnight.
To generalize this approach, we need a way to split out the hour-digits from the minute-digits.
Dividing by 100 and discarding the remainder using the integer division operator, `%/%` gives us the following.
```{r}
1504 %/% 100
```
Instead of `%/%` could also use `/` along with `trunc()` or `floor()`, but `round()` would not work.
To get the minutes, instead of discarding the remainder of the division by `100`,
we only want the remainder.
So we use the modulo operator, `%%`, discussed in the [Other Useful Functions](https://r4ds.had.co.nz/transform.html#select) section.
```{r}
1504 %% 100
```
Now, we can combine the hours (multiplied by 60 to convert them to minutes) and
minutes to get the number of minutes after midnight.
```{r}
1504 %/% 100 * 60 + 1504 %% 100
```
There is one remaining issue. Midnight is represented by `2400`, which would
correspond to `1440` minutes since midnight, but it should correspond to `0`.
After converting all the times to minutes after midnight, `x %% 1440` will convert
`1440` to zero while keeping all the other times the same.
Now we will put it all together.
The following code creates a new data frame `flights_times` with columns `dep_time_mins` and `sched_dep_time_mins`.
These columns convert `dep_time` and `sched_dep_time`, respectively, to minutes since midnight.
```{r}
flights_times <- mutate(flights,
dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_mins = (sched_dep_time %/% 100 * 60 +
sched_dep_time %% 100) %% 1440
)
# view only relevant columns
select(
flights_times, dep_time, dep_time_mins, sched_dep_time,
sched_dep_time_mins
)
```
Looking ahead to the [Functions](https://r4ds.had.co.nz/functions.html) chapter,
this is precisely the sort of situation in which it would make sense to write
a function to avoid copying and pasting code.
We could define a function `time2mins()`, which converts a vector of times in
from the format used in `flights` to minutes since midnight.
```{r}
time2mins <- function(x) {
(x %/% 100 * 60 + x %% 100) %% 1440
}
```
Using `time2mins`, the previous code simplifies to the following.
```{r}
flights_times <- mutate(flights,
dep_time_mins = time2mins(dep_time),
sched_dep_time_mins = time2mins(sched_dep_time)
)
# show only the relevant columns
select(
flights_times, dep_time, dep_time_mins, sched_dep_time,
sched_dep_time_mins
)
```
</div>
### Exercise 5.5.2 {.unnumbered .exercise data-number="5.5.2"}
<div class="question">
Compare `air_time` with `arr_time - dep_time`.
What do you expect to see?
What do you see?
What do you need to do to fix it?
</div>
<div class="answer">
I expect that `air_time` is the difference between the arrival (`arr_time`) and departure times (`dep_time`).
In other words, `air_time = arr_time - dep_time`.
To check that this relationship, I'll first need to convert the times to a form more amenable to arithmetic operations using the same calculations as the [previous exercise](#exercise-5.5.1).
```{r}
flights_airtime <-
mutate(flights,
dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
air_time_diff = air_time - arr_time + dep_time
)
```
So, does `air_time = arr_time - dep_time`?
If so, there should be no flights with non-zero values of `air_time_diff`.
```{r}
nrow(filter(flights_airtime, air_time_diff != 0))
```
It turns out that there are many flights for which `air_time != arr_time - dep_time`.
Other than data errors, I can think of two reasons why `air_time` would not equal `arr_time - dep_time`.
1. The flight passes midnight, so `arr_time < dep_time`.
In these cases, the difference in airtime should be by 24 hours (1,440 minutes).
1. The flight crosses time zones, and the total air time will be off by hours (multiples of 60).
All flights in `flights` departed from New York City and are domestic flights in the US.
This means that flights will all be to the same or more westerly time zones.
Given the time-zones in the US, the differences due to time-zone should be 60 minutes (Central)
120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii).
Both of these explanations have clear patterns that I would expect to see if they
were true.
In particular, in both cases, since time-zones and crossing midnight only affects the hour part of the time, all values of `air_time_diff` should be divisible by 60.
I'll visually check this hypothesis by plotting the distribution of `air_time_diff`.
If those two explanations are correct, distribution of `air_time_diff` should comprise only spikes at multiples of 60.
```{r}
ggplot(flights_airtime, aes(x = air_time_diff)) +
geom_histogram(binwidth = 1)
```
This is not the case.
While, the distribution of `air_time_diff` has modes at multiples of 60 as hypothesized,
it shows that there are many flights in which the difference between air time and local arrival and departure times is not divisible by 60.
Let's also look at flights with Los Angeles as a destination.
The discrepancy should be 180 minutes.
```{r}
ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
geom_histogram(binwidth = 1)
```
To fix these time-zone issues, I would want to convert all the times to a date-time to handle overnight flights, and from local time to a common time zone, most likely [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), to handle flights crossing time-zones.
The `tzone` column of `nycflights13::airports` gives the time-zone of each airport.
See the ["Dates and Times"](https://r4ds.had.co.nz/dates-and-times.html) for an introduction on working with date and time data.
But that still leaves the other differences unexplained.
So what else might be going on?
There seem to be too many problems for this to be data entry problems, so I'm probably missing something.
So, I'll reread the documentation to make sure that I understand the definitions of `arr_time`, `dep_time`, and
`air_time`.
The documentation contains a link to the source of the `flights` data, <https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236>.
This documentation shows that the `flights` data does not contain the variables `TaxiIn`, `TaxiOff`, `WheelsIn`, and `WheelsOff`.
It appears that the `air_time` variable refers to flight time, which is defined as the time between wheels-off (take-off) and wheels-in (landing).
But the flight time does not include time spent on the runway taxiing to and from gates.
With this new understanding of the data, I now know that the relationship between `air_time`, `arr_time`, and `dep_time` is `air_time <= arr_time - dep_time`, supposing that the time zones of `arr_time` and `dep_time` are in the same time zone.
</div>
### Exercise 5.5.3 {.unnumbered .exercise data-number="5.5.3"}
<div class="question">
Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
</div>
<div class="answer">
I would expect the departure delay (`dep_delay`) to be equal to the difference between scheduled departure time (`sched_dep_time`), and actual departure time (`dep_time`),
`dep_time - sched_dep_time = dep_delay`.
As with the previous question, the first step is to convert all times to the
number of minutes since midnight.
The column, `dep_delay_diff`, is the difference between the column, `dep_delay`, and
departure delay calculated directly from the scheduled and actual departure times.
```{r}
flights_deptime <-
mutate(flights,
dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_min = (sched_dep_time %/% 100 * 60 +
sched_dep_time %% 100) %% 1440,
dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min
)
```
Does `dep_delay_diff` equal zero for all rows?
```{r}
filter(flights_deptime, dep_delay_diff != 0)
```
No. Unlike the last question, time zones are not an issue since we are only
considering departure times.[^daylight]
However, the discrepancies could be because a flight was scheduled to depart
before midnight, but was delayed after midnight.
All of these discrepancies are exactly equal to 1440 (24 hours), and the flights with these discrepancies were scheduled to depart later in the day.
```{r}
ggplot(
filter(flights_deptime, dep_delay_diff > 0),
aes(y = sched_dep_time_min, x = dep_delay_diff)
) +
geom_point()
```
Thus the only cases in which the departure delay is not equal to the difference
in scheduled departure and actual departure times is due to a quirk in how these
columns were stored.
</div>
### Exercise 5.5.4 {.unnumbered .exercise data-number="5.5.4"}
<div class="question">
Find the 10 most delayed flights using a ranking function.
How do you want to handle ties?
Carefully read the documentation for `min_rank()`.
</div>
<div class="answer">
The **dplyr** package provides multiple functions for ranking, which differ in how they handle tied values: `row_number()`, `min_rank()`, `dense_rank()`.
To see how they work, let's create a data frame with duplicate values in a vector and see how ranking functions handle ties.
```{r}
rankme <- tibble(
x = c(10, 5, 1, 5, 5)
)
```
<!-- don't use 1-3 in order to avoid confusion with the rank function itself,
don't have them in order -->
```{r}
rankme <- mutate(rankme,
x_row_number = row_number(x),
x_min_rank = min_rank(x),
x_dense_rank = dense_rank(x)
)
arrange(rankme, x)
```
The function `row_number()` assigns each element a unique value.
The result is equivalent to the index (or row) number of each element after sorting the vector, hence its name.
The`min_rank()` and `dense_rank()` assign tied values the same rank, but differ in how they assign values to the next rank.
For each set of tied values the `min_rank()` function assigns a rank equal to the number of values less than that tied value plus one.
In contrast, the `dense_rank()` function assigns a rank equal to the number of distinct values less than that tied value plus one.
To see the difference between `dense_rank()` and `min_rank()` compare the value of `rankme$x_min_rank` and `rankme$x_dense_rank` for `x = 10`.
If I had to choose one for presenting rankings to someone else, I would use `min_rank()` since its results correspond to the most common usage of rankings in sports or other competitions.
In the code below, I use all three functions, but since there are no ties in the top 10 flights, the results don't differ.
```{r}
flights_delayed <- mutate(flights,
dep_delay_min_rank = min_rank(desc(dep_delay)),
dep_delay_row_number = row_number(desc(dep_delay)),
dep_delay_dense_rank = dense_rank(desc(dep_delay))
)
flights_delayed <- filter(flights_delayed,
!(dep_delay_min_rank > 10 | dep_delay_row_number > 10 |
dep_delay_dense_rank > 10))
flights_delayed <- arrange(flights_delayed, dep_delay_min_rank)
print(select(flights_delayed, month, day, carrier, flight, dep_delay,
dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank),
n = Inf)
```
In addition to the functions covered here, the `rank()` function provides several more ways of ranking elements.
There are other ways to solve this problem that do not using ranking functions.
To select the top 10, sort values with `arrange()` and select the top values with `slice`:
```{r}
flights_delayed2 <- arrange(flights, desc(dep_delay))
flights_delayed2 <- slice(flights_delayed2, 1:10)
select(flights_delayed2, month, day, carrier, flight, dep_delay)
```
Alternatively, we could use the `top_n()`.
```{r}
flights_delayed3 <- top_n(flights, 10, dep_delay)
flights_delayed3 <- arrange(flights_delayed3, desc(dep_delay))
select(flights_delayed3, month, day, carrier, flight, dep_delay)
```
The previous two approaches will always select 10 rows even if there are tied values.
Ranking functions provide more control over how tied values are handled.
Those approaches will provide the 10 rows with the largest values of `dep_delay`, while ranking functions can provide all rows with the 10 largest values of `dep_delay`.
If there are no ties, these approaches are equivalent.
If there are ties, then which is more appropriate depends on the use.
</div>
### Exercise 5.5.5 {.unnumbered .exercise data-number="5.5.5"}
<div class="question">
What does `1:3 + 1:10` return? Why?
</div>
<div class="answer">
The code given in the question returns the following.
```{r warning=TRUE}
1:3 + 1:10
```
This is equivalent to the following.
```{r}
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10)
```
When adding two vectors, R recycles the shorter vector's values to create a vector of the same length as the longer vector.
The code also raises a warning that the shorter vector is not a multiple of the longer vector.
A warning is raised since when this occurs, it is often unintended and may be a bug.
</div>
### Exercise 5.5.6 {.unnumbered .exercise data-number="5.5.6"}
<div class="question">
What trigonometric functions does R provide?
</div>
<div class="answer">
All trigonometric functions are all described in a single help page, named `Trig`.
You can open the documentation for these functions with `?Trig` or by using `?` with any of the following functions, for example:`?sin`.
R provides functions for the three primary trigonometric functions: sine (`sin()`), cosine (`cos()`), and tangent (`tan()`).
The input angles to all these functions are in [radians](https://en.wikipedia.org/wiki/Radian).
```{r}
x <- seq(-3, 7, by = 1 / 2)
sin(pi * x)
cos(pi * x)
tan(pi * x)
```
In the previous code, I used the variable `pi`.
R provides the variable `pi` which is set to the value of the mathematical constant $\pi$ . [^pi]
```{r}
pi
```
Although R provides the `pi` variable, there is nothing preventing a user from changing its value.
For example, I could redefine `pi` to [3.14](https://en.wikipedia.org/wiki/Indiana_Pi_Bill) or
any other value.
```{r}
pi <- 3.14
pi
pi <- "Apple"
pi
```
For that reason, if you are using the builtin `pi` variable in computations and are paranoid, you may want to always reference it as `base::pi`.
```{r}
base::pi
```
```{r include=FALSE}
# reset value of pi
rm(pi)
```
In the previous code block, since the angles were in radians, I wrote them as $\pi$ times some number.
Since it is often easier to write radians multiple of $\pi$, R provides some convenience functions that do that.
The function `sinpi(x)`, is equivalent to `sin(pi * x)`.
The functions `cospi()` and `tanpi()` are similarly defined for the sin and tan functions, respectively.
```{r}
sinpi(x)
cospi(x)
tanpi(x)
```
R provides the function arc-cosine (`acos()`), arc-sine (`asin()`), and arc-tangent (`atan()`).
```{r}
x <- seq(-1, 1, by = 1 / 4)
acos(x)
asin(x)
atan(x)
```
Finally, R provides the function `atan2()`.
Calling `atan2(y, x)` returns the angle between the x-axis and the vector from `(0,0)` to `(x, y)`.
```{r}
atan2(c(1, 0, -1, 0), c(0, 1, 0, -1))
```
</div>
## Grouped summaries with `summarise()` {#grouped-summaries-with-summarise .r4ds-section}
### Exercise 5.6.1 {.unnumbered .exercise data-number="5.6.1"}
<div class="question">
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
</div>
<div class="answer">
What this question gets at is a fundamental question of data analysis: the cost function.
As analysts, the reason we are interested in flight delay because it is costly to passengers.
But it is worth thinking carefully about how it is costly and use that information in ranking and measuring these scenarios.
In many scenarios, arrival delay is more important.
In most cases, being arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings.
If a departure is delayed without affecting the arrival time, this delay will not have those affects plans nor does it affect the total time spent traveling.