S04E13: R for Data Science - Chapter 5.6: summarize
The first of two parts on dplyr’s summarize()
function
Get the Dataset and Packages
# Only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
# Load the flights dataset and tidyverse
library(nycflights13)
library(tidyverse)
Preview the Dataset
glimpse(flights)
#> Rows: 336,776
#> Columns: 19
#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
#> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
#> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
#> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
#> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Review of select()
, filter()
, mutate()
select()
The flights tibble has 19 variables – to keep things simple, we’ll focus on just a few of these for now. Let’s choose the variables (columns) carrier
, flight
, air_time
, and dep_delay
:
flights_exp <- select(flights, carrier, flight, air_time, dep_delay)
flights_exp
#> # A tibble: 336,776 × 4
#> carrier flight air_time dep_delay
#> <chr> <int> <dbl> <dbl>
#> 1 UA 1545 227 2
#> 2 UA 1714 227 4
#> 3 AA 1141 160 2
#> 4 B6 725 183 -1
#> 5 DL 461 116 -6
#> 6 UA 1696 150 -4
#> 7 B6 507 158 -5
#> 8 EV 5708 53 -3
#> 9 B6 79 140 -3
#> 10 AA 301 138 -2
#> # … with 336,766 more rows
filter()
There are >336,000 observations (flights) in this dataset. Let’s reduce it to just American Airlines flights:
flights_exp <- filter(flights_exp, carrier == "AA")
flights_exp
#> # A tibble: 32,729 × 4
#> carrier flight air_time dep_delay
#> <chr> <int> <dbl> <dbl>
#> 1 AA 1141 160 2
#> 2 AA 301 138 -2
#> 3 AA 707 257 -1
#> 4 AA 1895 152 -4
#> 5 AA 1837 153 13
#> 6 AA 413 192 -2
#> 7 AA 303 140 -1
#> 8 AA 711 248 0
#> 9 AA 305 143 -4
#> 10 AA 1815 142 -3
#> # … with 32,719 more rows
mutate()
The column air_time
is measured in minutes. What if we wanted a new column air_time_hrs
that reports the air time in hours?
flights_exp <- mutate(flights_exp, air_time_hrs = air_time/60)
flights_exp
#> # A tibble: 32,729 × 5
#> carrier flight air_time dep_delay air_time_hrs
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 AA 1141 160 2 2.67
#> 2 AA 301 138 -2 2.3
#> 3 AA 707 257 -1 4.28
#> 4 AA 1895 152 -4 2.53
#> 5 AA 1837 153 13 2.55
#> 6 AA 413 192 -2 3.2
#> 7 AA 303 140 -1 2.33
#> 8 AA 711 248 0 4.13
#> 9 AA 305 143 -4 2.38
#> 10 AA 1815 142 -3 2.37
#> # … with 32,719 more rows
Section 5.6: summarise()
Here’s an example of how to use summarise()
in the simplest way. Notice the similarity in syntax with mutate()
:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 × 1
#> delay
#> <dbl>
#> 1 12.6
summarise()
is typically combined with group_by()
, calculating the new summarized variable separately for each set of grouped observations in the dataset:
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 × 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> 7 2013 1 7 5.42
#> 8 2013 1 8 2.55
#> 9 2013 1 9 2.28
#> 10 2013 1 10 2.84
#> # … with 355 more rows
And more than one new summarized variable can be calculated - here, there are 3:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay
#> # A tibble: 105 × 4
#> dest count dist delay
#> <chr> <int> <dbl> <dbl>
#> 1 ABQ 254 1826 4.38
#> 2 ACK 265 199 4.85
#> 3 ALB 439 143 14.4
#> 4 ANC 8 3370 -2.5
#> 5 ATL 17215 757. 11.3
#> 6 AUS 2439 1514. 6.02
#> 7 AVL 275 584. 8.00
#> 8 BDL 443 116 7.05
#> 9 BGR 375 378 8.03
#> 10 BHM 297 866. 16.9
#> # … with 95 more rows
Let’s look a bit closer at what this is doing (note, this is not in the book):
flights_sub <- select(flights, dest, distance, arr_delay)
flights_sub
#> # A tibble: 336,776 × 3
#> dest distance arr_delay
#> <chr> <dbl> <dbl>
#> 1 IAH 1400 11
#> 2 IAH 1416 20
#> 3 MIA 1089 33
#> 4 BQN 1576 -18
#> 5 ATL 762 -25
#> 6 ORD 719 12
#> 7 FLL 1065 19
#> 8 IAD 229 -14
#> 9 MCO 944 -8
#> 10 ORD 733 8
#> # … with 336,766 more rows
albuquerque_data <- filter(flights_sub, dest == "ABQ")
albuquerque_data
#> # A tibble: 254 × 3
#> dest distance arr_delay
#> <chr> <dbl> <dbl>
#> 1 ABQ 1826 -35
#> 2 ABQ 1826 -18
#> 3 ABQ 1826 -16
#> 4 ABQ 1826 16
#> 5 ABQ 1826 -20
#> 6 ABQ 1826 -14
#> 7 ABQ 1826 -15
#> 8 ABQ 1826 -32
#> 9 ABQ 1826 -28
#> 10 ABQ 1826 -13
#> # … with 244 more rows
mean_albuquerque_delay <- mean(albuquerque_data$arr_delay, na.rm = TRUE)
mean_albuquerque_delay
#> [1] 4.38189
Compare the value mean_albuquerque_delay
to that in the summarized delay
tibble created above:
mean_albuquerque_delay
#> [1] 4.38189
delay
#> # A tibble: 105 × 4
#> dest count dist delay
#> <chr> <int> <dbl> <dbl>
#> 1 ABQ 254 1826 4.38
#> 2 ACK 265 199 4.85
#> 3 ALB 439 143 14.4
#> 4 ANC 8 3370 -2.5
#> 5 ATL 17215 757. 11.3
#> 6 AUS 2439 1514. 6.02
#> 7 AVL 275 584. 8.00
#> 8 BDL 443 116 7.05
#> 9 BGR 375 378 8.03
#> 10 BHM 297 866. 16.9
#> # … with 95 more rows
Breakout Exercises 1
Start with the original flights
tibble for each of the following exercises.
Exercise 1
Overall, which carrier had the longest mean arrival delay (arr_delay) in 2013? (Note, all these data are from 2013.)
Hints (click here)
Group the tibble by the carrier variable, then summarise to calculate the mean arr_delay for each group. Remember to ignore missing values by setting na.rm = TRUE
within the mean()
function.
Solution (click here)
carrier_grp <- group_by(flights, carrier)
summarise(carrier_grp,
mean_delay = mean(arr_delay, na.rm = TRUE)
)
#> # A tibble: 16 × 2
#> carrier mean_delay
#> <chr> <dbl>
#> 1 9E 7.38
#> 2 AA 0.364
#> 3 AS -9.93
#> 4 B6 9.46
#> 5 DL 1.64
#> 6 EV 15.8
#> 7 F9 21.9
#> 8 FL 20.1
#> 9 HA -6.92
#> 10 MQ 10.8
#> 11 OO 11.9
#> 12 UA 3.56
#> 13 US 2.13
#> 14 VX 1.76
#> 15 WN 9.65
#> 16 YV 15.6
Exercise 2
Evaluate arrival delay by carrier again, but this time, evaluate only for carriers who made at least 10,000 flights.
Hints (click here)
Include a second variable in the summarized data from above to reflect the number of observations that went in to calculating the mean value for each group, then filter on this sample size variable. Consider using the function n()
.
Solution (click here)
carrier_delays <- summarise(carrier_grp,
mean_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
filter(carrier_delays, n >= 10000)
#> # A tibble: 9 × 3
#> carrier mean_delay n
#> <chr> <dbl> <int>
#> 1 9E 7.38 18460
#> 2 AA 0.364 32729
#> 3 B6 9.46 54635
#> 4 DL 1.64 48110
#> 5 EV 15.8 54173
#> 6 MQ 10.8 26397
#> 7 UA 3.56 58665
#> 8 US 2.13 20536
#> 9 WN 9.65 12275
Exercise 3
Based on these data, what airport is the worst to fly in to with regards to arriving on time? In other words, which airport (dest
) is associated with the highest average arrival delays?
Hints (click here)
Group the tibble by the dest
variable, summarise with the mean arr_delay
for each group, then use arrange()
to sort the new variable in descending order, which can be done with desc()
.
Solution (click here)
dest_grp <- group_by(flights, dest)
dest_delays <- summarise(dest_grp,
mean_delay = mean(arr_delay, na.rm = TRUE)
)
arrange(dest_delays, desc(mean_delay))
#> # A tibble: 105 × 2
#> dest mean_delay
#> <chr> <dbl>
#> 1 CAE 41.8
#> 2 TUL 33.7
#> 3 OKC 30.6
#> 4 JAC 28.1
#> 5 TYS 24.1
#> 6 MSN 20.2
#> 7 RIC 20.1
#> 8 CAK 19.7
#> 9 DSM 19.0
#> 10 GRR 18.2
#> # … with 95 more rows
Using Pipes
We did this above…
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
That code can be rewritten with the pipe as follows…
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
Notice here we didn’t have to create intermediate/temporary tibbles.
Breakout Exercises 2
Here we’ll redo the same analyses as in the first breakout session, but this time using the pipe.
Exercise 1
Overall, which carrier had the longest mean arrival delay (arr_delay) in 2013? (Note, all these data are from 2013.)
Solution (click here)
flights %>%
group_by(carrier) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE))
#> # A tibble: 16 × 2
#> carrier mean_delay
#> <chr> <dbl>
#> 1 9E 7.38
#> 2 AA 0.364
#> 3 AS -9.93
#> 4 B6 9.46
#> 5 DL 1.64
#> 6 EV 15.8
#> 7 F9 21.9
#> 8 FL 20.1
#> 9 HA -6.92
#> 10 MQ 10.8
#> 11 OO 11.9
#> 12 UA 3.56
#> 13 US 2.13
#> 14 VX 1.76
#> 15 WN 9.65
#> 16 YV 15.6
Exercise 2
Evaluate arrival delay by carrier again, but this time, evaluate only for carriers who made at least 10,000 flights.
Solution (click here)
flights %>%
group_by(carrier) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = TRUE),
n = n()
) %>%
filter(n > 10000)
#> # A tibble: 9 × 3
#> carrier mean_delay n
#> <chr> <dbl> <int>
#> 1 9E 7.38 18460
#> 2 AA 0.364 32729
#> 3 B6 9.46 54635
#> 4 DL 1.64 48110
#> 5 EV 15.8 54173
#> 6 MQ 10.8 26397
#> 7 UA 3.56 58665
#> 8 US 2.13 20536
#> 9 WN 9.65 12275
Exercise 3
Based on these data, what airport is the worst to fly in to with regards to arriving on time? In other words, which airport (dest
) is associated with the highest average arrival delays?
Solution (click here)
flights %>%
group_by(dest) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
#> # A tibble: 105 × 2
#> dest mean_delay
#> <chr> <dbl>
#> 1 CAE 41.8
#> 2 TUL 33.7
#> 3 OKC 30.6
#> 4 JAC 28.1
#> 5 TYS 24.1
#> 6 MSN 20.2
#> 7 RIC 20.1
#> 8 CAK 19.7
#> 9 DSM 19.0
#> 10 GRR 18.2
#> # … with 95 more rows