S04E13: R for Data Science - Chapter 5.6: summarize

The first of two parts on dplyr’s summarize() function


Get the Dataset and Packages

# Only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
# Load the flights dataset and tidyverse
library(nycflights13)
library(tidyverse)

Preview the Dataset

glimpse(flights)
#> Rows: 336,776
#> Columns: 19
#> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
#> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
#> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
#> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
#> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
#> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
#> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
#> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
#> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
#> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
#> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
#> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
#> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
#> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
#> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…


Review of select(), filter(), mutate()

select()

The flights tibble has 19 variables – to keep things simple, we’ll focus on just a few of these for now. Let’s choose the variables (columns) carrier, flight, air_time, and dep_delay:

flights_exp <- select(flights, carrier, flight, air_time, dep_delay)

flights_exp
#> # A tibble: 336,776 × 4
#>    carrier flight air_time dep_delay
#>    <chr>    <int>    <dbl>     <dbl>
#>  1 UA        1545      227         2
#>  2 UA        1714      227         4
#>  3 AA        1141      160         2
#>  4 B6         725      183        -1
#>  5 DL         461      116        -6
#>  6 UA        1696      150        -4
#>  7 B6         507      158        -5
#>  8 EV        5708       53        -3
#>  9 B6          79      140        -3
#> 10 AA         301      138        -2
#> # … with 336,766 more rows

filter()

There are >336,000 observations (flights) in this dataset. Let’s reduce it to just American Airlines flights:

flights_exp <- filter(flights_exp, carrier == "AA")

flights_exp
#> # A tibble: 32,729 × 4
#>    carrier flight air_time dep_delay
#>    <chr>    <int>    <dbl>     <dbl>
#>  1 AA        1141      160         2
#>  2 AA         301      138        -2
#>  3 AA         707      257        -1
#>  4 AA        1895      152        -4
#>  5 AA        1837      153        13
#>  6 AA         413      192        -2
#>  7 AA         303      140        -1
#>  8 AA         711      248         0
#>  9 AA         305      143        -4
#> 10 AA        1815      142        -3
#> # … with 32,719 more rows

mutate()

The column air_time is measured in minutes. What if we wanted a new column air_time_hrs that reports the air time in hours?

flights_exp <- mutate(flights_exp, air_time_hrs = air_time/60)

flights_exp
#> # A tibble: 32,729 × 5
#>    carrier flight air_time dep_delay air_time_hrs
#>    <chr>    <int>    <dbl>     <dbl>        <dbl>
#>  1 AA        1141      160         2         2.67
#>  2 AA         301      138        -2         2.3 
#>  3 AA         707      257        -1         4.28
#>  4 AA        1895      152        -4         2.53
#>  5 AA        1837      153        13         2.55
#>  6 AA         413      192        -2         3.2 
#>  7 AA         303      140        -1         2.33
#>  8 AA         711      248         0         4.13
#>  9 AA         305      143        -4         2.38
#> 10 AA        1815      142        -3         2.37
#> # … with 32,719 more rows


Section 5.6: summarise()

Here’s an example of how to use summarise() in the simplest way. Notice the similarity in syntax with mutate():

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   delay
#>   <dbl>
#> 1  12.6

summarise() is typically combined with group_by(), calculating the new summarized variable separately for each set of grouped observations in the dataset:

by_day <- group_by(flights, year, month, day)

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 × 4
#> # Groups:   year, month [12]
#>     year month   day delay
#>    <int> <int> <int> <dbl>
#>  1  2013     1     1 11.5 
#>  2  2013     1     2 13.9 
#>  3  2013     1     3 11.0 
#>  4  2013     1     4  8.95
#>  5  2013     1     5  5.73
#>  6  2013     1     6  7.15
#>  7  2013     1     7  5.42
#>  8  2013     1     8  2.55
#>  9  2013     1     9  2.28
#> 10  2013     1    10  2.84
#> # … with 355 more rows

And more than one new summarized variable can be calculated - here, there are 3:

by_dest <- group_by(flights, dest)

delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
  )

delay
#> # A tibble: 105 × 4
#>    dest  count  dist delay
#>    <chr> <int> <dbl> <dbl>
#>  1 ABQ     254 1826   4.38
#>  2 ACK     265  199   4.85
#>  3 ALB     439  143  14.4 
#>  4 ANC       8 3370  -2.5 
#>  5 ATL   17215  757. 11.3 
#>  6 AUS    2439 1514.  6.02
#>  7 AVL     275  584.  8.00
#>  8 BDL     443  116   7.05
#>  9 BGR     375  378   8.03
#> 10 BHM     297  866. 16.9 
#> # … with 95 more rows

Let’s look a bit closer at what this is doing (note, this is not in the book):

flights_sub <- select(flights, dest, distance, arr_delay)

flights_sub
#> # A tibble: 336,776 × 3
#>    dest  distance arr_delay
#>    <chr>    <dbl>     <dbl>
#>  1 IAH       1400        11
#>  2 IAH       1416        20
#>  3 MIA       1089        33
#>  4 BQN       1576       -18
#>  5 ATL        762       -25
#>  6 ORD        719        12
#>  7 FLL       1065        19
#>  8 IAD        229       -14
#>  9 MCO        944        -8
#> 10 ORD        733         8
#> # … with 336,766 more rows
albuquerque_data <- filter(flights_sub, dest == "ABQ")

albuquerque_data
#> # A tibble: 254 × 3
#>    dest  distance arr_delay
#>    <chr>    <dbl>     <dbl>
#>  1 ABQ       1826       -35
#>  2 ABQ       1826       -18
#>  3 ABQ       1826       -16
#>  4 ABQ       1826        16
#>  5 ABQ       1826       -20
#>  6 ABQ       1826       -14
#>  7 ABQ       1826       -15
#>  8 ABQ       1826       -32
#>  9 ABQ       1826       -28
#> 10 ABQ       1826       -13
#> # … with 244 more rows
mean_albuquerque_delay <- mean(albuquerque_data$arr_delay, na.rm = TRUE)

mean_albuquerque_delay
#> [1] 4.38189

Compare the value mean_albuquerque_delay to that in the summarized delay tibble created above:

mean_albuquerque_delay
#> [1] 4.38189
delay
#> # A tibble: 105 × 4
#>    dest  count  dist delay
#>    <chr> <int> <dbl> <dbl>
#>  1 ABQ     254 1826   4.38
#>  2 ACK     265  199   4.85
#>  3 ALB     439  143  14.4 
#>  4 ANC       8 3370  -2.5 
#>  5 ATL   17215  757. 11.3 
#>  6 AUS    2439 1514.  6.02
#>  7 AVL     275  584.  8.00
#>  8 BDL     443  116   7.05
#>  9 BGR     375  378   8.03
#> 10 BHM     297  866. 16.9 
#> # … with 95 more rows


Breakout Exercises 1

Start with the original flights tibble for each of the following exercises.

Exercise 1

Overall, which carrier had the longest mean arrival delay (arr_delay) in 2013? (Note, all these data are from 2013.)

Hints (click here)

Group the tibble by the carrier variable, then summarise to calculate the mean arr_delay for each group. Remember to ignore missing values by setting na.rm = TRUE within the mean() function.

Solution (click here)
carrier_grp <- group_by(flights, carrier)

summarise(carrier_grp,
  mean_delay = mean(arr_delay, na.rm = TRUE)
  ) 
#> # A tibble: 16 × 2
#>    carrier mean_delay
#>    <chr>        <dbl>
#>  1 9E           7.38 
#>  2 AA           0.364
#>  3 AS          -9.93 
#>  4 B6           9.46 
#>  5 DL           1.64 
#>  6 EV          15.8  
#>  7 F9          21.9  
#>  8 FL          20.1  
#>  9 HA          -6.92 
#> 10 MQ          10.8  
#> 11 OO          11.9  
#> 12 UA           3.56 
#> 13 US           2.13 
#> 14 VX           1.76 
#> 15 WN           9.65 
#> 16 YV          15.6

Exercise 2

Evaluate arrival delay by carrier again, but this time, evaluate only for carriers who made at least 10,000 flights.

Hints (click here)

Include a second variable in the summarized data from above to reflect the number of observations that went in to calculating the mean value for each group, then filter on this sample size variable. Consider using the function n().

Solution (click here)
carrier_delays <- summarise(carrier_grp,
  mean_delay = mean(arr_delay, na.rm = TRUE),
  n = n()
  )

filter(carrier_delays, n >= 10000)
#> # A tibble: 9 × 3
#>   carrier mean_delay     n
#>   <chr>        <dbl> <int>
#> 1 9E           7.38  18460
#> 2 AA           0.364 32729
#> 3 B6           9.46  54635
#> 4 DL           1.64  48110
#> 5 EV          15.8   54173
#> 6 MQ          10.8   26397
#> 7 UA           3.56  58665
#> 8 US           2.13  20536
#> 9 WN           9.65  12275

Exercise 3

Based on these data, what airport is the worst to fly in to with regards to arriving on time? In other words, which airport (dest) is associated with the highest average arrival delays?

Hints (click here)

Group the tibble by the dest variable, summarise with the mean arr_delay for each group, then use arrange() to sort the new variable in descending order, which can be done with desc().

Solution (click here)
dest_grp <- group_by(flights, dest)

dest_delays <- summarise(dest_grp,
  mean_delay = mean(arr_delay, na.rm = TRUE)
  )

arrange(dest_delays, desc(mean_delay))
#> # A tibble: 105 × 2
#>    dest  mean_delay
#>    <chr>      <dbl>
#>  1 CAE         41.8
#>  2 TUL         33.7
#>  3 OKC         30.6
#>  4 JAC         28.1
#>  5 TYS         24.1
#>  6 MSN         20.2
#>  7 RIC         20.1
#>  8 CAK         19.7
#>  9 DSM         19.0
#> 10 GRR         18.2
#> # … with 95 more rows


Using Pipes

We did this above…

by_dest <- group_by(flights, dest)

delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
  )

That code can be rewritten with the pipe as follows…

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )

Notice here we didn’t have to create intermediate/temporary tibbles.



Breakout Exercises 2

Here we’ll redo the same analyses as in the first breakout session, but this time using the pipe.

Exercise 1

Overall, which carrier had the longest mean arrival delay (arr_delay) in 2013? (Note, all these data are from 2013.)

Solution (click here)
flights %>% 
  group_by(carrier) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE))
#> # A tibble: 16 × 2
#>    carrier mean_delay
#>    <chr>        <dbl>
#>  1 9E           7.38 
#>  2 AA           0.364
#>  3 AS          -9.93 
#>  4 B6           9.46 
#>  5 DL           1.64 
#>  6 EV          15.8  
#>  7 F9          21.9  
#>  8 FL          20.1  
#>  9 HA          -6.92 
#> 10 MQ          10.8  
#> 11 OO          11.9  
#> 12 UA           3.56 
#> 13 US           2.13 
#> 14 VX           1.76 
#> 15 WN           9.65 
#> 16 YV          15.6

Exercise 2

Evaluate arrival delay by carrier again, but this time, evaluate only for carriers who made at least 10,000 flights.

Solution (click here)
flights %>%
  group_by(carrier) %>%
  summarise(
    mean_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
    ) %>%
  filter(n > 10000)
#> # A tibble: 9 × 3
#>   carrier mean_delay     n
#>   <chr>        <dbl> <int>
#> 1 9E           7.38  18460
#> 2 AA           0.364 32729
#> 3 B6           9.46  54635
#> 4 DL           1.64  48110
#> 5 EV          15.8   54173
#> 6 MQ          10.8   26397
#> 7 UA           3.56  58665
#> 8 US           2.13  20536
#> 9 WN           9.65  12275

Exercise 3

Based on these data, what airport is the worst to fly in to with regards to arriving on time? In other words, which airport (dest) is associated with the highest average arrival delays?

Solution (click here)
flights %>%
  group_by(dest) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))
#> # A tibble: 105 × 2
#>    dest  mean_delay
#>    <chr>      <dbl>
#>  1 CAE         41.8
#>  2 TUL         33.7
#>  3 OKC         30.6
#>  4 JAC         28.1
#>  5 TYS         24.1
#>  6 MSN         20.2
#>  7 RIC         20.1
#>  8 CAK         19.7
#>  9 DSM         19.0
#> 10 GRR         18.2
#> # … with 95 more rows


Mike Sovic
Mike Sovic
Bioinformatician at CAPS