S04E14: R for Data Science - Chapter 5.6: summarize, some more

Continuing to use summarize() to extract information from our data


Get data and package

# only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")

Load libraries

library(tidyverse) # for everything
library(palmerpenguins) # for penguins data
library(nycflights13) # for flights data

Preview the Dataset

We are going to start with the penguins.

glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


Review of summarize()

Different from select() (which picks columns but retains all rows) and filter() (which picks observations but retains all columns), summarize() creates a wholly new dataframe by making calculations or pulling parts of your original dataframe.

For example, if we want to know the mean body_mass_g across all penguins in the dataset, we can calculate that like this:

penguins %>%
  summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass
                                  na.rm = TRUE)) # remove the NAs
#> # A tibble: 1 × 1
#>   mean_body_mass
#>            <dbl>
#> 1          4202.

But we may actually be interested in the different mean body_mass_g calculated separately for the different species of penguins. We can do that by adding the function group_by() like this:

penguins %>%
  group_by(species) %>% # group by species
  summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass by species
                                  na.rm = TRUE)) # remove the NAs
#> # A tibble: 3 × 2
#>   species   mean_body_mass
#>   <fct>              <dbl>
#> 1 Adelie             3701.
#> 2 Chinstrap          3733.
#> 3 Gentoo             5076.

And, you can also add more than one group_by() factor, for example, by the full combination species and sex variables.

penguins %>%
  group_by(species, sex) %>% # group by species and sex combos
  summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass by species
                                  na.rm = TRUE)) # remove the NAs
#> # A tibble: 8 × 3
#> # Groups:   species [3]
#>   species   sex    mean_body_mass
#>   <fct>     <fct>           <dbl>
#> 1 Adelie    female          3369.
#> 2 Adelie    male            4043.
#> 3 Adelie    NA              3540 
#> 4 Chinstrap female          3527.
#> 5 Chinstrap male            3939.
#> 6 Gentoo    female          4680.
#> 7 Gentoo    male            5485.
#> 8 Gentoo    NA              4588.


Using more functions in summarize()

Last week we went over using a few different functions within summarize().

  • mean(): calculates the mean (set na.rm = TRUE if you was the NAs to be removed before calculating)
  • n(): counts the number of observations

But, there are lots more functions you might be interested to add into summarize(), including:

  • median(): when you want to calculate the median instead of the mean
  • sd(): calculates the standard deviation
  • IQR(): calculates the interquartile range
  • sum(): calculates the sum
  • min(): calculates the minimum
  • max(): calculates the maximum
  • n_distinct(): calculates the number of distinct or unique occurences

You can always combine functions or write your own functions to use within summarize() as well. For example, let’s say you wanted the standard error of the mean, but there isn’t a function built into tidyverse or base R that does that. You can still get your standard error but using a combination of other functions!

penguins %>%
  group_by(species) %>%
  drop_na(species, body_mass_g) %>% # dropping observations with missing values
  summarize(mean_body_mass = mean(body_mass_g), # mean body mass
            n_observations = n(), # how many observations are there
            std_error_body_mass = sd(body_mass_g)/(sqrt(n()))) # sd/sqrt(n)
#> # A tibble: 3 × 4
#>   species   mean_body_mass n_observations std_error_body_mass
#>   <fct>              <dbl>          <int>               <dbl>
#> 1 Adelie             3701.            151                37.3
#> 2 Chinstrap          3733.             68                46.6
#> 3 Gentoo             5076.            123                45.5


Breakout Exercises 1

Back to the flights tibble for each of the following exercises.

Exercise 1.1

Determine the mean, standard deviation, min, and max departure delay for each airport of origin. Also determine how many observations you have for each airport of origin.

Hints (click here)

Group the tibble by origin. Summarize for each summary you want to calculate. Make sure you exclude missing data.

Solution (click here)
flights %>%
  group_by(origin) %>%
  drop_na(dep_delay) %>%
  summarize(mean_dep_delay = mean(dep_delay),
            sd_dep_delay = sd(dep_delay),
            min_dep_delay = min(dep_delay),
            max_dep_delay = max(dep_delay),
            n_dep_delay = n())
#> # A tibble: 3 × 6
#>   origin mean_dep_delay sd_dep_delay min_dep_delay max_dep_delay n_dep_delay
#>   <chr>           <dbl>        <dbl>         <dbl>         <dbl>       <int>
#> 1 EWR              15.1         41.3           -25          1126      117596
#> 2 JFK              12.1         39.0           -43          1301      109416
#> 3 LGA              10.3         40.0           -33           911      101509

Exercise 1.2

How many distinct destinations are there to fly from for each EWR, JFK, and LGA?

Hints (click here)

Think about what you want to group_by() and what of the new functions we talked about today you could use in summarize().

Solution (click here)
flights %>%
  group_by(origin) %>% 
  summarize(n_distinct_dest = n_distinct(dest))
#> # A tibble: 3 × 2
#>   origin n_distinct_dest
#>   <chr>            <int>
#> 1 EWR                 86
#> 2 JFK                 70
#> 3 LGA                 68

Using slice_()

Using slice_() you can subset rows based on their position.

penguins %>%
  drop_na(species, body_mass_g) %>% # dropping observations with missing values
  group_by(species) %>%
  summarize(mean_body_mass = mean(body_mass_g)) %>%
  slice_min(mean_body_mass, n = 1) # min based on mean_body_mass, just give me 1
#> # A tibble: 1 × 2
#>   species mean_body_mass
#>   <fct>            <dbl>
#> 1 Adelie           3701.

You can use slice_() functions on dataframes themselves (not just on summarized data).

For example:

penguins %>%
  drop_na(body_mass_g) %>% # dropping observations with missing values
  slice_max(body_mass_g, n = 10) # the 10 heaviest penguins
#> # A tibble: 11 × 8
#>    species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex    year
#>    <fct>   <fct>           <dbl>         <dbl>         <int>   <int> <fct> <int>
#>  1 Gentoo  Biscoe           49.2          15.2           221    6300 male   2007
#>  2 Gentoo  Biscoe           59.6          17             230    6050 male   2007
#>  3 Gentoo  Biscoe           51.1          16.3           220    6000 male   2008
#>  4 Gentoo  Biscoe           48.8          16.2           222    6000 male   2009
#>  5 Gentoo  Biscoe           45.2          16.4           223    5950 male   2008
#>  6 Gentoo  Biscoe           49.8          15.9           229    5950 male   2009
#>  7 Gentoo  Biscoe           48.4          14.6           213    5850 male   2007
#>  8 Gentoo  Biscoe           49.3          15.7           217    5850 male   2007
#>  9 Gentoo  Biscoe           55.1          16             230    5850 male   2009
#> 10 Gentoo  Biscoe           49.5          16.2           229    5800 male   2008
#> 11 Gentoo  Biscoe           48.6          16             230    5800 male   2008
#> # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Note, we actually got 11, that is because in 10th place there is a tie. Ties are kept by default, but can be turned off by setting with_ties = FALSE.

You can also set the proportion of results to return using prop =.

penguins %>%
  drop_na(body_mass_g) %>% # dropping observations with missing values
  slice_max(body_mass_g, prop = 0.1) # the top 10% heaviest penguins
#> # A tibble: 34 × 8
#>    species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex    year
#>    <fct>   <fct>           <dbl>         <dbl>         <int>   <int> <fct> <int>
#>  1 Gentoo  Biscoe           49.2          15.2           221    6300 male   2007
#>  2 Gentoo  Biscoe           59.6          17             230    6050 male   2007
#>  3 Gentoo  Biscoe           51.1          16.3           220    6000 male   2008
#>  4 Gentoo  Biscoe           48.8          16.2           222    6000 male   2009
#>  5 Gentoo  Biscoe           45.2          16.4           223    5950 male   2008
#>  6 Gentoo  Biscoe           49.8          15.9           229    5950 male   2009
#>  7 Gentoo  Biscoe           48.4          14.6           213    5850 male   2007
#>  8 Gentoo  Biscoe           49.3          15.7           217    5850 male   2007
#>  9 Gentoo  Biscoe           55.1          16             230    5850 male   2009
#> 10 Gentoo  Biscoe           49.5          16.2           229    5800 male   2008
#> # … with 24 more rows, and abbreviated variable names ¹​flipper_length_mm,
#> #   ²​body_mass_g

If data are grouped, the operation will be performed group wise, like we see below.

penguins %>%
  drop_na(body_mass_g, species) %>% # dropping observations with missing values
  group_by(species) %>%
  slice_min(body_mass_g, n = 1) 
#> # A tibble: 4 × 8
#> # Groups:   species [3]
#>   species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
#>   <fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
#> 1 Adelie    Biscoe           36.5          16.6          181    2850 fema…  2008
#> 2 Adelie    Biscoe           36.4          17.1          184    2850 fema…  2008
#> 3 Chinstrap Dream            46.9          16.6          192    2700 fema…  2008
#> 4 Gentoo    Biscoe           42.7          13.7          208    3950 fema…  2008
#> # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

We get the minimum body_mass_g for each species, and in this case we get 2 for Adelie penguins because there is a tie.



Using summarize() with across()

There are also ways that you can combine summarize() to function across() your different variables. The function where() allows you to pick variables where the function is TRUE.

penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarize(across(where(is.numeric), mean))
#> # A tibble: 3 × 6
#>   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1 Adelie              38.8          18.3              190.       3706. 2008.
#> 2 Chinstrap           48.8          18.4              196.       3733. 2008.
#> 3 Gentoo              47.6          15.0              217.       5092. 2008.
# from column bill_length_mm to column flipper_length_mm
penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarize(across(bill_length_mm:flipper_length_mm, mean))
#> # A tibble: 3 × 4
#>   species   bill_length_mm bill_depth_mm flipper_length_mm
#>   <fct>              <dbl>         <dbl>             <dbl>
#> 1 Adelie              38.8          18.3              190.
#> 2 Chinstrap           48.8          18.4              196.
#> 3 Gentoo              47.6          15.0              217.
# using the helper contains()
penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarize(across(contains("mm"), mean))
#> # A tibble: 3 × 4
#>   species   bill_length_mm bill_depth_mm flipper_length_mm
#>   <fct>              <dbl>         <dbl>             <dbl>
#> 1 Adelie              38.8          18.3              190.
#> 2 Chinstrap           48.8          18.4              196.
#> 3 Gentoo              47.6          15.0              217.

All of these helper functions can be used outside of summarize() too.



Breakout Exercises 2

Let’s practice using slice() and summarize() with across(), still using the flights data.

Exercise 2.1

What is the tail number (tailnum) for the plane that has, on average, the least arrival delay?? What about the most arrival delay? How many flights did this plane take in this dataset? How would your answer change if you required that a plane take at least 50 flights?

Hints (click here)

Group the tibble by the tailnum variable, and summarize to get mean arr_delay and also n(). Then pick the right filter() parameters if necessary, and the appropriate slice_() function.

Solution (click here)

Least delayed flights

# any number of flights
flights %>%
  group_by(tailnum) %>%
  summarize(mean_arr_delay = mean(arr_delay),
            n_flights = n()) %>%
  slice_min(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#>   tailnum mean_arr_delay n_flights
#>   <chr>            <dbl>     <int>
#> 1 N560AS             -53         1
# at least 50 flights
flights %>%
  group_by(tailnum) %>%
  summarize(mean_arr_delay = mean(arr_delay),
            n_flights = n()) %>%
  filter(n_flights >= 50) %>%
  slice_min(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#>   tailnum mean_arr_delay n_flights
#>   <chr>            <dbl>     <int>
#> 1 N548UW           -13.0        52

Most delayed flights

# any number of flights
flights %>%
  group_by(tailnum) %>%
  summarize(mean_arr_delay = mean(arr_delay),
            n_flights = n()) %>%
  slice_max(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#>   tailnum mean_arr_delay n_flights
#>   <chr>            <dbl>     <int>
#> 1 N844MH             320         1
# at least 50 flights
flights %>%
  group_by(tailnum) %>%
  summarize(mean_arr_delay = mean(arr_delay),
            n_flights = n()) %>%
  filter(n_flights >= 50) %>%
  slice_max(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#>   tailnum mean_arr_delay n_flights
#>   <chr>            <dbl>     <int>
#> 1 N8698A            27.5        57


Exercise 2.2

Calculate the median for air_time, arr_delay and dep_delay by origin. Try to not do each manually.

Hints (click here)

Group the tibble by the origin variable, and combine summarize() and across() to get the median for each variable. Try listing them together using c().

Solution (click here)
flights %>%
  drop_na(air_time, arr_delay, dep_delay, origin) %>%
  group_by(origin) %>%
  summarize(across(c(air_time, arr_delay, dep_delay), median, 
                   .names = "{col}_median")) # extra fun thing to rename columns
#> # A tibble: 3 × 4
#>   origin air_time_median arr_delay_median dep_delay_median
#>   <chr>            <dbl>            <dbl>            <dbl>
#> 1 EWR                130               -4               -1
#> 2 JFK                149               -6               -1
#> 3 LGA                115               -5               -3

Exercise 2.3

Which destinations have the longest maximum arrival delay? Which destinations have the shortest? Pull data for the top 10 for both the longest and shortest maximum arrival delay. Keep track of how many flights there are to each location in case that might be useful information.

Hints (click here)

Group the tibble by the dest variable, and summarize to get max arr_delay and also n(). Then pick the right slice_() function.

Solution (click here) Least delayed
flights %>%
  drop_na(dest, arr_delay) %>%
  group_by(dest) %>%
  summarize(max_arr_delay = max(arr_delay),
            n_flights = n()) %>%
  slice_min(max_arr_delay, n = 10)
#> # A tibble: 10 × 3
#>    dest  max_arr_delay n_flights
#>    <chr>         <dbl>     <int>
#>  1 LEX             -22         1
#>  2 PSP              17        18
#>  3 ANC              39         8
#>  4 HDN              43        14
#>  5 EYW              45        17
#>  6 SBN              53        10
#>  7 MTJ             101        14
#>  8 ILM             143       107
#>  9 ABQ             153       254
#> 10 BZN             154        35

Most delayed

flights %>%
  drop_na(dest, arr_delay) %>%
  group_by(dest) %>%
  summarize(max_arr_delay = max(arr_delay),
            n_flights = n()) %>%
  slice_max(max_arr_delay, n = 10)
#> # A tibble: 10 × 3
#>    dest  max_arr_delay n_flights
#>    <chr>         <dbl>     <int>
#>  1 HNL            1272       701
#>  2 CMH            1127      3326
#>  3 ORD            1109     16566
#>  4 SFO            1007     13173
#>  5 CVG             989      3725
#>  6 TPA             931      7390
#>  7 MSP             915      6929
#>  8 ATL             895     16837
#>  9 MIA             878     11593
#> 10 LAS             852      5952


Jessica Cooperstone
Jessica Cooperstone
Assistant Professor at HCS