S04E14: R for Data Science - Chapter 5.6: summarize, some more
Continuing to use summarize() to extract information from our data
Get data and package
# only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
install.packages("palmerpenguins")
Load libraries
library(tidyverse) # for everything
library(palmerpenguins) # for penguins data
library(nycflights13) # for flights data
Preview the Dataset
We are going to start with the penguins.
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Review of summarize()
Different from select()
(which picks columns but retains all rows) and filter()
(which picks observations but retains all columns), summarize()
creates a wholly new dataframe by making calculations or pulling parts of your original dataframe.
For example, if we want to know the mean body_mass_g
across all penguins in the dataset, we can calculate that like this:
penguins %>%
summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass
na.rm = TRUE)) # remove the NAs
#> # A tibble: 1 × 1
#> mean_body_mass
#> <dbl>
#> 1 4202.
But we may actually be interested in the different mean body_mass_g
calculated separately for the different species
of penguins. We can do that by adding the function group_by()
like this:
penguins %>%
group_by(species) %>% # group by species
summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass by species
na.rm = TRUE)) # remove the NAs
#> # A tibble: 3 × 2
#> species mean_body_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.
And, you can also add more than one group_by()
factor, for example, by the full combination species
and sex
variables.
penguins %>%
group_by(species, sex) %>% # group by species and sex combos
summarize(mean_body_mass = mean(body_mass_g, # calculate mean body mass by species
na.rm = TRUE)) # remove the NAs
#> # A tibble: 8 × 3
#> # Groups: species [3]
#> species sex mean_body_mass
#> <fct> <fct> <dbl>
#> 1 Adelie female 3369.
#> 2 Adelie male 4043.
#> 3 Adelie NA 3540
#> 4 Chinstrap female 3527.
#> 5 Chinstrap male 3939.
#> 6 Gentoo female 4680.
#> 7 Gentoo male 5485.
#> 8 Gentoo NA 4588.
Using more functions in summarize()
Last week we went over using a few different functions within summarize()
.
mean()
: calculates the mean (setna.rm = TRUE
if you was the NAs to be removed before calculating)n()
: counts the number of observations
But, there are lots more functions you might be interested to add into summarize()
, including:
median()
: when you want to calculate the median instead of the meansd()
: calculates the standard deviationIQR()
: calculates the interquartile rangesum()
: calculates the summin()
: calculates the minimummax()
: calculates the maximumn_distinct()
: calculates the number of distinct or unique occurences
You can always combine functions or write your own functions to use within summarize()
as well. For example, let’s say you wanted the standard error of the mean, but there isn’t a function built into tidyverse or base R that does that. You can still get your standard error but using a combination of other functions!
penguins %>%
group_by(species) %>%
drop_na(species, body_mass_g) %>% # dropping observations with missing values
summarize(mean_body_mass = mean(body_mass_g), # mean body mass
n_observations = n(), # how many observations are there
std_error_body_mass = sd(body_mass_g)/(sqrt(n()))) # sd/sqrt(n)
#> # A tibble: 3 × 4
#> species mean_body_mass n_observations std_error_body_mass
#> <fct> <dbl> <int> <dbl>
#> 1 Adelie 3701. 151 37.3
#> 2 Chinstrap 3733. 68 46.6
#> 3 Gentoo 5076. 123 45.5
Breakout Exercises 1
Back to the flights
tibble for each of the following exercises.
Exercise 1.1
Determine the mean, standard deviation, min, and max departure delay for each airport of origin. Also determine how many observations you have for each airport of origin.
Hints (click here)
Group the tibble by origin
. Summarize for each summary you want to calculate. Make sure you exclude missing data.
Solution (click here)
flights %>%
group_by(origin) %>%
drop_na(dep_delay) %>%
summarize(mean_dep_delay = mean(dep_delay),
sd_dep_delay = sd(dep_delay),
min_dep_delay = min(dep_delay),
max_dep_delay = max(dep_delay),
n_dep_delay = n())
#> # A tibble: 3 × 6
#> origin mean_dep_delay sd_dep_delay min_dep_delay max_dep_delay n_dep_delay
#> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 EWR 15.1 41.3 -25 1126 117596
#> 2 JFK 12.1 39.0 -43 1301 109416
#> 3 LGA 10.3 40.0 -33 911 101509
Exercise 1.2
How many distinct destinations are there to fly from for each EWR, JFK, and LGA?
Hints (click here)
Think about what you want to group_by()
and what of the new functions we talked about today you could use in summarize()
.
Solution (click here)
flights %>%
group_by(origin) %>%
summarize(n_distinct_dest = n_distinct(dest))
#> # A tibble: 3 × 2
#> origin n_distinct_dest
#> <chr> <int>
#> 1 EWR 86
#> 2 JFK 70
#> 3 LGA 68
Using slice_()
Using slice_()
you can subset rows based on their position.
penguins %>%
drop_na(species, body_mass_g) %>% # dropping observations with missing values
group_by(species) %>%
summarize(mean_body_mass = mean(body_mass_g)) %>%
slice_min(mean_body_mass, n = 1) # min based on mean_body_mass, just give me 1
#> # A tibble: 1 × 2
#> species mean_body_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
slice()
allows you to pick rows by the position (i.e., indexing, e.g.,slice(4)
will give you the 4th row of that df)slice_head()
andslice_tail()
select the first or last rows.slice_sample()
randomly selects rows.slice_min()
andslice_max()
select rows with highest or lowest values of a variable.
You can use slice_()
functions on dataframes themselves (not just on summarized data).
For example:
penguins %>%
drop_na(body_mass_g) %>% # dropping observations with missing values
slice_max(body_mass_g, n = 10) # the 10 heaviest penguins
#> # A tibble: 11 × 8
#> species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Gentoo Biscoe 49.2 15.2 221 6300 male 2007
#> 2 Gentoo Biscoe 59.6 17 230 6050 male 2007
#> 3 Gentoo Biscoe 51.1 16.3 220 6000 male 2008
#> 4 Gentoo Biscoe 48.8 16.2 222 6000 male 2009
#> 5 Gentoo Biscoe 45.2 16.4 223 5950 male 2008
#> 6 Gentoo Biscoe 49.8 15.9 229 5950 male 2009
#> 7 Gentoo Biscoe 48.4 14.6 213 5850 male 2007
#> 8 Gentoo Biscoe 49.3 15.7 217 5850 male 2007
#> 9 Gentoo Biscoe 55.1 16 230 5850 male 2009
#> 10 Gentoo Biscoe 49.5 16.2 229 5800 male 2008
#> 11 Gentoo Biscoe 48.6 16 230 5800 male 2008
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
Note, we actually got 11, that is because in 10th place there is a tie. Ties are kept by default, but can be turned off by setting with_ties = FALSE
.
You can also set the proportion of results to return using prop =
.
penguins %>%
drop_na(body_mass_g) %>% # dropping observations with missing values
slice_max(body_mass_g, prop = 0.1) # the top 10% heaviest penguins
#> # A tibble: 34 × 8
#> species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Gentoo Biscoe 49.2 15.2 221 6300 male 2007
#> 2 Gentoo Biscoe 59.6 17 230 6050 male 2007
#> 3 Gentoo Biscoe 51.1 16.3 220 6000 male 2008
#> 4 Gentoo Biscoe 48.8 16.2 222 6000 male 2009
#> 5 Gentoo Biscoe 45.2 16.4 223 5950 male 2008
#> 6 Gentoo Biscoe 49.8 15.9 229 5950 male 2009
#> 7 Gentoo Biscoe 48.4 14.6 213 5850 male 2007
#> 8 Gentoo Biscoe 49.3 15.7 217 5850 male 2007
#> 9 Gentoo Biscoe 55.1 16 230 5850 male 2009
#> 10 Gentoo Biscoe 49.5 16.2 229 5800 male 2008
#> # … with 24 more rows, and abbreviated variable names ¹flipper_length_mm,
#> # ²body_mass_g
If data are grouped, the operation will be performed group wise, like we see below.
penguins %>%
drop_na(body_mass_g, species) %>% # dropping observations with missing values
group_by(species) %>%
slice_min(body_mass_g, n = 1)
#> # A tibble: 4 × 8
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex year
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
#> 1 Adelie Biscoe 36.5 16.6 181 2850 fema… 2008
#> 2 Adelie Biscoe 36.4 17.1 184 2850 fema… 2008
#> 3 Chinstrap Dream 46.9 16.6 192 2700 fema… 2008
#> 4 Gentoo Biscoe 42.7 13.7 208 3950 fema… 2008
#> # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
We get the minimum body_mass_g
for each species
, and in this case we get 2 for Adelie penguins because there is a tie.
Using summarize()
with across()
There are also ways that you can combine summarize()
to function across()
your different variables. The function where()
allows you to pick variables where the function is TRUE
.
penguins %>%
drop_na() %>%
group_by(species) %>%
summarize(across(where(is.numeric), mean))
#> # A tibble: 3 × 6
#> species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 38.8 18.3 190. 3706. 2008.
#> 2 Chinstrap 48.8 18.4 196. 3733. 2008.
#> 3 Gentoo 47.6 15.0 217. 5092. 2008.
# from column bill_length_mm to column flipper_length_mm
penguins %>%
drop_na() %>%
group_by(species) %>%
summarize(across(bill_length_mm:flipper_length_mm, mean))
#> # A tibble: 3 × 4
#> species bill_length_mm bill_depth_mm flipper_length_mm
#> <fct> <dbl> <dbl> <dbl>
#> 1 Adelie 38.8 18.3 190.
#> 2 Chinstrap 48.8 18.4 196.
#> 3 Gentoo 47.6 15.0 217.
# using the helper contains()
penguins %>%
drop_na() %>%
group_by(species) %>%
summarize(across(contains("mm"), mean))
#> # A tibble: 3 × 4
#> species bill_length_mm bill_depth_mm flipper_length_mm
#> <fct> <dbl> <dbl> <dbl>
#> 1 Adelie 38.8 18.3 190.
#> 2 Chinstrap 48.8 18.4 196.
#> 3 Gentoo 47.6 15.0 217.
All of these helper functions can be used outside of summarize()
too.
Breakout Exercises 2
Let’s practice using slice()
and summarize()
with across()
, still using the flights
data.
Exercise 2.1
What is the tail number (tailnum
) for the plane that has, on average, the least arrival delay?? What about the most arrival delay? How many flights did this plane take in this dataset? How would your answer change if you required that a plane take at least 50 flights?
Hints (click here)
Group the tibble by the tailnum
variable, and summarize to get mean arr_delay
and also n()
. Then pick the right filter()
parameters if necessary, and the appropriate slice_()
function.
Solution (click here)
Least delayed flights
# any number of flights
flights %>%
group_by(tailnum) %>%
summarize(mean_arr_delay = mean(arr_delay),
n_flights = n()) %>%
slice_min(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#> tailnum mean_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 N560AS -53 1
# at least 50 flights
flights %>%
group_by(tailnum) %>%
summarize(mean_arr_delay = mean(arr_delay),
n_flights = n()) %>%
filter(n_flights >= 50) %>%
slice_min(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#> tailnum mean_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 N548UW -13.0 52
Most delayed flights
# any number of flights
flights %>%
group_by(tailnum) %>%
summarize(mean_arr_delay = mean(arr_delay),
n_flights = n()) %>%
slice_max(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#> tailnum mean_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 N844MH 320 1
# at least 50 flights
flights %>%
group_by(tailnum) %>%
summarize(mean_arr_delay = mean(arr_delay),
n_flights = n()) %>%
filter(n_flights >= 50) %>%
slice_max(mean_arr_delay, n = 1)
#> # A tibble: 1 × 3
#> tailnum mean_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 N8698A 27.5 57
Exercise 2.2
Calculate the median for air_time
, arr_delay
and dep_delay
by origin
. Try to not do each manually.
Hints (click here)
Group the tibble by the origin
variable, and combine summarize()
and across()
to get the median for each variable. Try listing them together using c()
.
Solution (click here)
flights %>%
drop_na(air_time, arr_delay, dep_delay, origin) %>%
group_by(origin) %>%
summarize(across(c(air_time, arr_delay, dep_delay), median,
.names = "{col}_median")) # extra fun thing to rename columns
#> # A tibble: 3 × 4
#> origin air_time_median arr_delay_median dep_delay_median
#> <chr> <dbl> <dbl> <dbl>
#> 1 EWR 130 -4 -1
#> 2 JFK 149 -6 -1
#> 3 LGA 115 -5 -3
Exercise 2.3
Which destinations have the longest maximum arrival delay? Which destinations have the shortest? Pull data for the top 10 for both the longest and shortest maximum arrival delay. Keep track of how many flights there are to each location in case that might be useful information.
Hints (click here)
Group the tibble by the dest
variable, and summarize to get max arr_delay
and also n()
. Then pick the right slice_()
function.
Solution (click here)
Least delayedflights %>%
drop_na(dest, arr_delay) %>%
group_by(dest) %>%
summarize(max_arr_delay = max(arr_delay),
n_flights = n()) %>%
slice_min(max_arr_delay, n = 10)
#> # A tibble: 10 × 3
#> dest max_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 LEX -22 1
#> 2 PSP 17 18
#> 3 ANC 39 8
#> 4 HDN 43 14
#> 5 EYW 45 17
#> 6 SBN 53 10
#> 7 MTJ 101 14
#> 8 ILM 143 107
#> 9 ABQ 153 254
#> 10 BZN 154 35
Most delayed
flights %>%
drop_na(dest, arr_delay) %>%
group_by(dest) %>%
summarize(max_arr_delay = max(arr_delay),
n_flights = n()) %>%
slice_max(max_arr_delay, n = 10)
#> # A tibble: 10 × 3
#> dest max_arr_delay n_flights
#> <chr> <dbl> <int>
#> 1 HNL 1272 701
#> 2 CMH 1127 3326
#> 3 ORD 1109 16566
#> 4 SFO 1007 13173
#> 5 CVG 989 3725
#> 6 TPA 931 7390
#> 7 MSP 915 6929
#> 8 ATL 895 16837
#> 9 MIA 878 11593
#> 10 LAS 852 5952