Session S03E10: Functional Programming With purrr::map() Functions

Using functions from the tidyverse’s purrr package as an alternative to loops.


Session Goals

  • List and differentiate between some useful purrr functions.
  • Compare purrr functions to the apply() functions from last session.
  • Use purrr functions as alternatives to loops.

Highlights From Last Session

In the previous session, we explored some of the apply() family of functions, which provide alternatives to writing loops when we want to apply a function to each element of a data structure such as a vector, list, or data frame. Let’s start with a simple list with three entries…

Generate An Example List

our_list <- list(A = 1:10, B = 11:20, C = 21:30)
our_list

#> $A
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $B
#>  [1] 11 12 13 14 15 16 17 18 19 20
#> 
#> $C
#>  [1] 21 22 23 24 25 26 27 28 29 30

We could use a loop to calculate the mean for each of the entries.

Calculate Mean For Each List Entry With A Loop

results_loop <- list()

for (i in seq_along(our_list)) {
  results_loop[[i]] <- mean(our_list[[i]], na.rm = TRUE)
}

results_loop

#> [[1]]
#> [1] 5.5
#> 
#> [[2]]
#> [1] 15.5
#> 
#> [[3]]
#> [1] 25.5

Like we saw last week, lapply() gave us an alternative way to do the same thing, and with simpler and clearer code…

Calculate Mean For Each List Entry With lapply()

res_lapply <- lapply(our_list, mean, na.rm = TRUE)
res_lapply

#> $A
#> [1] 5.5
#> 
#> $B
#> [1] 15.5
#> 
#> $C
#> [1] 25.5

I mentioned in the previous session that when working with the apply() functions, it’s important to think about the structure/type of the data going into the function, and also of the data being returned by it. We saw that the lapply() example above returned a list, and we can confirm that with…

class(res_lapply)

#> [1] "list"

We also saw that sapply() was very similar to lapply(), but instead of a list being returned, the results were condensed down to a vector - specifically in this case, a numeric vector…

res_sapply <- sapply(our_list, mean, na.rm = TRUE)
res_sapply

#>    A    B    C 
#>  5.5 15.5 25.5

class(res_sapply)

#> [1] "numeric"

map() Functions From purrr

In this session, we’re going to look at some of the map() functions from the purrr package, which is part of the tidyverse. In some cases, these functions return the same results as their apply() analogues. As an example, compare the map() function to lapply()…

library(tidyverse)
res_map <- map(our_list, mean, na.rm = TRUE)
res_map

#> $A
#> [1] 5.5
#> 
#> $B
#> [1] 15.5
#> 
#> $C
#> [1] 25.5
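
If you'd rather check the match than eyeball it, here is a minimal sketch that rebuilds both results and compares them with all.equal(). It only assumes the purrr package is installed; the object names are the same ones used above.

```r
library(purrr)

our_list <- list(A = 1:10, B = 11:20, C = 21:30)

# Same task, once with base R and once with purrr
res_lapply <- lapply(our_list, mean, na.rm = TRUE)
res_map    <- map(our_list, mean, na.rm = TRUE)

# Returns TRUE when the two lists match (values and names)
all.equal(res_map, res_lapply)
```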

So why might you want to use the map() functions? There are two primary reasons…

  1. The syntax/names of the map() functions might be easier to understand and make for clearer code.
  2. The map() family provides additional functionality that might at best be cumbersome to achieve with the base R approaches. We won’t get to this in this session, but see the imap() function for one example.
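
For the curious, here is a minimal sketch of what imap() adds: the function it applies receives both the element and its name (or its position, if the list is unnamed). The labelling function and the argument names values and name are just made up for illustration, and imap_chr() is the variant that returns a character vector.

```r
library(purrr)

our_list <- list(A = 1:10, B = 11:20, C = 21:30)

# The first argument is the list element, the second is that element's name
labels <- imap_chr(our_list, function(values, name) {
  paste0(name, ": mean = ", mean(values))
})

labels
```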

Ultimately, much of the value in using these map() functions gets realized when you start writing your own custom functions, which is something we haven’t done yet, but will be doing soon. For now, we’ll work with some fairly basic examples just to get introduced to some of the syntax and usage of these functions. There’s a purrr cheatsheet that you might find helpful available at https://raw.githubusercontent.com/rstudio/cheatsheets/main/purrr.pdf

We’ll start working with map() in the first breakout session…

Breakout Exercises 1

Like with the first breakout exercise from last week, below I’m pulling out a subset of the numeric variables available in the penguins data frame and reformatting them into a list named pens_list that we’ll use to practice with map() functions.

library(tidyverse)
library(palmerpenguins)
pens_list <- select_if(penguins, is.numeric) %>% select(-year) %>% as.list()
str(pens_list)

#> List of 4
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...

Use map() to calculate the mean value for each of the variables/entries in pens_list. What type of results are returned (i.e. is it a list, vector, data frame, etc)?

Hints


Apply the mean() function with map(). Remember there are NA’s in the data - see the help for mean() for dealing with those. You can view the result, or try class() to get the type of object returned.


Solution
res_map <- map(pens_list, mean, na.rm = TRUE)
res_map

#> $bill_length_mm
#> [1] 43.92193
#> 
#> $bill_depth_mm
#> [1] 17.15117
#> 
#> $flipper_length_mm
#> [1] 200.9152
#> 
#> $body_mass_g
#> [1] 4201.754

class(res_map)

#> [1] "list"

As we saw last week, and in the example above, sapply() is similar to lapply(), but returns results as a vector instead of as a list. Take a look at the help for map() and find a function that will return a vector of doubles (numerics) and apply it to the pens_list object like you just did with map(). Then find another function that will return the results as a data frame.

Hints


Try the map_dbl() and map_dfc() functions.


Solution
map_vec <- map_dbl(pens_list, mean, na.rm = TRUE)
map_vec

#>    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
#>          43.92193          17.15117         200.91520        4201.75439


map_df <- map_dfc(pens_list, mean, na.rm = TRUE)
map_df

#> # A tibble: 1 x 4
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>            <dbl>         <dbl>             <dbl>       <dbl>
#> 1           43.9          17.2              201.       4202.

Remember that with map(), the input is a single list or vector, and the function is applied to each element of the list or vector. A number of variants of the map() function are available that define the type of output that gets returned.
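
As a rough illustration of those variants, the sketch below applies three other typed versions of map() to the small our_list example from earlier. Each one returns a vector of a single type and will throw an error if the function's results don't fit that type. The object names on the left are just made up for this example.

```r
library(purrr)

our_list <- list(A = 1:10, B = 11:20, C = 21:30)

n_values   <- map_int(our_list, length)      # integer vector: number of values per entry
is_num     <- map_lgl(our_list, is.numeric)  # logical vector: is each entry numeric?
entry_type <- map_chr(our_list, class)       # character vector: class of each entry

n_values
is_num
entry_type
```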

map2()

While map() allowed us to apply some function to the elements of a single list or vector, map2() lets us apply some operation to paired elements from two lists (or vectors).

our_list1 <- list(1:10)
our_list1

#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10

our_list2 <- list(11:20)
our_list2

#> [[1]]
#>  [1] 11 12 13 14 15 16 17 18 19 20

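Here we’ll use map2() to get the sums of corresponding pairs of elements from the two lists. Each list here holds a single element (a vector of ten integers), so map2() calls the function once and adds the two vectors together…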

map2(our_list1, our_list2, ~ .x + .y)

#> [[1]]
#>  [1] 12 14 16 18 20 22 24 26 28 30

The tilde (~) in the third argument indicates a formula that will be converted to a function and applied. It’s a shorthand way to write and apply a custom function. Since we haven’t gotten into writing our own functions yet (though that’s coming soon!), for now just remember that the function passed to map2() has to take two arguments, which are referred to as ‘.x’ and ‘.y’ (the elements in corresponding positions of the first and second lists or vectors, respectively) in each iteration.
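
To make the shorthand a bit more concrete, here is a small sketch that runs the same map2() call twice: once with a formula and once with an explicitly written function. The lists list_a and list_b are hypothetical and have two elements each, so the function is called twice, once per pair.

```r
library(purrr)

# Hypothetical example lists with two elements each
list_a <- list(1:3, 4:6)
list_b <- list(10, 100)

# Formula shorthand: .x comes from list_a, .y from list_b
res_formula <- map2(list_a, list_b, ~ .x * .y)

# The same operation written out as an anonymous function
res_function <- map2(list_a, list_b, function(x, y) x * y)

res_formula

# Check whether both versions give the same result
identical(res_formula, res_function)
```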

Breakout Exercises 2

Below are two vectors containing bill-measurement data from the penguins data frame (bill length and bill depth).

bill_length_mm <- pens_list$bill_length_mm
bill_depth_mm <- pens_list$bill_depth_mm

Use map2() to calculate the bill ratio for each penguin (length/depth). Output the result as a vector containing doubles (numerics), and save it as the object bill_ratios.

Hints


Apply the map2_dbl() function and use the third argument to define a formula that divides the length value by the depth value.


Solution
bill_ratios <- map2_dbl(bill_length_mm, bill_depth_mm, ~ .x / .y)

How many of the penguins in the dataset have bill ratios greater than 3?

Hints


Logicals in R can be interpreted as 1/0 (TRUE/FALSE). Try using sum() to sum over the results of a logical expression to get the number of ratios > 3. Alternatively, you could index bill_ratios to retain just the values > 3, and then get the length of that vector. Remember that there are NA’s mixed in - how does this affect each of these two approaches?


Solution
sum(bill_ratios > 3, na.rm = TRUE)

#> [1] 109


# OR

bill_ratios[bill_ratios > 3] %>% na.omit() %>% length()

#> [1] 109
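
To see how the NA's affect each approach, here is a minimal sketch on a tiny made-up vector (the values in ratios are invented for illustration, not taken from the penguins data). A comparison involving NA gives NA, which makes an unprotected sum() return NA and leaves an extra NA entry behind when indexing.

```r
# Made-up ratios, including one missing value
ratios <- c(2.5, 3.2, NA, 4.1)

# Approach 1: sum over the logical vector
n_sum_raw <- sum(ratios > 3)                # NA, because one comparison is NA
n_sum     <- sum(ratios > 3, na.rm = TRUE)  # 2

# Approach 2: index, then count
n_index_raw <- length(ratios[ratios > 3])           # 3, the NA is kept as an entry
n_index     <- length(na.omit(ratios[ratios > 3]))  # 2
```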

Bonus

Here’s one more dataset that’s based on the penguins data (though much of it is made up). It represents measurements taken for each of three penguins over three years. yr1_list has data for year 1, yr2_list has data for year 2, and yr3_list has data for year 3.

yr1_list <- list("Pen_1" = c("bill_length_mm" = 39.1, "bill_depth_mm" = 18.7, "flipper_length_mm" = 181, "body_mass_g" = 3750),
             "Pen_2" = c("bill_length_mm" = 39.5, "bill_depth_mm" = 17.4, "flipper_length_mm" = 186, "body_mass_g" = 3800),
             "Pen_3" = c("bill_length_mm" = 40.3, "bill_depth_mm" = 18, "flipper_length_mm" = 195, "body_mass_g" = 3250))

yr2_list <- list("Pen_1" = c("bill_length_mm" = 39.8, "bill_depth_mm" = 18.9, "flipper_length_mm" = 184, "body_mass_g" = 3767),
             "Pen_2" = c("bill_length_mm" = 38.7, "bill_depth_mm" = 17.2, "flipper_length_mm" = 186, "body_mass_g" = 3745),
             "Pen_3" = c("bill_length_mm" = 40.7, "bill_depth_mm" = 18.6, "flipper_length_mm" = 217, "body_mass_g" = 3470))

yr3_list <- list("Pen_1" = c("bill_length_mm" = 40.2, "bill_depth_mm" = 19.3, "flipper_length_mm" = 188, "body_mass_g" = 3790),
             "Pen_2" = c("bill_length_mm" = 38.4, "bill_depth_mm" = 17.0, "flipper_length_mm" = 187, "body_mass_g" = 3710),
             "Pen_3" = c("bill_length_mm" = 40.9, "bill_depth_mm" = 18.9, "flipper_length_mm" = 228, "body_mass_g" = 3493))

Try calculating the average value for each of the variables for each penguin over the three years. Output the results as a data frame. Note you’ll need a new map() function that we haven’t used yet. Take a look at the help for map2(), or the purrr cheatsheet, to find a similar function that works on more than two lists (and so takes functions with 3 or more arguments).

Hints

The pmap() function can be used here. It is similar to map2(), but instead of being limited to 2 lists, pmap() works with any number of them, including 3 or more. Notice that the lists the function will be applied to are given as a single argument (a list of lists). Provide a formula to calculate the mean. Unlike map2(), which only works with functions that take 2 arguments (denoted ‘.x’ and ‘.y’), the function used in pmap() can take three or more arguments, and in a formula they are denoted ‘..1’, ‘..2’, ‘..3’, etc.


Solution
pmap_dfr(list(yr1_list, yr2_list, yr3_list), ~ (..1 + ..2 + ..3)/3)

#> # A tibble: 3 x 4
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>            <dbl>         <dbl>             <dbl>       <dbl>
#> 1           39.7          19.0              184.       3769 
#> 2           38.9          17.2              186.       3752.
#> 3           40.6          18.5              213.       3404.
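
If the ‘..1’, ‘..2’, ‘..3’ notation feels opaque, this stripped-down sketch may help. It uses three short made-up numeric vectors (yr1, yr2, yr3 are hypothetical, one value per penguin per year) and pmap_dbl(), the variant that returns a numeric vector. It writes the same calculation once as a formula and once as an ordinary function with three arguments.

```r
library(purrr)

# Made-up measurements: one value per penguin for each of three years
yr1 <- c(10, 20, 30)
yr2 <- c(12, 22, 32)
yr3 <- c(14, 24, 34)

# Formula version: ..1, ..2, ..3 are the values from the 1st, 2nd, and 3rd vectors
means_formula <- pmap_dbl(list(yr1, yr2, yr3), ~ (..1 + ..2 + ..3) / 3)

# Same calculation with an explicit three-argument function
means_function <- pmap_dbl(list(yr1, yr2, yr3), function(a, b, c) (a + b + c) / 3)

means_formula
```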

Mike Sovic
Bioinformatician at CAPS