Session S03E09: Functional Programming With apply() Functions

Using the apply() functions of base R as an alternative to loops.


Session Goals

  • Continue practicing with loops in R.
  • Describe how the apply() functions relate to loops in R.
  • Use apply() functions as alternatives to loops.
  • Identify the input and output formats associated with different apply() functions.

Highlights From Recent Sessions

Over the past several sessions, we’ve covered a few topics that are relevant to today’s discussion of the apply() functions. Here’s a quick review…

Data Structures And Indexing

There are several widely used data structures in R, including vectors, lists, and data frames. As Michael Broe showed in a recent session on data structures, each of these can be indexed, which means we can pull out one or more specific elements from them.

Vectors

Vectors have one dimension (they are characterized by their length), and all elements of a vector must be of the same class. They are often created with the c() (combine) function, or with the : operator for sequences of integers…

#Create some vectors
num_vector1 <- 1:10
num_vector1

#>  [1]  1  2  3  4  5  6  7  8  9 10

class(num_vector1)

#> [1] "integer"

num_vector2 <- c(1,2,6,10)
num_vector2

#> [1]  1  2  6 10

class(num_vector2)

#> [1] "numeric"

log_vector <- c(TRUE, FALSE, TRUE, FALSE)
log_vector

#> [1]  TRUE FALSE  TRUE FALSE

class(log_vector)

#> [1] "logical"

#Index a vector
num_vector2[c(1,3)]

#> [1] 1 6

num_vector2[log_vector]

#> [1] 1 6
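
The "same class" rule has a practical consequence: when c() is given mixed types, R coerces them to one common class rather than raising an error. A minimal sketch (the object names are illustrative and not part of the session's examples):

```r
# Numbers and logicals together: the logicals become numbers (TRUE = 1, FALSE = 0)
mixed_num_log <- c(1, TRUE, FALSE)
class(mixed_num_log)
#> [1] "numeric"

# Add a character string and everything becomes character
mixed_num_chr <- c(1, "a", TRUE)
mixed_num_chr
#> [1] "1"    "a"    "TRUE"
```

This is worth keeping in mind later on, since functions like mean() expect numeric or logical input.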

Lists

#Create a list
my_list <- list("num_vec1" = num_vector1,
                "num_vec2" = num_vector2,
                "log_vec" = c(TRUE, FALSE, TRUE, FALSE))

#View the list
my_list

#> $num_vec1
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> $num_vec2
#> [1]  1  2  6 10
#> 
#> $log_vec
#> [1]  TRUE FALSE  TRUE FALSE

#Try some indexing
my_list[2]

#> $num_vec2
#> [1]  1  2  6 10

my_list[[2]]

#> [1]  1  2  6 10

my_list$num_vec2

#> [1]  1  2  6 10
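
The difference between single and double brackets matters once we start passing list entries to functions. A short sketch, rebuilding the same list so the block runs on its own:

```r
my_list <- list("num_vec1" = 1:10,
                "num_vec2" = c(1, 2, 6, 10),
                "log_vec" = c(TRUE, FALSE, TRUE, FALSE))

# Single brackets keep the list wrapper: the result is a list of length 1
class(my_list[2])
#> [1] "list"

# Double brackets (or $) extract the element itself
class(my_list[[2]])
#> [1] "numeric"

# Functions that expect a vector therefore need the double-bracket form
mean(my_list[[2]])
#> [1] 4.75
```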

Data Frames

#Create a data frame
my_df <- data.frame("num_vec" = num_vector2,
                "log_vec" = c(TRUE, FALSE, TRUE, FALSE))

#View the data frame
my_df

#>   num_vec log_vec
#> 1       1    TRUE
#> 2       2   FALSE
#> 3       6    TRUE
#> 4      10   FALSE


#OR

my_df <- as.data.frame(my_list[c(2,3)])

my_df

#>   num_vec2 log_vec
#> 1        1    TRUE
#> 2        2   FALSE
#> 3        6    TRUE
#> 4       10   FALSE


#Index the data frame
my_df[2]

#>   log_vec
#> 1    TRUE
#> 2   FALSE
#> 3    TRUE
#> 4   FALSE


my_df[[2]]

#> [1]  TRUE FALSE  TRUE FALSE


my_df$log_vec

#> [1]  TRUE FALSE  TRUE FALSE

Loops

As Jelmer demonstrated in last week’s session, loops allow you to iteratively apply some task(s) to a series of inputs. In the simple loop below, we take each of the three values (1,3,6), print a statement with the original value, then negate the value and print another statement with the updated value…

for (x in c(1,3,6)) {
  print(paste0("Input value is ", x))
  x <- -x
  print(paste0("Negated value is ", x))
}

#> [1] "Input value is 1"
#> [1] "Negated value is -1"
#> [1] "Input value is 3"
#> [1] "Negated value is -3"
#> [1] "Input value is 6"
#> [1] "Negated value is -6"
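
The loop above only prints its results. When the results are needed later, a common pattern is to create a container first and fill it by position. A minimal sketch of that pattern (the names input_values and negated are illustrative):

```r
input_values <- c(1, 3, 6)

# Pre-allocate a numeric vector with one slot per input
negated <- numeric(length(input_values))

# seq_along() gives the positions 1, 2, 3, used to index both vectors
for (i in seq_along(input_values)) {
  negated[i] <- -input_values[i]
}

negated
#> [1] -1 -3 -6
```

This "loop over positions and store each result" shape is the one the apply() functions replace.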

Functions

We use functions abundantly in R. Even the simple examples above used multiple functions, including c(), which combined items into a vector; class(), which returned the class (type) of an object; and paste0(), which stitched character strings and object values together into a single string. Functions typically accept (and often require) arguments: pieces of information, provided inside the parentheses, that supply input for the function or modify its behavior. As a simple example, setting the na.rm argument of mean() to TRUE removes any NA values from a vector before the mean is calculated. Otherwise, if the vector contains any NA, the mean is returned as NA…

values <- c(1:5, NA, 7:10)
values

#>  [1]  1  2  3  4  5 NA  7  8  9 10

mean(values)

#> [1] NA

mean(values, na.rm = TRUE)

#> [1] 5.444444

Functionals

In contrast to traditional arguments like na.rm above, some functions accept other functions as arguments; such functions are sometimes called functionals. In this session we’ll look at some of the functionals in the apply() group. These provide alternatives to writing loops by allowing us to iteratively apply some function over structures like lists or data frames. They include…

  • apply() - apply some function to the margins (rows or columns) of a rectangular object (i.e. matrix or data frame)
  • lapply() - apply some function to each element of a list
  • sapply() - similar to lapply(), but provides output in a different format
  • mapply() - apply a function to the corresponding elements of multiple lists or vectors in parallel
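
Since mapply() is not used in the examples that follow, here is a minimal sketch of how it pairs up its inputs. The vectors bases and exponents are made up for illustration:

```r
# Two inputs of equal length; mapply() walks through them in parallel
bases <- c(2, 3, 4)
exponents <- c(1, 2, 3)

# Unlike lapply(), the function comes first, followed by the inputs.
# This computes 2^1, 3^2 and 4^3.
powers <- mapply(function(b, e) b^e, bases, exponents)
powers
#> [1]  2  9 64
```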

Key to understanding how and when to use each of these is thinking about the structure of the data going in and the structure of the results that get returned. We’ll start with lapply().

lapply()

lapply() iteratively applies a function to each item in a list and always returns a list of results with the same number of entries as the input. The only required arguments are the list the function will be applied to and the function itself. Keep in mind that the apply() functions are alternatives to loops. We’ll try calculating means with both the loop approach and the lapply() approach on the simple_list example below…

simple_list <- list(1:10,
                    11:15,
                    16:30)

simple_list

#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> [[2]]
#> [1] 11 12 13 14 15
#> 
#> [[3]]
#>  [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Calculate Means With A Loop

num_entries <- length(simple_list)
results_list_loop <- vector("list", num_entries)  # pre-allocate an empty list

for (i in seq_len(num_entries)) {
  # store the mean of entry i as entry i of the results list
  results_list_loop[[i]] <- mean(simple_list[[i]])
}

results_list_loop

#> [[1]]
#> [1] 5.5
#> 
#> [[2]]
#> [1] 13
#> 
#> [[3]]
#> [1] 23

Calculate Means With lapply()

results_list_apply <- lapply(simple_list, mean)

results_list_apply

#> [[1]]
#> [1] 5.5
#> 
#> [[2]]
#> [1] 13
#> 
#> [[3]]
#> [1] 23

Notice we can use a lot less code with lapply() to get the same result as with the for loop.
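
The function handed to lapply() does not have to be an existing named function. As a sketch, an anonymous function written inline works the same way. Here it computes the spread (maximum minus minimum) of each entry:

```r
simple_list <- list(1:10,
                    11:15,
                    16:30)

# function(x) ... is applied to each entry in turn; x is that entry
spreads <- lapply(simple_list, function(x) max(x) - min(x))
spreads
#> [[1]]
#> [1] 9
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 14
```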

Give lapply() a try in a Breakout Room…

Breakout Exercises 1

As we’ve talked about before, lists and data frames are closely related data structures in R - data frames are a special type of list in which all the entries are of the same length, and so they can be neatly organized into a rectangular row/column structure. When data fit that rectangular pattern, it’s easy to switch them between lists and data frames.
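
One way to see this relationship is with a small made-up data frame (toy_df and its columns are hypothetical, used only for illustration). Because a data frame is a list of columns, lapply() can work on it directly:

```r
toy_df <- data.frame(length_mm = c(10, 12, 14),
                     mass_g = c(100, 150, 200))

# A data frame is a list under the hood
is.list(toy_df)
#> [1] TRUE

# So lapply() treats each column as one list entry
col_means <- lapply(toy_df, mean)
col_means
#> $length_mm
#> [1] 12
#>
#> $mass_g
#> [1] 150
```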

The code below pulls out the columns of the penguins data frame that are numeric and reformats them into a list named pens_list, which we’re previewing with the str() function.

library(tidyverse)
library(palmerpenguins)
pens_list <- select_if(penguins, is.numeric) %>% as.list()
str(pens_list)

#> List of 5
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Calculate the median value for each of the variables/entries in pens_list.

Hints


You can write a loop to do this, or, preferably, use lapply(). You’ll need one additional argument (na.rm) for the median() function - see the mean() example above, or check the help for the median() and lapply() functions for more details.


Solution
# loop option
results_loop <- list()

# seq_along() gives one position per list entry
for (i in seq_along(pens_list)) {
  results_loop[[i]] <- median(pens_list[[i]], na.rm = TRUE)
}

results_loop

#> [[1]]
#> [1] 44.45
#> 
#> [[2]]
#> [1] 17.3
#> 
#> [[3]]
#> [1] 197
#> 
#> [[4]]
#> [1] 4050
#> 
#> [[5]]
#> [1] 2008


#lapply option
lapply(pens_list, median, na.rm = TRUE)

#> $bill_length_mm
#> [1] 44.45
#> 
#> $bill_depth_mm
#> [1] 17.3
#> 
#> $flipper_length_mm
#> [1] 197
#> 
#> $body_mass_g
#> [1] 4050
#> 
#> $year
#> [1] 2008

You might have noticed that one of the entries is year. We don’t really need to get the median for that, so use lapply() to calculate the medians again, but this time only for the first 4 entries.

Hints


Index the list in the lapply() function with square brackets to apply the function to just the first 4 entries.


Solution

#lapply option
lapply(pens_list[1:4], median, na.rm = TRUE)

#> $bill_length_mm
#> [1] 44.45
#> 
#> $bill_depth_mm
#> [1] 17.3
#> 
#> $flipper_length_mm
#> [1] 197
#> 
#> $body_mass_g
#> [1] 4050

Try the same code again, but this time run it with sapply() instead of lapply(). What’s the difference between these two functions?

Hints


Simply replace lapply() from the previous exercise with sapply().


Solution
sapply(pens_list[1:4], median, na.rm = TRUE)

#>    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
#>             44.45             17.30            197.00           4050.00
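
The difference is in the output format: lapply() always returns a list, while sapply() tries to simplify the result to a vector or matrix. The sketch below shows both kinds of simplification on a small named list, plus vapply(), a stricter base R relative that is not covered in this session and is shown here only for comparison:

```r
simple_list <- list(a = 1:10, b = 11:15, c = 16:30)

# One value per entry: sapply() simplifies to a named numeric vector
means_vec <- sapply(simple_list, mean)
means_vec
#>    a    b    c 
#>  5.5 13.0 23.0 

# Two values per entry (min and max): sapply() simplifies to a matrix
ranges_mat <- sapply(simple_list, range)
ranges_mat
#>       a  b  c
#> [1,]  1 11 16
#> [2,] 10 15 30

# vapply() makes us declare the expected result type (one number per entry)
means_checked <- vapply(simple_list, mean, FUN.VALUE = numeric(1))
```

Because the shape of sapply() output depends on what the function returns, it is handy interactively, while lapply() or vapply() give more predictable results inside scripts.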

apply()

lapply() allowed us to apply a function to separate entries in a list. apply() does something similar, but applies the function to the margins (rows or columns) of objects with two dimensions like data frames or matrices.

Let’s start with a simple matrix…

simple_mat <- matrix(1:15, nrow = 3)

simple_mat

#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    4    7   10   13
#> [2,]    2    5    8   11   14
#> [3,]    3    6    9   12   15

Now we’ll use apply() to get means for entries in simple_mat. Like lapply(), apply() requires the object the function will be applied to and the function itself. But since apply() can work on either the rows or the columns, it needs a third argument, MARGIN, to specify which we want: 1 for rows or 2 for columns. Note that MARGIN comes second in the call, before the function…

Get The Mean For Each Column

apply(simple_mat, 2, mean)

#> [1]  2  5  8 11 14

Get The Mean For Each Row

apply(simple_mat, 1, mean)

#> [1] 7 8 9
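
To connect this back to loops, the sketch below writes out by hand roughly what apply(simple_mat, 2, mean) does, and also shows the dedicated base R shortcuts colMeans() and rowMeans(), which cover the specific case of means:

```r
simple_mat <- matrix(1:15, nrow = 3)

# Loop version of apply(simple_mat, 2, mean): visit each column in turn
col_means_loop <- numeric(ncol(simple_mat))
for (j in seq_len(ncol(simple_mat))) {
  col_means_loop[j] <- mean(simple_mat[, j])
}
col_means_loop
#> [1]  2  5  8 11 14

# Shortcuts for means specifically
colMeans(simple_mat)
#> [1]  2  5  8 11 14
rowMeans(simple_mat)
#> [1] 7 8 9
```

apply() remains the more general tool, since it accepts any function, not just the mean.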

Breakout Exercises 2

The code below will download a data frame that contains average monthly temperature data for 282 US locations from 1981-2010, reformat it a bit to make it easier to work with, and store it as the object temp_data.

temp_data <- read_csv('https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/temperature/city_temp_data_noaa.csv') %>%
  unite("Location",
        City, State,
        sep = " ") %>%
  column_to_rownames("Location")

Preview temp_data. How is it structured? What do the rows and columns represent?

Hints


Use head() or glimpse() to preview the dataset.


Solution
head(temp_data)

#>                   JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
#> BIRMINGHAM AP AL 53.8 58.4 66.7 74.4 81.5 87.7 90.8 90.6 85.1 75.3 65.4 55.9
#> HUNTSVILLE AL    51.2 55.9 64.9 73.6 81.3 88.2 90.7 90.9 85.0 74.6 63.7 53.5
#> MOBILE AL        60.0 63.2 69.8 76.1 83.0 88.2 90.4 90.5 87.3 79.4 70.3 61.9
#> MONTGOMERY AL    57.4 61.8 69.7 76.6 84.0 89.8 92.1 91.9 87.3 78.3 69.0 59.6
#> ANCHORAGE AK     23.1 26.6 33.9 44.5 56.0 62.8 65.4 63.5 55.1 40.5 27.8 24.8
#> ANNETTE AK       41.6 42.7 44.9 50.2 56.3 61.1 64.3 64.7 59.3 51.6 44.6 41.5

Now calculate the mean temperature for each month. Based on the locations sampled, what month is the warmest overall? The coldest?

Hints


Use apply() to calculate the means for each column (columns are designated with ‘2’).


Solution
apply(temp_data, 2, mean)

#>      JAN      FEB      MAR      APR      MAY      JUN      JUL      AUG 
#> 44.35709 48.03156 55.75035 64.83936 73.33227 80.73582 84.88865 83.81418 
#>      SEP      OCT      NOV      DEC 
#> 77.35922 66.87234 55.59610 46.22589

Now calculate the mean temperature for each location. Which location has the warmest annual temperature? The coldest? Since there are a lot of results to sort through, consider using indexing to extract the warmest and coldest temperatures.

Hints


Use apply() to calculate the means for each row (rows are designated with ‘1’). Save the results to an object, and then use logical indexing in combination with the max() function to pull out the entry with the maximum value or min() to pull out the minimum value.


Solution
row_means <- apply(temp_data, 1, mean)
row_means[row_means == max(row_means)]

#> POHNPEI-CAROLINE IS. PC 
#>                  88.175

row_means[row_means == min(row_means)]

#> BARROW AK 
#>  17.18333
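
As an alternative to logical indexing, base R's which.max() and which.min() return the position of the largest or smallest value. A small sketch on a made-up named vector (toy_means and its values are hypothetical stand-ins for row_means):

```r
# Toy named vector standing in for row_means (values are made up)
toy_means <- c("CITY A" = 61.2, "CITY B" = 88.1, "CITY C" = 17.4)

# Index by the position of the maximum / minimum
warmest <- toy_means[which.max(toy_means)]
warmest
#> CITY B 
#>   88.1 

coldest <- toy_means[which.min(toy_means)]
coldest
#> CITY C 
#>   17.4 
```

One difference to be aware of: if several entries tie for the extreme value, which.max() and which.min() return only the first, whereas the logical-indexing approach returns all of them.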

How many locations have a mean temperature above 75°F?

Hints


Use indexing like in the previous exercise. You can print the results, or use the length() function to get the number returned.


Solution
row_means[row_means > 75] %>% length()

#> [1] 68

Bonus

How many states or territories have at least one city in the dataset with a mean temperature above 75°F?

Hints


The states or territories are given by the last 2 characters in the row names of the data frame (which become the names of the vector elements in the results of apply()). Extract the set of names, use a regular expression to pull out the last two characters from each (consider stringr::str_replace() or gsub()), then use unique() to get each one that’s represented and find the length of that vector.


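Solution
# names of the locations with a mean temperature above 75
loc_names <- row_means[row_means > 75] %>% names()
# keep only the last two characters (the state/territory code) of each name
states <- stringr::str_replace(loc_names, "(.+)(..)", "\\2")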
unique_states <- unique(states)
unique_states

#>  [1] "AL" "AZ" "CA" "FL" "GA" "HI" "LA" "MS" "NV" "NM" "SC" "TX" "PC" "PR"


length(unique_states)

#> [1] 14
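
The same extraction can also be sketched in base R, without stringr. The example below uses three of the row names from the head() preview earlier and shows two options: one that counts characters and one that uses an anchored regular expression:

```r
# A few row names taken from the head() preview of temp_data
loc_names <- c("BIRMINGHAM AP AL", "MOBILE AL", "ANCHORAGE AK")

# Option 1: count characters and keep the last two
by_position <- substr(loc_names, nchar(loc_names) - 1, nchar(loc_names))
by_position
#> [1] "AL" "AL" "AK"

# Option 2: anchored regular expression; $ pins the two-character group to the end
by_regex <- sub("^.*(..)$", "\\1", loc_names)
by_regex
#> [1] "AL" "AL" "AK"

unique(by_regex)
#> [1] "AL" "AK"
```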


Purrr: An Alternative (Tidy) Approach To apply() Functions

There are tidy alternatives to the apply() functions in the purrr package, which we’ll explore in the next session. In the meantime, if you want a preview, see the purrr package documentation.



Mike Sovic
Bioinformatician at CAPS