Session 13: Applying The Apply Functions

Using the apply() functions of base R as an alternative to loops.


Session Goals

  • List the functions in the apply() family of functions from base R.
  • Describe how the apply() functions relate to loops in R.
  • Identify the input and output formats associated with different apply() functions.
  • Identify appropriate apply() functions for different scenarios.
  • Use apply() functions to explore some US state temperature data.

Intro: The apply() Functions

R is sometimes referred to as a functional programming language, and the apply() family of functions from base R is an example of this functional programming. Let’s first take a look at some available functions - they include apply(), lapply(), sapply(), vapply(), mapply(), and tapply().

Last week in session 12, Jelmer introduced for loops as one method for iterating over some set of things in R. Let’s briefly revisit one of his examples. First, we’ll recreate his distance dataset…

#distance data (km) for two dates
dists_Mar4 <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5 <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)

dist_df <- data.frame(dists_Mar4, dists_Mar5)

#view the data frame
dist_df

#>    dists_Mar4 dists_Mar5
#> 1          17         87
#> 2          93        143
#> 3          56        103
#> 4          19        223
#> 5         175        106
#> 6          40         18
#> 7          69         87
#> 8         267         72
#> 9           4         59
#> 10         91          5

As he showed, one way to get the median distance traveled for each day (column) is to iterate over each column with a for loop, applying the median() function to each one…

#create a numeric vector to store the loop output (one slot per column)
column_medians <- vector(mode = "numeric", length = ncol(dist_df))

#for loop to calculate median for each column
for (column_number in seq_len(ncol(dist_df))) {
  
  ## We extract one column using "dataframe_name[[column_number]]":
  column_median <- median(dist_df[[column_number]])
  
  ## We add the single-column median to its associated position
  ## in the vector:
  column_medians[column_number] <- column_median
}

#view the result
column_medians

#> [1] 62.5 87.0

Let’s think of this loop as the “programming” part of the functional programming I mentioned earlier - we’ve written, or programmed, some code the computer will execute for us - we’ll get to the “functional” part of functional programming shortly.

Unless you’re brand new to R, you’ve probably realized by now that there are a few data structures you find yourself working with pretty frequently. These include data frames, matrices, and lists. Not only do these get used a lot, but there are also certain operations that get performed pretty frequently on each of those types of objects. For example, doing something like iterating over either the rows or columns of a data frame and applying some function to each, like we did with the median function in the data frame above, is pretty common. That means lots of people would end up independently writing for loops that would look a lot like the one in our example. This is where the “functional” part of “functional programming” starts to come in. Instead of everyone independently writing that same basic loop over and over, it can be written one time in a general form and packaged into a function that can be called instead. And this is what the apply() functions do. Then, going one step further, functional programming allows us to pass individual functions as arguments to other functions, as we’re going to see shortly. Let’s take a look at some examples.
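To make that idea a bit more concrete, here is a minimal sketch of how the loop from earlier could be written once in a general form. The name apply_to_columns() and its arguments are hypothetical, made up for this illustration; the real apply() functions are implemented differently and handle many more cases.

```r
# Distance data (km) from the example above
dists_Mar4 <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5 <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)
dist_df <- data.frame(dists_Mar4, dists_Mar5)

# Hypothetical helper (not part of base R): the same loop, packaged as a
# function. 'fun' can be any function that returns a single number.
apply_to_columns <- function(df, fun) {
  out <- numeric(ncol(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  names(out) <- names(df)
  out
}

# The function to run on each column is itself passed in as an argument
col_medians <- apply_to_columns(dist_df, median)
col_maxima  <- apply_to_columns(dist_df, max)
```

The loop only has to be written once; swapping median for max (or any other summary function) is just a change to one argument.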



Examples

apply()

We’ll start with the apply() function, which we can use to iteratively apply some function to the margins (rows or columns) of an object that has “row by column” structure. There are three arguments that have to be passed to apply() - the object containing the data, the margin the function will be applied to (rows are designated with ‘1’, columns with ‘2’), and the function of interest.

In the example above, we used a loop to apply the median() function to each column of the data frame. Here, we’ll do the same thing with apply(), by passing the median() function as an argument to apply()…

apply_out <- apply(dist_df, 2, median)

#view the result
apply_out

#> dists_Mar4 dists_Mar5 
#>       62.5       87.0

Notice how much less code it required here to do the same thing we did with the for loop above!

Notice too that the output here is a vector (specifically, a named numeric vector). The apply() function determined this was most appropriate in this case, since the output of each iteration consisted of a single value. Here’s another scenario…

apply_out_quantiles <- apply(dist_df, 
                             2, 
                             quantile, 
                             probs = c(0.25, 0.5, 0.75))

#view the result
apply_out_quantiles

#>     dists_Mar4 dists_Mar5
#> 25%      24.25      62.25
#> 50%      62.50      87.00
#> 75%      92.50     105.25

This time, the function output consisted of 3 values for each iteration, or column of the data frame. In this case, the output from apply is a matrix.
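As a small sketch of how the margin argument and the shape of the output fit together (the object names row_totals and col_ranges are just made up for this illustration), we can apply sum() by rows, and range(), which returns two values, by columns.

```r
# Distance data (km) from the example above
dists_Mar4 <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5 <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)
dist_df <- data.frame(dists_Mar4, dists_Mar5)

# Margin 1 (rows): total distance over both days, one value per row
row_totals <- apply(dist_df, 1, sum)

# Margin 2 (columns): range() returns a min and a max for each column
col_ranges <- apply(dist_df, 2, range)

length(row_totals)   # 10 values, one per row of the data frame
dim(col_ranges)      # 2 rows (min, max) by 2 columns (the two dates)
```

One value per iteration gives a vector; several values per iteration give a matrix with one column per iteration.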

A quick additional note about how the apply() call with quantile() above is structured. In it, we applied the quantile() function to each column, passing the probs argument to it to define the specific quantiles we wanted it to return. If we were running quantile() by itself (not in the context of apply()), it might look like this…

quantile(dists_Mar4, probs = c(0.25, 0.50, 0.75))

#>   25%   50%   75% 
#> 24.25 62.50 92.50

Notice the slight difference in how the probs argument is passed to the quantile() function here versus inside the apply() function above. Here, probs is inside a set of parentheses associated with the function. But inside the apply() function, any arguments associated with the function get passed as a separate argument (separated from the function by a comma). If you check out the apply() documentation, this is indicated with the “…” argument, which is described as “optional arguments to FUN”. You’ll see this kind of thing show up in other functions too.
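Here is a short sketch comparing the two ways of handing extra arguments to the function being applied: through the “…” argument, as above, or by wrapping the call in an anonymous function. The object names q_dots and q_anon are made up for this illustration.

```r
# Distance data (km) from the example above
dists_Mar4 <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5 <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)
dist_df <- data.frame(dists_Mar4, dists_Mar5)

# probs is not an argument of apply() itself, so it is forwarded
# to quantile() through "..."
q_dots <- apply(dist_df, 2, quantile, probs = c(0.25, 0.5, 0.75))

# Same idea with an anonymous function: here probs sits inside the
# parentheses of quantile(), just like when quantile() is run on its own
q_anon <- apply(dist_df, 2, function(x) quantile(x, probs = c(0.25, 0.5, 0.75)))

# Check that both forms give the same matrix
isTRUE(all.equal(q_dots, q_anon))
```

The anonymous-function form is a bit more typing, but it can be easier to read when the applied function needs several arguments.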

So, what about the other types of apply() functions? Well, the different types are designed for different types of input. For example…

lapply()

Remember that apply() requires you to define whether you’ll apply the function in a row-wise or column-wise manner. But lists aren’t set up as rows and columns. So, if we want to iterate over the elements of a list, apply() won’t work. An alternative is lapply().

In the next example, we’ll add some new distance data for two additional dates. The number of observations is different this time though, so the data can’t be combined in a data frame (you might remember that a data frame is a special kind of list where all of the list elements are the same length). Since we have different lengths here, we’ll store the data as a list…

#create a list that includes the new distance data

dists_Mar11 <- c(45, 34, 100, 40, 29, 88, 84, 102)
dists_Mar12 <- c(90, 50, 19, 123, 77, 13, 70)

dist_ls <- list(dists_Mar4, dists_Mar5, dists_Mar11, dists_Mar12)

#view the list
dist_ls

#> [[1]]
#>  [1]  17  93  56  19 175  40  69 267   4  91
#> 
#> [[2]]
#>  [1]  87 143 103 223 106  18  87  72  59   5
#> 
#> [[3]]
#> [1]  45  34 100  40  29  88  84 102
#> 
#> [[4]]
#> [1]  90  50  19 123  77  13  70

Now we’ll apply the median() function to each element of the list. Again, we could write a for loop to iterate over each list element, but lapply() will do the same thing with much less code to write…

lapply_out <- lapply(dist_ls, median)

#view the output
lapply_out

#> [[1]]
#> [1] 62.5
#> 
#> [[2]]
#> [1] 87
#> 
#> [[3]]
#> [1] 64.5
#> 
#> [[4]]
#> [1] 70
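For comparison, here is a sketch of the kind of for loop that lapply() saves us from writing. The object name loop_out is made up for this illustration.

```r
# The four distance vectors (km) from the examples above
dists_Mar4  <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5  <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)
dists_Mar11 <- c(45, 34, 100, 40, 29, 88, 84, 102)
dists_Mar12 <- c(90, 50, 19, 123, 77, 13, 70)
dist_ls <- list(dists_Mar4, dists_Mar5, dists_Mar11, dists_Mar12)

# Create an empty list with one slot per element of dist_ls
loop_out <- vector(mode = "list", length = length(dist_ls))

# Iterate over the list elements, storing each median in its slot
for (i in seq_along(dist_ls)) {
  loop_out[[i]] <- median(dist_ls[[i]])
}

# The loop result matches what lapply() returns
identical(loop_out, lapply(dist_ls, median))
```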

This time, the output is a list - lapply() always gives output in list format. But in this specific case, the output could just as easily (and maybe more simply) be stored as a vector of four values - one for each list element. sapply() is an alternative to lapply() that, like lapply(), still works on list input, but that attempts to simplify the output where possible…

sapply()

sapply_out <- sapply(dist_ls, median)

#view the output
sapply_out

#> [1] 62.5 87.0 64.5 70.0
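What sapply() returns depends on what the applied function returns, so here is a small sketch of three cases. The list names (Mar4, Mar5, …) and the object names are made up for this illustration. It also shows vapply(), a stricter base R relative of sapply() where you state up front what each result should look like.

```r
# Same distance data (km) as above, this time as a named list
dist_ls <- list(
  Mar4  = c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91),
  Mar5  = c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5),
  Mar11 = c(45, 34, 100, 40, 29, 88, 84, 102),
  Mar12 = c(90, 50, 19, 123, 77, 13, 70)
)

# One value per element: simplified to a (named) numeric vector
med_vec <- sapply(dist_ls, median)

# Two values per element: simplified to a matrix with 2 rows and 4 columns
q_mat <- sapply(dist_ls, quantile, probs = c(0.25, 0.75))

# Different numbers of values per element (distances over 100 km):
# no simplification is possible, so the result stays a list
long_trips <- sapply(dist_ls, function(x) x[x > 100])

# vapply() errors if any result is not a single number
med_strict <- vapply(dist_ls, median, FUN.VALUE = numeric(1))
```

Because the output type of sapply() can change with the data, vapply() is sometimes preferred inside functions or scripts where a surprise list would cause trouble.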

Those three - apply(), lapply(), and sapply() - are the apply functions you’ll likely encounter most frequently, but there are others that apply in more specific cases - we’ll look at at least one more of them later in the Bonus section.


Breakout Rooms

We’ll work with a new temperature dataset for the Breakout Room Exercises. I’ve filtered and cleaned these data from the original dataset that’s available from climate.gov. They consist of maximum average temperature values for three states - Colorado, Ohio, and Virginia, with years in rows and months in columns. You can download the data with this code…

library(tidyverse)

temp_url <- 'https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/temperature/co_oh_va_max_temp.txt'
temp_file <- 'state_max_temps.tsv'
download.file(url = temp_url, destfile = temp_file)

Exercise 1

First let’s load the dataset and assign it to an object named ‘maxtemps’. Then preview the dataset and determine its dimensions (number of rows and columns). As the ‘.tsv’ extension on the file suggests, this is a tab delimited file.

Hints


Use read_tsv() to load the dataset. The functions head() and glimpse() are a couple of good options for previewing the data. If you don’t get the dimensions from the function you preview the data with, the dim() function will provide this info.

Solution
maxtemps <- read_tsv("state_max_temps.tsv")

#> Parsed with column specification:
#> cols(
#>   STATE = col_character(),
#>   YEAR = col_double(),
#>   JAN = col_double(),
#>   FEB = col_double(),
#>   MAR = col_double(),
#>   APR = col_double(),
#>   MAY = col_double(),
#>   JUN = col_double(),
#>   JUL = col_double(),
#>   AUG = col_double(),
#>   SEP = col_double(),
#>   OCT = col_double(),
#>   NOV = col_double(),
#>   DEC = col_double()
#> )

glimpse(maxtemps)

#> Rows: 378
#> Columns: 14
#> $ STATE <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO…
#> $ YEAR  <dbl> 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 190…
#> $ JAN   <dbl> 33.6, 41.2, 34.2, 32.6, 34.0, 40.8, 39.4, 38.4, 37.5, 36.6, 33.…
#> $ FEB   <dbl> 33.3, 41.3, 36.8, 43.4, 29.4, 39.1, 37.9, 43.2, 28.3, 47.3, 33.…
#> $ MAR   <dbl> 46.1, 44.8, 42.8, 43.5, 43.7, 51.9, 46.0, 44.3, 44.1, 51.4, 48.…
#> $ APR   <dbl> 60.8, 58.3, 55.8, 59.0, 58.7, 52.4, 55.5, 59.0, 55.6, 58.3, 53.…
#> $ MAY   <dbl> 66.5, 68.0, 70.5, 61.8, 66.5, 68.1, 67.6, 68.7, 63.8, 65.5, 61.…
#> $ JUN   <dbl> 74.0, 80.2, 77.4, 76.6, 77.2, 80.1, 76.8, 78.8, 69.8, 71.7, 77.…
#> $ JUL   <dbl> 77.7, 80.9, 82.0, 81.9, 80.2, 82.9, 86.7, 80.3, 80.9, 79.0, 79.…
#> $ AUG   <dbl> 80.0, 81.2, 79.6, 82.0, 79.8, 81.9, 80.7, 80.8, 80.3, 77.8, 81.…
#> $ SEP   <dbl> 75.7, 70.1, 75.2, 72.2, 74.9, 70.2, 72.7, 72.3, 70.0, 72.2, 73.…
#> $ OCT   <dbl> 60.3, 59.2, 59.8, 57.7, 58.0, 62.9, 63.1, 61.8, 62.4, 60.2, 57.…
#> $ NOV   <dbl> 42.3, 42.8, 48.5, 43.0, 50.5, 49.3, 52.9, 46.7, 49.7, 53.0, 48.…
#> $ DEC   <dbl> 33.9, 43.1, 33.4, 30.9, 35.0, 41.5, 38.6, 36.4, 41.8, 39.8, 35.…

dim(maxtemps)

#> [1] 378  14

Exercise 2

The dataset is currently in tibble form. This is the default object type created by the read_tsv() command from readr (common in tidy workflows). The apply functions are not associated with the tidyverse, and it turns out they sometimes don’t work well with tibbles. So, before we go any further, let’s convert the tibble to a data frame.

Hints

Use the as.data.frame() function to convert the tibble to a data frame.


Solution
maxtemps <- as.data.frame(maxtemps)
class(maxtemps)

#> [1] "data.frame"
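One difference that can trip up the apply functions, sketched below with a tiny made-up dataset (the values are invented, not the real temperatures): selecting a single column with [ , "JUL"] gives a plain vector from a data frame, but a one-column tibble from a tibble. The tibble part only runs if the tibble package is installed, and whether this is the specific issue you’d hit in a given case is an assumption on my part.

```r
# Toy data with made-up values (not the real temperature data)
toy <- data.frame(STATE = c("CO", "CO", "OH", "OH"),
                  JUL   = c(80, 82, 84, 86))

# Base R data frame: a single selected column is dropped to a plain vector
jul_df <- toy[, "JUL"]
is.numeric(jul_df)

# A plain vector is what a function like tapply() expects as input
state_means <- tapply(jul_df, toy$STATE, mean)

# Tibble: the same indexing keeps a one-column tibble instead of a vector
if (requireNamespace("tibble", quietly = TRUE)) {
  toy_tbl <- tibble::as_tibble(toy)
  jul_tbl <- toy_tbl[, "JUL"]
  is.data.frame(jul_tbl)
  # toy_tbl[["JUL"]] or toy_tbl$JUL would extract the vector instead
}
```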

Exercise 3

Calculate the average temperature for each month across the whole dataset (using the data for all three states together).

Hints


Choose an appropriate function from the apply() family of functions and use the mean() function to calculate the mean value for each column of temperatures in the dataset (cols 3 through 14). Remember that when you’re designating the margin to apply the function to, ‘1’ means rows and ‘2’ means columns.

Solution
mean_monthly <- apply(maxtemps[,3:14], 2, mean)

#OR

mean_monthly <- sapply(maxtemps[,3:14], mean)

#Remember that a data frame is just a special case of a list (one that's structured in rows and columns), so either `apply()` or `sapply()` will work here

#view results
mean_monthly

#>      JAN      FEB      MAR      APR      MAY      JUN      JUL      AUG 
#> 38.89206 41.93598 50.73148 61.35608 71.26111 79.95503 84.17460 82.23571 
#>      SEP      OCT      NOV      DEC 
#> 75.92857 64.56296 51.49921 41.11508
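For column means specifically, base R also has a dedicated shortcut, colMeans(). The sketch below uses a small made-up stand-in for the temperature data (the values and the name toy_temps are invented for illustration) to check that all three approaches agree.

```r
# Toy stand-in for the temperature data; all values are made up
toy_temps <- data.frame(
  STATE = c("CO", "CO", "OH", "OH"),
  YEAR  = c(1895, 1896, 1895, 1896),
  JAN   = c(30, 32, 34, 36),
  JUL   = c(80, 82, 84, 86)
)

# Selecting the month columns by name rather than by position
month_cols <- c("JAN", "JUL")

means_apply  <- apply(toy_temps[, month_cols], 2, mean)
means_sapply <- sapply(toy_temps[, month_cols], mean)
means_col    <- colMeans(toy_temps[, month_cols])

# All three are named numeric vectors with the same values
isTRUE(all.equal(means_apply, means_sapply))
isTRUE(all.equal(means_apply, means_col))
```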

Exercise 4

Now let’s get the average annual (max) temperatures for Ohio for all the years available in the dataset (1895-2020) and view the temperatures for the first 5 years of the dataset (1895-1899). Since it’s not really obvious what each of these values corresponds to, try converting this vector to a named vector with the years serving as the names.

Hints


Use the same apply() and mean() functions as above, but this time, filter the dataset for just the “OH” entries, and also apply the function by rows. Remember that a two-dimensional object like a data frame or matrix is indexed with the form [rows, columns]. Alternatively, you can use tidy notation (i.e. filter, select). Then index the resulting vector with the square bracket notation (Session 9) to get the first five items. The names() function will allow you to add names to the vector elements.

Solution
#base R indexing...
mean_annual_oh <- apply(maxtemps[maxtemps$STATE == "OH", 3:14], 1, mean)

#OR 

#a more tidy approach (actually a hybrid approach here - the apply function is still base R)...
mean_annual_oh <- maxtemps %>% 
                  filter(STATE == "OH") %>% 
                  select(JAN:DEC) %>% 
                  apply(1, mean)

#view first 5 items
mean_annual_oh[1:5]

#> [1] 60.23333 60.74167 61.20833 61.42500 61.59167


#add names to the vector
names(mean_annual_oh) <- 1895:2020

#view first 5 items
mean_annual_oh[1:5]

#>     1895     1896     1897     1898     1899 
#> 60.23333 60.74167 61.20833 61.42500 61.59167
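Typing out 1895:2020 works because we know the years in advance. As a sketch of an alternative, the names can be taken from the YEAR column itself, so they stay matched to the rows that were actually selected. The data below are a made-up stand-in (toy_temps and its values are invented for illustration).

```r
# Toy stand-in for the temperature data; all values are made up
toy_temps <- data.frame(
  STATE = c("CO", "CO", "OH", "OH"),
  YEAR  = c(1895, 1896, 1895, 1896),
  JAN   = c(30, 32, 34, 36),
  JUL   = c(80, 82, 84, 86)
)

# Keep just the Ohio rows
oh <- toy_temps[toy_temps$STATE == "OH", ]

# Row-wise mean over the month columns
mean_annual_oh <- apply(oh[, c("JAN", "JUL")], 1, mean)

# Use the years from the same filtered rows as the names
names(mean_annual_oh) <- oh$YEAR

# Names are stored as text, so a year is looked up with quotes
mean_annual_oh["1895"]
```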

Bonus 1

What if we wanted to compare the mean max July temperatures for each of the three states? Use an appropriate apply() function to calculate the mean values for July separately for CO, OH, and VA.

Hints


tapply() allows you to apply a function to subsets of a vector that are defined by a set of grouping variables (factors). Check the help page for tapply() and use the “STATE” column as the grouping factor.

Solution
tapply(maxtemps[,"JUL"], maxtemps$STATE, mean)

#>       CO       OH       VA 
#> 82.25238 84.53810 85.73333
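One way to think about what tapply() is doing here: it splits the vector into groups and then applies the function to each group. The sketch below spells out those two steps with split() and sapply(), using made-up values (toy_temps is invented for illustration, not the real data).

```r
# Toy stand-in for the July temperatures; all values are made up
toy_temps <- data.frame(
  STATE = c("CO", "CO", "OH", "OH", "VA", "VA"),
  JUL   = c(80, 82, 84, 86, 85, 87)
)

# tapply(): group and summarize in one call
by_tapply <- tapply(toy_temps$JUL, toy_temps$STATE, mean)

# The same thing in two steps: split the vector into a list with one
# element per state, then apply mean() to each list element
jul_by_state <- split(toy_temps$JUL, toy_temps$STATE)
by_sapply <- sapply(jul_by_state, mean)
```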

Bonus 2

Now, instead of focusing on just July, let’s try to get the average max temperatures for each month for each of the three states separately.

Hint 1


The tapply() function we used in Bonus 1 only works when the input is a single vector. Look toward the end of the tapply() documentation for a suggested related function that might apply here.

Hint 2


Give the aggregate() function a try. Notice that the grouping variable (the “by” argument in the function) has to be provided in the form of a list.

Solution
aggregate(maxtemps[,3:14], by = list(maxtemps$STATE), mean)

#>   Group.1      JAN      FEB      MAR      APR      MAY      JUN      JUL
#> 1      CO 36.85238 40.45952 47.44444 56.49762 66.00952 76.75238 82.25238
#> 2      OH 35.15476 37.99444 48.45238 61.05714 72.28571 80.70000 84.53810
#> 3      VA 44.66905 47.35397 56.29762 66.51349 75.48810 82.41270 85.73333
#>        AUG      SEP      OCT      NOV      DEC
#> 1 79.99286 72.51508 60.87381 47.09365 37.73333
#> 2 82.60952 76.72619 64.47778 50.32698 38.54365
#> 3 84.10476 78.54444 68.33730 57.07698 47.06825
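The “Group.1” column name in that output comes from passing an unnamed list to the by argument. As a sketch with made-up values (toy_temps is invented for illustration), naming the list element gives a more readable header, and aggregate() also has a formula interface that does the same job.

```r
# Toy stand-in for the temperature data; all values are made up
toy_temps <- data.frame(
  STATE = c("CO", "CO", "OH", "OH", "VA", "VA"),
  JAN   = c(30, 32, 34, 36, 44, 46),
  JUL   = c(80, 82, 84, 86, 85, 87)
)

# Naming the list element labels the grouping column "STATE"
agg_named <- aggregate(toy_temps[, c("JAN", "JUL")],
                       by = list(STATE = toy_temps$STATE),
                       FUN = mean)

# Formula interface: summarize JAN and JUL, grouped by STATE
agg_formula <- aggregate(cbind(JAN, JUL) ~ STATE,
                         data = toy_temps,
                         FUN = mean)
```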

Purrr: An Alternative (Tidy) Approach To apply() Functions

In the second exercise, we converted back from a tibble to a data frame, as the apply() functions we’ve worked with here are part of base R, and some aren’t compatible with tibbles. It’s worth mentioning that there are tidy alternatives to the apply functions - they’re part of the purrr package, which might be the topic of a future code club session. We decided to go with apply() in this session since there were a couple requests for it, and it still does get used enough that you’re likely to at least run across it, even if you don’t use it yourself. For now though, if you want more details on purrr you can find them here.
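As a quick taste, here is a sketch of how two purrr functions line up with the base R functions from this session: map() returns a list like lapply(), and map_dbl() returns a numeric vector. The purrr lines only run if the package is installed, and the object names are made up for this illustration.

```r
# The four distance vectors (km) from the earlier examples
dist_ls <- list(
  c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91),
  c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5),
  c(45, 34, 100, 40, 29, 88, 84, 102),
  c(90, 50, 19, 123, 77, 13, 70)
)

# Base R versions
base_list <- lapply(dist_ls, median)
base_vec  <- vapply(dist_ls, median, FUN.VALUE = numeric(1))

# purrr versions (skipped if purrr is not installed)
if (requireNamespace("purrr", quietly = TRUE)) {
  purrr_list <- purrr::map(dist_ls, median)      # a list, like lapply()
  purrr_vec  <- purrr::map_dbl(dist_ls, median)  # a numeric vector
}
```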



Mike Sovic
Bioinformatician at CAPS