Session S03E12: Incorporating your own functions into loops





New To Code Club?

  • First, check out the Code Club Computer Setup instructions, which also have some pointers that might be helpful if you’re new to R or RStudio.

  • Please open RStudio before Code Club to test things out – if you run into issues, join the Zoom call early and we’ll troubleshoot.


Session Goals

  • Learn how to incorporate your own functions into loops.
  • Learn how to efficiently save the outputs of your loop into a data structure.
  • Learn how using a functional (like purrr::map) saves you a lot of housekeeping.

Again we’ll be using tibble(), which is loaded as part of the tidyverse, so we need to load that first.

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#>  ggplot2 3.3.5      purrr   0.3.4
#>  tibble  3.1.6      dplyr   1.0.8
#>  tidyr   1.2.0      stringr 1.4.0
#>  readr   2.1.2      forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()

We’ll also reuse the toy data frame from last Code Club:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df
#> # A tibble: 10 × 4
#>          a        b      c      d
#>      <dbl>    <dbl>  <dbl>  <dbl>
#>  1 -1.57    0.647    1.39   0.851
#>  2  0.239   0.667   -0.108 -1.74 
#>  3 -0.520  -0.663   -0.343  0.652
#>  4 -0.0359 -1.69    -1.30  -1.58 
#>  5  1.27    0.357    0.158 -1.92 
#>  6 -1.04    0.490    0.897  1.33 
#>  7 -0.212  -0.753   -1.68  -0.503
#>  8  1.91    0.275    0.646  0.139
#>  9 -0.535  -0.00632 -1.02  -0.467
#> 10  0.223  -0.422    0.616 -0.553

And we’ll also be re-using our own normalize function:

normalize <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}
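
As a quick sanity check of what normalize does, here is a small sketch on a hand-made vector (the vector `x` is an illustrative assumption, not part of the session data). The smallest value should map to 0 and the largest to 1:

```r
# normalize() as defined above, repeated so this snippet runs on its own
normalize <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(10, 20, 30)   # illustrative values, not from the session data

normalize(x)
#> [1] 0.0 0.5 1.0
```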

Last time we saw how to use this function to simplify our code:

df$a <- normalize(df$a)
df$b <- normalize(df$b)
df$c <- normalize(df$c)
df$d <- normalize(df$d)

In previous Code Clubs we’ve seen how you can apply a built-in function like mean to each column of a data frame using a for loop, lapply, or map.

We can use exactly the same techniques with our own functions.

But I think it’s worth taking advantage of this time to revisit a couple of details (some of which were in the Bonus Material in S03E08).

Accessing by value vs. index

In our first session on loops, we saw an example like this:

for (a_number in c(10, 11, 12, 13)) { # We iterate over 10, 11, 12, 13
  print(a_number * -1)
}
#> [1] -10
#> [1] -11
#> [1] -12
#> [1] -13

Here we are looping over the actual values in the vector. But we can also access the values by their index: we loop over the indexes and use each index in the body of the loop. By convention, the index variable is usually named i. This approach is most common when the vector/list/data frame already exists as an object:

numbers <- c(10, 11, 12, 13) # We create a vector

for (i in 1:4) {             # We iterate over the indexes 1, 2, 3, 4
  print(numbers[i] * -1)     # We access the value using the index notation `[ ]`
}
#> [1] -10
#> [1] -11
#> [1] -12
#> [1] -13

Note that here we ‘hard-coded’ the length of the vector inside the loop. We can generalize this so it will work on vectors of any length by using this syntax:

numbers <- c(10, 11, 12, 13)

length(numbers)
#> [1] 4

for (i in 1:length(numbers)) { # We iterate over 1, 2, 3,...
  print(numbers[i] * -1)
}
#> [1] -10
#> [1] -11
#> [1] -12
#> [1] -13
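
One caveat with 1:length(numbers): if the vector happens to be empty, 1:0 counts down and gives 1, 0, so the loop body would still run twice. A commonly used alternative is base R's seq_along(), sketched below. The name `empty` is just for illustration.

```r
numbers <- c(10, 11, 12, 13)

for (i in seq_along(numbers)) { # seq_along(numbers) gives 1, 2, 3, 4
  print(numbers[i] * -1)
}
#> [1] -10
#> [1] -11
#> [1] -12
#> [1] -13

empty <- numeric(0)             # A vector of length 0 (illustrative)

1:length(empty)                 # Counts down from 1 to 0
#> [1] 1 0

seq_along(empty)                # Nothing to iterate over
#> integer(0)
```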

Storing loop outputs

We have also seen that unless you issue a print() statement which runs on every separate iteration of the loop, the output values simply ‘go away’.

for (a_number in c(10, 11, 12, 13)) {
  a_number * -1
}

Similarly, if we want to actually save the output of the loop in a vector, we need to save an output value on every separate iteration of the loop. This means we have to build the output vector iteration by iteration. Here is a first attempt:

outputs <- vector()                    # We 'initialize' an *empty vector* to hold the outputs

outputs
#> logical(0)

for (a_number in c(10, 11, 12, 13)) {
  outputs <- c(outputs, a_number * -1) # Each time round the loop we *append* a new value to the existing vector
}

outputs
#> [1] -10 -11 -12 -13

This looks fine, but there is a problem. The vector ‘grows’ at each iteration, and this means that, as Jelmer pointed out in the bonus material on loops, ‘R has to create an entirely new object in each iteration of the loop, because the object’s memory requirements keep increasing.’

This is not an issue for the toy vector we are using here, but say you were using a loop to create a data frame, column by column, with thousands of rows, and hundreds of columns. On every iteration the entire data frame would have to be copied and extended, and copied and extended, and…

So how do we avoid that?

The technique is to initialize a vector (or list, or data frame) of the appropriate size for the outputs, which preallocates the memory required to store it. Then, instead of appending to it on each iteration, we write into it on each iteration. The size of the output vector is already fixed, and modifying values in place like this is far more efficient. Again, the trick is to use indexes.

output_vector <- vector(length = 4)

output_vector
#> [1] FALSE FALSE FALSE FALSE

numbers <- c(10, 11, 12, 13)

for (i in 1:4) { 
  output_vector[i] <- numbers[i] * -1
}

output_vector
#> [1] -10 -11 -12 -13
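
To get a feel for the difference, the sketch below times both approaches on a longer input. The names `n`, `grow`, and `prealloc` are illustrative assumptions, and the timings will vary by machine, so no output is shown. It also uses numeric(n) rather than vector(length = n), so the container is numeric from the start instead of starting out as logical and being converted on the first assignment.

```r
n <- 20000                       # Illustrative size; increase it to exaggerate the effect
numbers <- rnorm(n)

# Approach 1: grow the output vector by appending on every iteration
grow <- vector()
system.time(
  for (i in 1:length(numbers)) {
    grow <- c(grow, numbers[i] * -1)
  }
)

# Approach 2: preallocate a numeric vector, then write into it by index
prealloc <- numeric(n)
system.time(
  for (i in 1:length(numbers)) {
    prealloc[i] <- numbers[i] * -1
  }
)

identical(grow, prealloc)        # Check that both approaches produce the same values
```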

Breakout rooms, storing loop outputs

Exercise 1

R has a built-in constant letters, which is a character vector:

letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"

(letters is a bit like iris: it’s a character vector which is ‘just there’, like iris is a data frame which is ‘just there’).

The tidyverse (specifically the stringr package) also has a function str_to_upper() which converts a string to upper case:

str_to_upper("a")
#> [1] "A"

Write a for loop that converts each element of letters to upper case, saving the results by writing the output of each iteration into a preallocated vector.

Hints
What is `letters[1]`?

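Solution
upper_case <- vector("character", length(letters)) # Preallocate a character vector, one slot per letter

for (i in 1:length(letters)) {                     # Iterate over the indexes 1 to 26
  upper_case[i] <- str_to_upper(letters[i])        # Write the result into slot i
}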

Back to normalize

This gives us the machinery to use our own function in a for loop.

First, recall how we can access a column vector using the [[ ]] syntax:

df[[1]]
#>  [1] -1.56695706  0.23880352 -0.52028396 -0.03587572  1.26976225 -1.03948139
#>  [7] -0.21172370  1.90549573 -0.53548764  0.22250909

So we can iteratively access each column in a for loop:

for (i in 1:4) {
  print(normalize(df[[i]]))
}
#>  [1] 0.0000000 0.5200245 0.3014218 0.4409221 0.8169209 0.1519029 0.3902813
#>  [8] 1.0000000 0.2970435 0.5153320
#>  [1] 0.9915841 1.0000000 0.4352500 0.0000000 0.8684414 0.9250426 0.3970564
#>  [8] 0.8335988 0.7141351 0.5377301
#>  [1] 1.0000000 0.5131949 0.4367290 0.1242323 0.5998479 0.8402709 0.0000000
#>  [8] 0.7587425 0.2170914 0.7489915
#>  [1] 0.85264136 0.05580354 0.79127366 0.10586772 0.00000000 1.00000000
#>  [7] 0.43627787 0.63372198 0.44736338 0.42096304

And again, we can generalize this to a data frame with any number of columns (for a data frame, length() returns the number of columns).

for (i in 1:length(df)) {
  print(normalize(df[[i]]))
}
#>  [1] 0.0000000 0.5200245 0.3014218 0.4409221 0.8169209 0.1519029 0.3902813
#>  [8] 1.0000000 0.2970435 0.5153320
#>  [1] 0.9915841 1.0000000 0.4352500 0.0000000 0.8684414 0.9250426 0.3970564
#>  [8] 0.8335988 0.7141351 0.5377301
#>  [1] 1.0000000 0.5131949 0.4367290 0.1242323 0.5998479 0.8402709 0.0000000
#>  [8] 0.7587425 0.2170914 0.7489915
#>  [1] 0.85264136 0.05580354 0.79127366 0.10586772 0.00000000 1.00000000
#>  [7] 0.43627787 0.63372198 0.44736338 0.42096304

Here again, we are just printing the output, not saving it to a new data frame.

So, according to our strategy, we want to create an empty data frame to hold our results. We can use information from our original data frame to do this.

empty_vec <- vector(length = nrow(df)) # Empty vector with correct number of rows

df_norm <- tibble(a = empty_vec, b = empty_vec, c = empty_vec, d = empty_vec)

for (i in 1:length(df)){
  df_norm[[i]] <- normalize(df[[i]])
}

df_norm
#> # A tibble: 10 × 4
#>        a     b     c      d
#>    <dbl> <dbl> <dbl>  <dbl>
#>  1 0     0.992 1     0.853 
#>  2 0.520 1     0.513 0.0558
#>  3 0.301 0.435 0.437 0.791 
#>  4 0.441 0     0.124 0.106 
#>  5 0.817 0.868 0.600 0     
#>  6 0.152 0.925 0.840 1     
#>  7 0.390 0.397 0     0.436 
#>  8 1     0.834 0.759 0.634 
#>  9 0.297 0.714 0.217 0.447 
#> 10 0.515 0.538 0.749 0.421
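
The container above names each column by hand, which ties it to a data frame with exactly the columns a to d. One possible way around that, sketched here as an assumption rather than as the session's own method, is to preallocate a list with one slot per column, fill it in the loop, and convert it to a tibble at the end. The name `out` is hypothetical; df and normalize are recreated so the snippet runs on its own.

```r
library(tidyverse)

normalize <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

out <- vector("list", length(df))   # Preallocate a list with one empty slot per column
names(out) <- names(df)             # Carry over the column names

for (i in 1:length(df)) {
  out[[i]] <- normalize(df[[i]])    # Write the normalized column into slot i
}

df_norm <- as_tibble(out)           # Convert the filled list into a tibble
```

Because nothing in the loop refers to a column name or to the number of columns, the same code should work unchanged on a data frame with more or fewer numeric columns.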

Using a map command

It’s a pain to have to manually set up the ‘container’ that will house your results. Couldn’t the computer do that for us? Yes! All of this housekeeping (the for loop itself and the preallocation of the output) is done behind the scenes as part of the implementation of lapply() and map().

map_norm <- map(df, normalize)

str(map_norm)
#> List of 4
#>  $ a: num [1:10] 0 0.52 0.301 0.441 0.817 ...
#>  $ b: num [1:10] 0.992 1 0.435 0 0.868 ...
#>  $ c: num [1:10] 1 0.513 0.437 0.124 0.6 ...
#>  $ d: num [1:10] 0.8526 0.0558 0.7913 0.1059 0 ...

Notice that the output of map (like lapply) is a list. But we can easily convert it into a data frame:

map_norm_df <- map(df, normalize) %>% 
  as_tibble()

str(map_norm_df)
#> tibble [10 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ a: num [1:10] 0 0.52 0.301 0.441 0.817 ...
#>  $ b: num [1:10] 0.992 1 0.435 0 0.868 ...
#>  $ c: num [1:10] 1 0.513 0.437 0.124 0.6 ...
#>  $ d: num [1:10] 0.8526 0.0558 0.7913 0.1059 0 ...
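
Since lapply() is mentioned alongside map, here is the base R equivalent as a brief sketch. It recreates df and normalize so the snippet runs on its own; `lapply_norm_df` is an illustrative name.

```r
library(tidyverse)

normalize <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

lapply_norm_df <- lapply(df, normalize) %>%  # lapply() also returns a named list
  as_tibble()

map_norm_df <- map(df, normalize) %>%
  as_tibble()

all.equal(lapply_norm_df, map_norm_df)       # Compare the two results
```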

Michael Broe
Bioinformatician at EEOB