Session 12: Vectorization and loops in R

Need for repeat.

Artwork by @allison_horst





Setup

New to Code Club?

  • If you didn’t already do this, please follow the Code Club Computer Setup instructions, which also have pointers for if you’re new to R or RStudio.

  • If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll help you troubleshoot.

Session goals

Today, you will learn:

  • That you should avoid copying your code.
  • What different strategies for iteration exist in R.
  • What vectorization is and how to make use of it.
  • How to write a for loop.
  • Best practices when using for loops.
  • When you should (not) use for loops.
  • Bonus: if statements.


Introduction

Don’t Repeat Yourself

Sometimes, you have a block of code and you need to repeat the operations in that code almost exactly. For instance, you may want to rerun a statistical model with different parameter values, rerun an analysis for a different batch of samples, or extract the same information for many different genes.

Your first instinct may be to copy-and-paste the block of code, and make the necessary slight adjustments in the pasted block. However, iterating and writing your own functions are strategies that are clearer, less error-prone, and more flexible (and these two can also be combined). When the number of repeats are high, iteration is needed. When the code that needs to be repeated is more than a line or two, writing your own functions becomes useful.

Iteration

Loops are the most universal iteration tool and the one we will focus on today. However, R has “functional programming” iteration methods that are less verbose and that can also be quicker to execute. These are the apply family of functions, and a more recent tidyverse approach implemented in the purrr package: we will learn more about those in the two upcoming Code Club sessions.

Loops are still a very good place to start using iteration because they make the iteration explicit and are therefore more intuitive than functional alternatives. In addition, they can easily accommodate longer blocks of code without the need to also write your own function.

Today, we will talk about the most common type of loop: the for loops. (Other types of loops in R are while loops and repeat loops. Related to loops are if statements, see the bonus exercise for some basics.)

But first…

Before we tackle loops we should take a step back and explore vectorization a bit more, which was briefly introduced by Michael in Code Club session 9. Besides functional programming methods, vectorization is the other reason that loops are not as widely used in R as in other programming languages.


I: Vectorization

Let’s say we have a vector (i.e., a collection of values) that consists of distances in miles:

dists_miles <- c(24, 81, 48, 29, 177, 175, 20, 11, 62, 156)

Of course, we can’t science with miles, so we’ll have to convert these distances to kilometers by multiplying each value in the vector by 1.61. You may or may not know that this can be done really easily in R:

dists_km <- dists_miles * 1.61
dists_km
#>  [1]  38.64 130.41  77.28  46.69 284.97 281.75  32.20  17.71  99.82 251.16

What is happening here is called a vectorized operation: 1.61 is automatically recycled as many times as needed to be multiplied with each individual value in the dist_miles vector. This is a pretty unique and very useful feature of R!

In many other languages, we would need a loop or a similar construct to iterate over each value in the vector and multiply by 1.61. In fact, under the hood, R also uses a loop to do this! So does it even make a difference? Yes – the advantages of using vectorization in R are:

  • You don’t have to write the loop, saving you a fair bit of typing and making the code clearer.

  • The under-the-hood-loop is being executed much faster than a loop that you would write. This is because it is written in C/C++ code which only has to be called once (instead of at least as many times as there are iterations in our loop).


Other vectorization patterns

Above, we saw an example of multiplying a vector by a single number. We can also use vectorized operations when both objects contain multiple items. For instance, say we have a vector with corresponding values for two dates:

dists_Mar4 <- c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91)
dists_Mar5 <- c(87, 143, 103, 223, 106, 18, 87, 72, 59, 5)

To get the sum of these values at each position (index) of the two vectors (17 + 87, 93 + 143, etc.), we can simply do the following:

dists_Mar4 + dists_Mar5
#>  [1] 104 236 159 242 281  58 156 339  63  96

The two vectors don’t need to be of equal length, either:
in the example below, we negate every other value in a vector:

c(17, 93, 56, 19, 175, 40, 69, 267, 4, 91) * c(1, -1)
#>  [1]   17  -93   56  -19  175  -40   69 -267    4  -91

This also works for columns of a data frame, which we can extract using the dataframe_name$column_name notation (see Code Club session 9’s section on data frames, and the Base R data frame indexing summary below). Let’s say we wanted the mean distance this time:

dist_df <- data.frame(dists_Mar4, dists_Mar5)

dist_df$dists_mean = (dist_df$dists_Mar4 + dist_df$dists_Mar5) / 2
head(dist_df)
#>   dists_Mar4 dists_Mar5 dists_mean
#> 1         17         87       52.0
#> 2         93        143      118.0
#> 3         56        103       79.5
#> 4         19        223      121.0
#> 5        175        106      140.5
#> 6         40         18       29.0

Vectorization with matrices

Furthermore, we can also perform vectorized operations on entire matrices. With the following matrix:

## We use the "sample" function to get 25 random values between 1 and a 100,
## and put those in a 5*5 matrix:
mat <- matrix(sample(1:100, 25), nrow = 5, ncol = 5)
mat
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    2   55    8   28   24
#> [2,]   81   30   99   33   12
#> [3,]   22   67   41   54    6
#> [4,]   48   84   42   35  100
#> [5,]   57   47   93   10   31

…we could multiple all values by 10 or get the square of each value simply as follows:

mat_more <- mat * 10
mat_more
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   20  550   80  280  240
#> [2,]  810  300  990  330  120
#> [3,]  220  670  410  540   60
#> [4,]  480  840  420  350 1000
#> [5,]  570  470  930  100  310

mat_squared <- mat * mat
mat_squared
#>      [,1] [,2] [,3] [,4]  [,5]
#> [1,]    4 3025   64  784   576
#> [2,] 6561  900 9801 1089   144
#> [3,]  484 4489 1681 2916    36
#> [4,] 2304 7056 1764 1225 10000
#> [5,] 3249 2209 8649  100   961

Vectorization with indices

We can also use vectorized solutions when we want to operate only on elements that satisfy a certain condition.

Let’s say we consider any distance in one of our vectors that is below 50 to be insufficient, and we want to turn those values into negatives (a little harsh maybe, but we go with it).

To do so, we make use of R’s ability to index a vector with a logical vector:

## "not_far_enough" will be a vector of logicals:
not_far_enough <- dists_Mar4 < 50
not_far_enough
#>  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE

## When we index the original vector with a logical vector,
## we get only those values for which "not_far_enough" is TRUE:
dists_Mar4[not_far_enough]
#> [1] 17 19 40  4

With the following syntax, we can replace just those low distances in our original vector:

dists_Mar4[not_far_enough] <- dists_Mar4[not_far_enough] * -1
dists_Mar4
#>  [1] -17  93  56 -19 175 -40  69 267  -4  91

In a simple case like this, we could also use the vectorized ifelse() function:

ifelse(dists_Mar5 < 50, dists_Mar5 * -1, dists_Mar5)
#>  [1]  87 143 103 223 106 -18  87  72  59  -5


II: For loops

While it is important to use vectorization whenever possible, it can only be applied to a specific set of problems. A more universal solution when you need to repeat operations is the for loop. for loops iterate over a collection of values, allowing you to perform one or more actions for each value in the collection.

The basic syntax is as follows:

for (variable_name in collection_name) {
  #...do things for each item (variable_name) in the collection, one at a time...
}

On the first line, you initialize the for loop, telling it to assign each item in the collection to a variable (here, variable_name) one at a time.

The variable name is arbitrary, and the collection is whatever you want to loop over. However, for, the parentheses (), in, and the curly braces {} are all fixed elements of for loops. A simple example will help to understand the synax:

## A loop to print negated values:
for (one_number in c(1, 2, 3, 4)) {  # We iterate over 1, 2, 3, 4
  print(one_number * -1)             # Multiply each number by -1
}
#> [1] -1
#> [1] -2
#> [1] -3
#> [1] -4

Note that we don’t have to use the variable that we are looping over: we could also use a for loop as a roundabout way to simply repeat something as many times as there are values in our collection:

for (dummy in c(1, 2, 3, 4)) {
  print("Yes!")                     # Print "Yes!" in each of our four iterations 
}
#> [1] "Yes!"
#> [1] "Yes!"
#> [1] "Yes!"
#> [1] "Yes!"

As mentioned, the variable name that we assign is arbitrary: we could use anything, as long as we reference it with the same name inside the loop:

## Example 1 with a different variable name: "positive_number"
for (positive_number in c(1, 2, 3, 4)) {
  print(positive_number * -1)
}
#> [1] -1
#> [1] -2
#> [1] -3
#> [1] -4

## Example 2 with a different variable name: "i"
for (i in c(1, 2, 3, 4)) {
  print(i * -1)
}
#> [1] -1
#> [1] -2
#> [1] -3
#> [1] -4

Note that the variable as it was last assigned in the loop does persist in your environment:

i
#> [1] 4

The curly braces are not strictly necessary for one-liners like this:

for (i in 1:4) print(i * -1)
#> [1] -1
#> [1] -2
#> [1] -3
#> [1] -4

for loop output

Note that we need the print() function to print anything to screen – nothing will be printed if we omit this:

for (i in 1:4) {
  i * -1
}

Similarly, if we want the output to be saved in an object of some kind, we need to explicitly make an assignment in each iteration of the loop. This is where we need to start paying attention to the design of our loop. Unless computational speed is of no concern at all, you should avoid growing an object in each iteration of the loop.

For example, you might be inclined to do the following if you wanted to compute the median of each column in a data frame:

## We initialize a vector in which we collect the column medians:
column_medians <- vector()

for (column_number in 1:ncol(dist_df)) {
  
  ## We extract one column using "dataframe_name[[column_number]]":
  column_median <- median(dist_df[[column_number]])
  
  ## We add the single-column median to our vector of medians:
  column_medians <- c(column_medians, column_median)
}

column_medians
#> [1] 62.50 87.00 78.75

Similarly, you may be tempted to add a column (with cbind()) or a row (with rbind()) to a data frame in each iteration of the loop. However, the problem with these approaches is that R has to create an entirely new object in each iteration of the loop, because the object’s memory requirements keep increasing.

Instead, you’ll want to give the final vector (here, column_medians) the appropriate size before you start the loop:

column_medians <- vector(length = ncol(dist_df))

for (column_number in 1:ncol(dist_df)) {
  column_median <- median(dist_df[[column_number]])
  column_medians[column_number] <- column_median
}

Note that for very small problems, such as the example above, there will not be a noticeable difference in computation time between pre-assigning a properly sized object versus growing an object inside the loop. However, it is still good to get into the habit of pre-assigning an object of the right size.


Summary guidelines (when speed is an issue)

  • Don’t use a loop when you can instead use vectorized operations.
  • Don’t grow objects inside the loop. Instead, pre-assign an object large enough to contain all output of the loop and fill it in inside the loop.
  • When you write a loop, avoid doing things inside the loop that don’t need to be repeated.

Learning about how to create your own functions and/or to use functional programming techniques like purrr and the apply family of functions (upcoming Code Club sessions!) will likely reduce your reliance on loops. For instance, as we’ll see next week, computing the median of each column in a data frame can be done much more succinctly with apply().

Even for more experienced users, loops remain a more viable option when longer blocks of code need to be repeated: we will practice with that in the exercises.



Breakout rooms!

For the exercises, you can download an R Markdown file with some code to get set up (I recommend coding in that document to get a nice overview of the maps that you will make):

dir.create('S12')
todays_rmd <- 'https://raw.githubusercontent.com/biodash/biodash.github.io/master/content/codeclub/12_loops/exercises.Rmd'
download.file(url = todays_rmd, destfile = 'S12/exercises.Rmd')

The following code is already in your R Markdown file, which will download and read the bird dataset and the necessary packages:

## Download the file with bird data:
birds_url <- 'https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/birds/backyard-birds_sample_error.tsv'
birds_file <- 'backyard-birds_sample_error.tsv'
download.file(url = birds_url, destfile = birds_file)
## Load the tidyverse:
library(tidyverse)

## Read the file with bird data:
birds <- read_tsv(birds_file)

## Load the maps package and get the state map:
# install.packages('maps')   # first install if necessary
library(maps)
states <- map_data("state")

Last week, we learned about making maps. If you attended one of the first few Code Club sessions, you’ll recall our Great Backyard Birdcount data set. Here, we’ll use a country-wide random subset of this data (the full file is over 4 GB) to see where Carolina Chickadees were seen:

## With this line, we select only the rows where the column "species_en"
## (English species name) equals "Carolina Chickadee",
## i.e. we are getting just the records for the Carolina Chickadee:
caro_chickadee <- birds[birds$species_en == 'Carolina Chickadee', ]

## Or in tidyverse-speak:
# caro_chickadee <- birds %>% filter(species_en == 'Carolina Chickadee')

# Next, we create a map much like we did last week:
ggplot(data = states,                               # Use "states" for underlying map
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "black", fill = "white") +   # Black state outlines, white fill
  geom_point(data = caro_chickadee,                 # Plot points from bird data set
             aes(x = long, y = lat, group = NULL),
             color = "green4", alpha = 0.5) +       # Green points, somewhat transparent
  coord_fixed(1.3) +                                # Fix projection
  labs(title = 'Carolina Chickadee')

Uh-oh! Something appears to have gone wrong. In the first exercise, you’ll use vectorization to fix the coordinates in the bird data set.

In the second exercise, you’ll use a loop to quickly produce similar plots for several other species.

Exercise 1: Vectorization

Try to fix the coordinates using vectorized operations, and recreate the map to see if it worked.

  • Start with the latitude, which is wrong for all points.
Hints (click here)

  • You’ll need to modify the caro_chickadee data frame, while you can keep the plotting code exactly the same.

  • Simply prepending the latitude column with a minus sign (-) will negate the values.

  • Equivalent base R ways to refer to the column with latitudes are caro_chickadee$lat and caro_chickadee[['lat']].


Solution (click here)

First we fix the latitude, which was simply negated:

caro_chickadee$lat <- -caro_chickadee$lat

## Or equivalently:
# caro_chickadee[['lat']] <- -caro_chickadee[['lat']]

## Or a tidyverse way of doing this:
# caro_chickadee <- caro_chickadee %>% mutate(lat = -lat)

Create the first map with the same code as the example:

ggplot(data = states,
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "black", fill = "white") +
  geom_point(data = caro_chickadee,
             aes(x = long, y = lat, group = NULL),
             color = "green4", alpha = 0.5) +
  coord_fixed(1.3) +
  labs(title = 'Carolina Chickadee')


  • Once you have fixed the latitude, you should notice that for one state, there is a problem with the longitude (the offset is 10 decimal degrees).
Hints (click here)

  • The displaced state is North Carolina.

  • The state of each sighting is in the stateProvince column, and North Carolina’s name is simply “North Carolina” in that column.

  • It may help to first create a logical vector indicating whether for each row in the caro_chickadee data frame, stateProvincefor equals “North Carolina”.

  • Your final map will look nicer if you get rid of the plotting canvas by adding
    + theme_void() to the code for the plot.


Solution (click here)

It turns out that North Carolina’s chickadees are above the Atlantic. Let’s perform a rescue operation by fixing the longitudes, which are offset by 10 degrees, just for North Carolina:

## Get a vector of logicals, indicating which rows are from North Carolina:
NC_rows <- caro_chickadee$stateProvince == "North Carolina"

## Only for North Carolina rows, change the longitude:
caro_chickadee$long[NC_rows] <- caro_chickadee$long[NC_rows] - 10

## Or with ifelse in one line:
# caro_chickadee$long <- ifelse(caro_chickadee$stateProvince == "North Carolina",
#                               caro_chickadee$long - 10,
#                               caro_chickadee$long)

## Or with mutate and ifelse:
# caro_chickadee %>%
#   mutate(long = ifelse(stateProvince == "North Carolina", long - 10, long))

And we create the final map:

ggplot(data = states,
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "black", fill = "white") +
  geom_point(data = caro_chickadee,
             aes(x = long, y = lat, group = NULL),
             color = "green4", alpha = 0.5) +
  coord_fixed(1.3) +
  labs(title = 'Carolina Chickadee') +
  theme_void()

Nice!


Exercise 2: for loops

Find the 10 most commonly observed bird species in the data set, and save their English names (found in the species_en column) in a vector.

Feel free to check out the solution if you’re not sure how, because the focus here is on the next step: trying to create a loop.

Solution (click here)

top10 <- birds %>%
  count(species_en, sort = TRUE) %>%  # Produces a sorted count table for "species_en"
  pull(species_en) %>%                # Extracts the "species_en" column
  head(n = 10)                        # Take the top 10


Next, loop over the top-10 species to produce a plot for each one of them.
Start with the code for the Carolina Chickadee, including the subsetting operation, and modify that.

Hints (click here)

  • In the subsetting operation where you select data for the focal species, replace “Carolina Chickadee” with whatever you name the variable (indicating an individual species) that you loop over.

    Because this is a variable name, and not a string like “Carolina Chickadee”, don’t forget to omit the quotes.

  • You’ll also need to change the title with the looping variable.


Solution (click here)
for (one_species in top10) {

## Select just the data for one species:
one_bird_data <- birds[birds$species_en == one_species, ]
## Or in tidyverse-speak:
# one_bird_data <- birds %>% filter(species_en == one_species)

p <- ggplot(data = states,
            mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "black", fill = "white") +
  geom_point(data = one_bird_data,
             aes(x = long, y = lat, group = NULL),
             color = "green4", alpha = 0.5) +
  coord_fixed(1.3) +
  labs(title = one_species) +  # Make sure to change this to the looping variable
  theme_void()
print(p)
}


Bonus exercise: if statements

if statements are similar in syntax to for loops, and are also considered a “control flow” structure. But their purpose is different from loops: instead of iterating, if statements do something once and they only do it when a condition is fulfilled.

For instance, we may want to check in a script whether a certain directory (folder) exists on our computer, and if it doesn’t, then we create the directory:

if (dir.exists('path/to/my/dir')) {
  warning("Oh my, the output directory doesn't exist yet!")
  dir.create('path/to/my/dir')
}

Inside the parentheses () after if should be a statement that evaluates to either TRUE or FALSE (dir.exists() will be TRUE if the directory exists, and FALSE if it does not). If it is TRUE, whatever is inside the curly braces {} will be executed, and if it is FALSE, what is inside the curly braces will be ignored.

if statements are commonly combined with for loops: we may want to only execute the functions in our loop for items in our collection that fulfill a certain condition:

for (one_number in 1:10) {
  if(one_number > 7) {
    print(one_number)
  }
}
#> [1] 8
#> [1] 9
#> [1] 10

In the example above, one number > 7 will only be TRUE for numbers larger than 7. This example is quite contrived, as it would have been easier (and faster!) to remove these items from the vector before the loop, but it hopefully gets the point across of how an if statement works.

Many of the maps we produced in the previous exercise looked quite similar, with most species very widespread and a few restricted to the east of the US. Maybe if we select species that haven’t been seen in Ohio, we can find some other distributional patterns.

First, select the the top 50 most observed bird species, just like you did in exercise 2.

Then, use an if statement to create plots only for those top-50 birds that have not been seen in Ohio.

Solution (click here)
  • Select the top-50 birds:
all_species <- birds %>%
  count(species_en, sort = TRUE) %>%
  pull(species_en) %>%
  head(n = 50)
  • Loop over the species:
for (one_species in all_species) {

  ## Select the focal species:
  one_bird <- birds[birds$species_en == one_species, ]
  
  ## Create a data frame with only records from Ohio:
  one_bird_ohio <- one_bird[one_bird$stateProvince == 'Ohio', ]

  ## Test whether the data frame with only records from Ohio has any rows.
  ## If it does not, we create the map for the species in question: 
  if(nrow(one_bird_ohio) == 0) {
  
    p <- ggplot(data = states,
           mapping = aes(x = long, y = lat, group = group)) +
      geom_polygon(color = "black", fill = "white") +
      geom_point(data = one_bird,
                 aes(x = long, y = lat, group = NULL),
                 color = "green4", alpha = 0.5) +
      coord_fixed(1.3) +
      labs(title = one_species) +
      theme_void()
    print(p)
    }
}




Going further

Base R data frame indexing

Extract a column as a vector:

## By name:
birds$lat
birds[['lat']]   # Equivalent, $ notation is shorthand

## By index (column number):
birds[[8]]

Extract one or more columns as a data frame using [row, column] notation,
with a leading comma ([, column]) meaning all rows:

## By name:
birds[, 'lat']   # dataframe['row_name', 'column_name']
birds[, c('lat', 'long')]

## By index (column numbers):
birds[, 8]       # dataframe[row_number, column_number]
birds[, c(8, 9)]

Subset rows by a condition, with a trailing comma ([row, ]) meaning all columns:

birds[birds$lat > 25, ]
birds[birds$species_en == 'Carolina Chickadee', ]

seq_along()

To loop over column indices, we have used 1:ncol() above, and to loop over vector indices, you could similarly use 1:length().

However, an alternative is seq_along(), which will create an index for you.

birds <- c('titmouse', 'chickadee', 'cardinal')
seq_along(birds)
#> [1] 1 2 3

The advantage of seq_along() is thtat it will behave better when your vector accidentally has length 0 (because 1:length() will have 1 and 0 when the length is 0, and you’ll get odd-seeming errors).

Further reading


Jelmer Poelstra
Jelmer Poelstra
Bioinformatician at MCIC