S04E12: R for Data Science - Chapter 5.5: mutate

Data Transformation with dplyr, part IV: mutate()

Artwork by Allison Horst


Introduction

Today we are going to cover the dplyr function mutate(). You can today’s topic in the R 4 Data Science book at: https://r4ds.had.co.nz/transform.html#add-new-variables-with-mutate.

We will again be using the nycflights13 and tidyverse packages, so we first need make sure these packages are installed, and then load them for the current session by doing library() commands:

## Only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
## Load the packages
library(tidyverse)
library(nycflights13)


The mutate() function

The mutate() function always adds new columns at the end of your dataset. Let us create a small dataset so we can see the new variables:

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)

As a first example, let’s add columns gain (delay made up) and speed:

#> # A tibble: 336,776 × 9
#>     year month   day dep_delay arr_delay distance air_time  gain speed
#>    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
#>  1  2013     1     1         2        11     1400      227    -9  370.
#>  2  2013     1     1         4        20     1416      227   -16  374.
#>  3  2013     1     1         2        33     1089      160   -31  408.
#>  4  2013     1     1        -1       -18     1576      183    17  517.
#>  5  2013     1     1        -6       -25      762      116    19  394.
#>  6  2013     1     1        -4        12      719      150   -16  288.
#>  7  2013     1     1        -5        19     1065      158   -24  404.
#>  8  2013     1     1        -3       -14      229       53    11  259.
#>  9  2013     1     1        -3        -8      944      140     5  405.
#> 10  2013     1     1        -2         8      733      138   -10  319.
#> # … with 336,766 more rows
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

transmute() is very similar to mutate(), except the returned dataframe only contains the new variables that were created:

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
#> # A tibble: 336,776 × 3
#>     gain hours gain_per_hour
#>    <dbl> <dbl>         <dbl>
#>  1    -9 3.78          -2.38
#>  2   -16 3.78          -4.23
#>  3   -31 2.67         -11.6 
#>  4    17 3.05           5.57
#>  5    19 1.93           9.83
#>  6   -16 2.5           -6.4 
#>  7   -24 2.63          -9.11
#>  8    11 0.883         12.5 
#>  9     5 2.33           2.14
#> 10   -10 2.3           -4.35
#> # … with 336,766 more rows


Useful creation functions to use with mutate

Note that functions must be vectorised before use with mutate().

Arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), ^ (to the power)

Modular arithmetic: %/% (integer division) and %% (remainder). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example:

transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
#> # A tibble: 336,776 × 3
#>    dep_time  hour minute
#>       <int> <dbl>  <dbl>
#>  1      517     5     17
#>  2      533     5     33
#>  3      542     5     42
#>  4      544     5     44
#>  5      554     5     54
#>  6      554     5     54
#>  7      555     5     55
#>  8      557     5     57
#>  9      557     5     57
#> 10      558     5     58
#> # … with 336,766 more rows

Logs: log (), log2(), log10(). Logarithms are useful transformation for dealing with data that ranges across multiple orders of magnitude.

Offsets: lead() and lag(). These allow you to refer to leading or lagging values:

(x <- 1:10)
#>  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
#>  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
#>  [1]  2  3  4  5  6  7  8  9 10 NA

Cumulative and rolling aggregates: R provides functions for sums cumsum(), products cumprod(), mins cummin(), and maxes cummax(); and dplyr provides function for mean cummean().

x
#>  [1]  1  2  3  4  5  6  7  8  9 10
cumsum(x)
#>  [1]  1  3  6 10 15 21 28 36 45 55
cummean(x)
#>  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

Logical comparisons: < (less than), <= (equal to or less than), > (greater than), >= (equal to or greater than), != (not equal to), and == (equal to).

Ranking: min_rank() gives the smallest values the small rank, and desc(x) to give the largest values the smallest ranks.

y <- c(1, 2, 2, NA, 3, 4)
#> [1]  1  2  2 NA  4  5
#> [1]  5  3  3 NA  2  1

There are other ranks functions, too, e.g., row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().

#> [1]  1  2  3 NA  4  5
#> [1]  1  2  2 NA  3  4
#> [1] 0.00 0.25 0.25   NA 0.75 1.00


Breakout room exercises

These correspond to the exercises in the R4DS book, but we’re skipping exercises 2 and 6.

R4DS Exercise 1

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

Hints (click here)
  • You can use the following code structure (we first select only relevant columns to more easily check the results):
flights %>% 
  select(dep_time, sched_dep_time) %>% 
  mutate(
    flights_times,
    dep_time_mins = YOUR_CODE_HERE,
    sched_dep_time_mins = YOUR_CODE_HERE
    )
  • Get the number of hours since midnight using %/% (as in the example in the book) and then convert this number to minutes.

  • Then, get the number of minutes past the hour using %% (again as in the example in the book), and add this number to the previous one to get the total number of minutes past midnight.

  • Now, you might think you’re all done, but there is one remaining problem: midnight was originally represented as 2400, which the code described above would convert to 1440 (24 * 60). But it should be 0 instead. This is a tricky one, but you can handle it using an additional %% calculation.

Solution (click here)
  • To handle the midnight case, use %% 1440: this would simply return the original value for all values below 1440, and would return 0 only for 1440 (midnight):
flights %>%
  select(dep_time, sched_dep_time) %>% 
  mutate(
    dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
    )
#> # A tibble: 336,776 × 4
#>    dep_time sched_dep_time dep_time_mins sched_dep_time_mins
#>       <int>          <int>         <dbl>               <dbl>
#>  1      517            515           317                 315
#>  2      533            529           333                 329
#>  3      542            540           342                 340
#>  4      544            545           344                 345
#>  5      554            600           354                 360
#>  6      554            558           354                 358
#>  7      555            600           355                 360
#>  8      557            600           357                 360
#>  9      557            600           357                 360
#> 10      558            600           358                 360
#> # … with 336,766 more rows

R4DS Exercise 3

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related? Test if this is indeed the case.

Hints (click here)
  • We should expect dep_delay to equal the difference between sched_dep_time and dep_time.

  • To check this, the first step is to use your code from exercise 1 to create columns with the sched_dep_time and the dep_time in minutes past midnight.

  • Then, create a column with the difference in minutes between sched_dep_time and dep_time (e.g. called our_delay_calc).

  • Next, create a column with the difference between our_delay_calc and dep_delay (e.g. called delay_diff).

  • Finally, use filter() to see if there are any rows where the delay_diff does not equal 0. Recall that we would expect no differences at all, if all is well. So what is going on with those rows?

Solution (click here)
flights %>% 
  select(dep_time, sched_dep_time, dep_delay) %>% 
  mutate(
    dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    sched_dep_time_min = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440,
    our_delay_calc = dep_time_min - sched_dep_time_min,
    dep_delay_diff = our_delay_calc - dep_delay
  ) %>% 
  filter(dep_delay_diff != 0)
#> # A tibble: 1,236 × 7
#>    dep_time sched_dep_time dep_delay dep_time_min sched_dep_ti…¹ our_d…² dep_d…³
#>       <int>          <int>     <dbl>        <dbl>          <dbl>   <dbl>   <dbl>
#>  1      848           1835       853          528           1115    -587   -1440
#>  2       42           2359        43           42           1439   -1397   -1440
#>  3      126           2250       156           86           1370   -1284   -1440
#>  4       32           2359        33           32           1439   -1407   -1440
#>  5       50           2145       185           50           1305   -1255   -1440
#>  6      235           2359       156          155           1439   -1284   -1440
#>  7       25           2359        26           25           1439   -1414   -1440
#>  8      106           2245       141           66           1365   -1299   -1440
#>  9       14           2359        15           14           1439   -1425   -1440
#> 10       37           2230       127           37           1350   -1313   -1440
#> # … with 1,226 more rows, and abbreviated variable names ¹​sched_dep_time_min,
#> #   ²​our_delay_calc, ³​dep_delay_diff

These dep_delay_diff values are all 1440 minutes (= 24 hours), and as you can see, these are flights that were scheduled to depart one day but left only the next day (past midnight).

So, for our dep_time_min and sched_dep_time_min to always work for calculations, they should also take dates into account – and should e.g. be minutes past some date in the past.

R4DS Exercise 4

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

Hints (click here)
  • Use mutate() to create a new column with the ranks, then arrange by this column and/or filter only top ranks to get the most delayed flights.

  • To see the differences in the handling of ties between row_number(), min_rank(), and dense_rank(), take a look this:

tibble(delays = c(3, 5, 5, 5, 130, 276),
       rank_rownr = row_number(delays),
       rank_min = min_rank(delays),
       rank_dense = dense_rank(delays))
#> # A tibble: 6 × 4
#>   delays rank_rownr rank_min rank_dense
#>    <dbl>      <int>    <int>      <int>
#> 1      3          1        1          1
#> 2      5          2        2          2
#> 3      5          3        2          2
#> 4      5          4        2          2
#> 5    130          5        5          3
#> 6    276          6        6          4
Solution (click here)
flights %>%
  select(dep_time, sched_dep_time, dep_delay) %>%
  mutate(dep_delay_rank = min_rank(desc(dep_delay))) %>% 
  filter(dep_delay_rank <= 10) %>%
  arrange(dep_delay_rank)#> # A tibble: 10 × 4
#>    dep_time sched_dep_time dep_delay dep_delay_rank
#>       <int>          <int>     <dbl>          <int>
#>  1      641            900      1301              1
#>  2     1432           1935      1137              2
#>  3     1121           1635      1126              3
#>  4     1139           1845      1014              4
#>  5      845           1600      1005              5
#>  6     1100           1900       960              6
#>  7     2321            810       911              7
#>  8      959           1900       899              8
#>  9     2257            759       898              9
#> 10      756           1700       896             10

R4DS Exercise 5

What does 1:3 + 1:10 return? Why?

Hints (click here)
  • First, you should realize that 1:3 expands to a vector with the values 1, 2, and 3, and similarly for 1:10.

  • Many R operations are vectorized, which means that when pairing two vectors (including the case where one of those vectors is just a single value), the shorter vector will be recycled. In the example below, 3 is recycled and added to every single value in the vector 1:5:

3 + 1:5
#> [1] 4 5 6 7 8
Solution (click here)
1:3 + 1:10
#> Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object length#>  [1]  2  4  6  5  7  9  8 10 12 11

R gives a warning because the length of the longer vector isn’t a multiple of the length of the shorter vector: it recycles 1:3 three times and is then left over with a single value from 1:3 to be paired with 10. This kind of thing is usually not intended.

Bonus exercise

You can use the paste0() function to combine strings of text, for instance:

carrier <- "UA"
flight <- 1545
paste0("The full flight number is: ", carrier, flight)
#> [1] "The full flight number is: UA1545"

Use paste0() inside mutate() to create a new column flight_no which has the full flight number (carrier followed by flight, like above) for each flight.

Solution (click here)
flights %>%
  select(carrier, flight) %>%
  mutate(flight_no = paste0(carrier, flight))
#> # A tibble: 336,776 × 3
#>    carrier flight flight_no
#>    <chr>    <int> <chr>    
#>  1 UA        1545 UA1545   
#>  2 UA        1714 UA1714   
#>  3 AA        1141 AA1141   
#>  4 B6         725 B6725    
#>  5 DL         461 DL461    
#>  6 UA        1696 UA1696   
#>  7 B6         507 B6507    
#>  8 EV        5708 EV5708   
#>  9 B6          79 B679     
#> 10 AA         301 AA301    
#> # … with 336,766 more rows




Acknowledgements

We used https://jrnold.github.io/r4ds-exercise-solutions to provide hints and solutions for the exercises.


Stephen Opiyo
Stephen Opiyo
Biostatistician at MCIC