S04E12: R for Data Science - Chapter 5.5: mutate
Data Transformation with dplyr, part IV: mutate()
Introduction
Today we are going to cover the dplyr function mutate()
. You can today’s topic in the R 4 Data Science book at: https://r4ds.had.co.nz/transform.html#add-new-variables-with-mutate.
We will again be using the nycflights13
and tidyverse
packages, so we first need make sure these packages are installed, and then load them for the current session by doing library()
commands:
## Only if you haven't done so before, install the packages
install.packages("tidyverse")
install.packages("nycflights13")
## Load the packages
library(tidyverse)
library(nycflights13)
The mutate()
function
The mutate()
function always adds new columns at the end of your dataset. Let us create a small dataset so we can see the new variables:
As a first example, let’s add columns gain
(delay made up) and speed
:
#> # A tibble: 336,776 × 9
#> year month day dep_delay arr_delay distance air_time gain speed
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 2 11 1400 227 -9 370.
#> 2 2013 1 1 4 20 1416 227 -16 374.
#> 3 2013 1 1 2 33 1089 160 -31 408.
#> 4 2013 1 1 -1 -18 1576 183 17 517.
#> 5 2013 1 1 -6 -25 762 116 19 394.
#> 6 2013 1 1 -4 12 719 150 -16 288.
#> 7 2013 1 1 -5 19 1065 158 -24 404.
#> 8 2013 1 1 -3 -14 229 53 11 259.
#> 9 2013 1 1 -3 -8 944 140 5 405.
#> 10 2013 1 1 -2 8 733 138 -10 319.
#> # … with 336,766 more rows
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
transmute()
is very similar to mutate()
, except the returned dataframe only contains the new variables that were created:
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
#> # A tibble: 336,776 × 3
#> gain hours gain_per_hour
#> <dbl> <dbl> <dbl>
#> 1 -9 3.78 -2.38
#> 2 -16 3.78 -4.23
#> 3 -31 2.67 -11.6
#> 4 17 3.05 5.57
#> 5 19 1.93 9.83
#> 6 -16 2.5 -6.4
#> 7 -24 2.63 -9.11
#> 8 11 0.883 12.5
#> 9 5 2.33 2.14
#> 10 -10 2.3 -4.35
#> # … with 336,766 more rows
Useful creation functions to use with mutate
Note that functions must be vectorised before use with mutate()
.
Arithmetic operators: +
(addition), -
(subtraction), *
(multiplication), /
(division), ^
(to the power)
Modular arithmetic: %/%
(integer division) and %%
(remainder). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example:
#> # A tibble: 336,776 × 3
#> dep_time hour minute
#> <int> <dbl> <dbl>
#> 1 517 5 17
#> 2 533 5 33
#> 3 542 5 42
#> 4 544 5 44
#> 5 554 5 54
#> 6 554 5 54
#> 7 555 5 55
#> 8 557 5 57
#> 9 557 5 57
#> 10 558 5 58
#> # … with 336,766 more rows
Logs: log ()
, log2()
, log10()
. Logarithms are useful transformation for dealing with data that ranges across multiple orders of magnitude.
Offsets: lead()
and lag()
. These allow you to refer to leading or lagging values:
(x <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
#> [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
#> [1] 2 3 4 5 6 7 8 9 10 NA
Cumulative and rolling aggregates: R provides functions for sums cumsum()
, products cumprod()
, mins cummin()
, and maxes cummax()
; and dplyr provides function for mean cummean()
.
x
#> [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
#> [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
#> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Logical comparisons: <
(less than), <=
(equal to or less than), >
(greater than), >=
(equal to or greater than), !=
(not equal to), and ==
(equal to).
Ranking: min_rank()
gives the smallest values the small rank, and desc(x)
to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1] 1 2 2 NA 4 5
#> [1] 5 3 3 NA 2 1
There are other ranks functions, too, e.g., row_number()
, dense_rank()
, percent_rank()
, cume_dist()
, ntile()
.
row_number(y)
#> [1] 1 2 3 NA 4 5
dense_rank(y)
#> [1] 1 2 2 NA 3 4
percent_rank(y)
#> [1] 0.00 0.25 0.25 NA 0.75 1.00
Breakout room exercises
These correspond to the exercises in the R4DS book, but we’re skipping exercises 2 and 6.
R4DS Exercise 1
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Hints (click here)
- You can use the following code structure (we first select only relevant columns to more easily check the results):
flights %>%
select(dep_time, sched_dep_time) %>%
mutate(
flights_times,
dep_time_mins = YOUR_CODE_HERE,
sched_dep_time_mins = YOUR_CODE_HERE
)
-
Get the number of hours since midnight using
%/%
(as in the example in the book) and then convert this number to minutes. -
Then, get the number of minutes past the hour using
%%
(again as in the example in the book), and add this number to the previous one to get the total number of minutes past midnight. -
Now, you might think you’re all done, but there is one remaining problem: midnight was originally represented as
2400
, which the code described above would convert to1440
(24 * 60
). But it should be0
instead. This is a tricky one, but you can handle it using an additional%%
calculation.
Solution (click here)
- To handle the midnight case, use
%% 1440
: this would simply return the original value for all values below 1440, and would return0
only for1440
(midnight):
flights %>%
select(dep_time, sched_dep_time) %>%
mutate(
dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
#> # A tibble: 336,776 × 4
#> dep_time sched_dep_time dep_time_mins sched_dep_time_mins
#> <int> <int> <dbl> <dbl>
#> 1 517 515 317 315
#> 2 533 529 333 329
#> 3 542 540 342 340
#> 4 544 545 344 345
#> 5 554 600 354 360
#> 6 554 558 354 358
#> 7 555 600 355 360
#> 8 557 600 357 360
#> 9 557 600 357 360
#> 10 558 600 358 360
#> # … with 336,766 more rows
R4DS Exercise 3
Compare dep_time
, sched_dep_time
, and dep_delay
. How would you expect those three numbers to be related? Test if this is indeed the case.
Hints (click here)
-
We should expect
dep_delay
to equal the difference betweensched_dep_time
anddep_time
. -
To check this, the first step is to use your code from exercise 1 to create columns with the
sched_dep_time
and thedep_time
in minutes past midnight. -
Then, create a column with the difference in minutes between
sched_dep_time
anddep_time
(e.g. calledour_delay_calc
). -
Next, create a column with the difference between
our_delay_calc
anddep_delay
(e.g. calleddelay_diff
). -
Finally, use
filter()
to see if there are any rows where thedelay_diff
does not equal 0. Recall that we would expect no differences at all, if all is well. So what is going on with those rows?
Solution (click here)
flights %>%
select(dep_time, sched_dep_time, dep_delay) %>%
mutate(
dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
sched_dep_time_min = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440,
our_delay_calc = dep_time_min - sched_dep_time_min,
dep_delay_diff = our_delay_calc - dep_delay
) %>%
filter(dep_delay_diff != 0)
#> # A tibble: 1,236 × 7
#> dep_time sched_dep_time dep_delay dep_time_min sched_dep_ti…¹ our_d…² dep_d…³
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 848 1835 853 528 1115 -587 -1440
#> 2 42 2359 43 42 1439 -1397 -1440
#> 3 126 2250 156 86 1370 -1284 -1440
#> 4 32 2359 33 32 1439 -1407 -1440
#> 5 50 2145 185 50 1305 -1255 -1440
#> 6 235 2359 156 155 1439 -1284 -1440
#> 7 25 2359 26 25 1439 -1414 -1440
#> 8 106 2245 141 66 1365 -1299 -1440
#> 9 14 2359 15 14 1439 -1425 -1440
#> 10 37 2230 127 37 1350 -1313 -1440
#> # … with 1,226 more rows, and abbreviated variable names ¹sched_dep_time_min,
#> # ²our_delay_calc, ³dep_delay_diff
These dep_delay_diff
values are all 1440 minutes (= 24 hours), and as you can see, these are flights that were scheduled to depart one day but left only the next day (past midnight).
So, for our dep_time_min
and sched_dep_time_min
to always work for calculations, they should also take dates into account – and should e.g. be minutes past some date in the past.
R4DS Exercise 4
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank()
.
Hints (click here)
-
Use
mutate()
to create a new column with the ranks, then arrange by this column and/or filter only top ranks to get the most delayed flights. -
To see the differences in the handling of ties between
row_number()
,min_rank()
, anddense_rank()
, take a look this:
tibble(delays = c(3, 5, 5, 5, 130, 276),
rank_rownr = row_number(delays),
rank_min = min_rank(delays),
rank_dense = dense_rank(delays))
#> # A tibble: 6 × 4
#> delays rank_rownr rank_min rank_dense
#> <dbl> <int> <int> <int>
#> 1 3 1 1 1
#> 2 5 2 2 2
#> 3 5 3 2 2
#> 4 5 4 2 2
#> 5 130 5 5 3
#> 6 276 6 6 4
Solution (click here)
flights %>%
select(dep_time, sched_dep_time, dep_delay) %>%
mutate(dep_delay_rank = min_rank(desc(dep_delay))) %>%
filter(dep_delay_rank <= 10) %>%
arrange(dep_delay_rank)#> # A tibble: 10 × 4
#> dep_time sched_dep_time dep_delay dep_delay_rank
#> <int> <int> <dbl> <int>
#> 1 641 900 1301 1
#> 2 1432 1935 1137 2
#> 3 1121 1635 1126 3
#> 4 1139 1845 1014 4
#> 5 845 1600 1005 5
#> 6 1100 1900 960 6
#> 7 2321 810 911 7
#> 8 959 1900 899 8
#> 9 2257 759 898 9
#> 10 756 1700 896 10
R4DS Exercise 5
What does 1:3 + 1:10
return? Why?
Hints (click here)
-
First, you should realize that
1:3
expands to a vector with the values1
,2
, and3
, and similarly for1:10
. -
Many R operations are vectorized, which means that when pairing two vectors (including the case where one of those vectors is just a single value), the shorter vector will be recycled. In the example below,
3
is recycled and added to every single value in the vector1:5
:
3 + 1:5
#> [1] 4 5 6 7 8
Solution (click here)
1:3 + 1:10
#> Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object length#> [1] 2 4 6 5 7 9 8 10 12 11
R gives a warning because the length of the longer vector isn’t a multiple of the length of the shorter vector: it recycles 1:3
three times and is then left over with a single value from 1:3
to be paired with 10
. This kind of thing is usually not intended.
Bonus exercise
You can use the paste0()
function to combine strings of text, for instance:
carrier <- "UA"
flight <- 1545
paste0("The full flight number is: ", carrier, flight)
#> [1] "The full flight number is: UA1545"
Use paste0()
inside mutate()
to create a new column flight_no
which has the full flight number (carrier followed by flight, like above) for each flight.
Solution (click here)
#> # A tibble: 336,776 × 3
#> carrier flight flight_no
#> <chr> <int> <chr>
#> 1 UA 1545 UA1545
#> 2 UA 1714 UA1714
#> 3 AA 1141 AA1141
#> 4 B6 725 B6725
#> 5 DL 461 DL461
#> 6 UA 1696 UA1696
#> 7 B6 507 B6507
#> 8 EV 5708 EV5708
#> 9 B6 79 B679
#> 10 AA 301 AA301
#> # … with 336,766 more rows
Acknowledgements
We used https://jrnold.github.io/r4ds-exercise-solutions to provide hints and solutions for the exercises.