Code Club S02E05: Intro to the Tidyverse (Part 2)


Prep homework

Basic computer setup

  • If you haven’t already done so, please follow the Code Club Computer Setup instructions, which also include pointers for those who are new to R or RStudio.

  • If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll troubleshoot.



Introduction

What we will go over today

  • We will continue using the dplyr package, which is part of the tidyverse and was introduced last week.
  • Learn to use arrange(), which orders the rows of a data frame by the values of selected columns.
  • Learn to use mutate(), which adds new variables and preserves existing ones.


1 - What is the dplyr package?

dplyr is one of the tidyverse packages that are designed for data science. dplyr provides functions for data manipulation.

Functions for row-wise operations include:

  • filter() - chooses rows based on column values.
  • slice() - chooses rows based on location.
  • arrange() - orders the rows of a data frame by the values of selected columns.

Functions for column-wise operations include:

  • select() - changes whether or not a column is included.
  • rename() - changes the name of columns.
  • mutate() - changes the values of columns and creates new columns.
  • relocate() - changes the order of the columns.

Functions for groups of rows include:

  • summarise() - collapses a group into a single row.

Last week, we were introduced to the tidyverse and covered the %>% pipe, select(), and filter(). We saw that packages are basically R add-ons that contain additional functions or datasets we can use. Using the function install.packages(), we can install packages that are available at the Comprehensive R Archive Network, or CRAN.

For those who have not installed the tidyverse, let’s install it. We only need to do this once, so if you did this last week, you don’t need to now.

install.packages("tidyverse")

To use dplyr, we need to load it with library(). Loading the tidyverse loads dplyr along with the other core tidyverse packages:

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
#> ✔ tibble  3.1.4     ✔ dplyr   1.0.7
#> ✔ tidyr   1.1.3     ✔ stringr 1.4.0
#> ✔ readr   2.0.1     ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
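
As a quick recap of last week’s material, here is a minimal sketch of the %>% pipe combined with filter() and select(). The birds tibble and its column names are made up for illustration and are not part of any package.

```r
library(dplyr)

# A small made-up tibble (hypothetical data, for illustration only)
birds <- tibble(
  name   = c("a", "b", "c", "d"),
  mass_g = c(3200, 4100, 2900, 5000),
  site   = c("north", "south", "north", "south")
)

# The pipe passes its left-hand side as the first argument of the function
# on its right, so these two calls do the same thing
piped  <- birds %>% filter(mass_g > 3000)
nested <- filter(birds, mass_g > 3000)
identical(piped, nested)

# Chain several verbs: keep the heavier birds, then keep two columns
heavy <- birds %>%
  filter(mass_g > 3000) %>%
  select(name, mass_g)
heavy
```

Because dplyr is loaded, filter() here refers to dplyr::filter() rather than stats::filter(), which is what the conflict message above is telling us.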


2 - Using the arrange() function

We will learn how to use the arrange() function from dplyr to sort a dataframe in multiple ways. First, we will sort a dataframe by the values of a single variable, and then we will learn how to sort it by more than one variable. By default, dplyr’s arrange() sorts in ascending order (lowest values first).

Let’s get set up and grab some data so that we have some material to work with.

We will use the same palmerpenguins data we used last week. To get this data, we need to install the palmerpenguins package (again, no need to do this if you already did so last week):

install.packages("palmerpenguins")

Then, to use the package, we need to load it in R with the library() function:

library(palmerpenguins)

The dataframe we will use today is called penguins. Let’s take a look at the structure of the data:

# look at the first 10 rows and all columns
head(penguins, 10)
#> # A tibble: 10 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           39.5          17.4               186        3800
#>  3 Adelie  Torgersen           40.3          18                 195        3250
#>  4 Adelie  Torgersen           NA            NA                  NA          NA
#>  5 Adelie  Torgersen           36.7          19.3               193        3450
#>  6 Adelie  Torgersen           39.3          20.6               190        3650
#>  7 Adelie  Torgersen           38.9          17.8               181        3625
#>  8 Adelie  Torgersen           39.2          19.6               195        4675
#>  9 Adelie  Torgersen           34.1          18.1               193        3475
#> 10 Adelie  Torgersen           42            20.2               190        4250
#> # … with 2 more variables: sex <fct>, year <int>
# check the structure of penguins
# glimpse() is a dplyr function that works
# similarly to str() and can be used in much the same way
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Okay, now we have a sense of what the penguins dataset is.

Now we want to sort the penguins dataframe by body mass, to quickly see which penguins are the lightest and what their values are for the other variables. We will use the pipe operator %>% to feed the data to the arrange() function. We then specify the name of the variable that we want to sort the dataframe by.

In this example, we are sorting by variable body_mass_g, so we will see the lightest penguins at the top of the dataframe:

penguins %>%           # take the penguins data
  arrange(body_mass_g) # sort the dataframe in ascending order based on body mass
#> # A tibble: 344 × 8
#>    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
#>  1 Chinstrap Dream               46.9          16.6              192        2700
#>  2 Adelie    Biscoe              36.5          16.6              181        2850
#>  3 Adelie    Biscoe              36.4          17.1              184        2850
#>  4 Adelie    Biscoe              34.5          18.1              187        2900
#>  5 Adelie    Dream               33.1          16.1              178        2900
#>  6 Adelie    Torgersen           38.6          17                188        2900
#>  7 Chinstrap Dream               43.2          16.6              187        2900
#>  8 Adelie    Biscoe              37.9          18.6              193        2925
#>  9 Adelie    Dream               37.5          18.9              179        2975
#> 10 Adelie    Dream               37            16.9              185        3000
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

If we want to sort in descending order, so that the heaviest penguins are in the first rows, we can add a - in front of the variable:

penguins %>%             # take the penguins data
  arrange(-body_mass_g)  # sort the dataframe in descending order based on body mass
#> # A tibble: 344 × 8
#>    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#>  1 Gentoo  Biscoe           49.2          15.2               221        6300
#>  2 Gentoo  Biscoe           59.6          17                 230        6050
#>  3 Gentoo  Biscoe           51.1          16.3               220        6000
#>  4 Gentoo  Biscoe           48.8          16.2               222        6000
#>  5 Gentoo  Biscoe           45.2          16.4               223        5950
#>  6 Gentoo  Biscoe           49.8          15.9               229        5950
#>  7 Gentoo  Biscoe           48.4          14.6               213        5850
#>  8 Gentoo  Biscoe           49.3          15.7               217        5850
#>  9 Gentoo  Biscoe           55.1          16                 230        5850
#> 10 Gentoo  Biscoe           49.5          16.2               229        5800
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
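
The minus sign is meant for numeric columns. dplyr also provides the desc() helper, which gives the same result for numbers and additionally works for character columns. The sketch below uses a small made-up tibble (toy and its columns are hypothetical) and also shows that arrange() places missing values at the end in either direction.

```r
library(dplyr)

# A small made-up tibble with one missing body mass (hypothetical data)
toy <- tibble(
  id      = c("p1", "p2", "p3", "p4"),
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  mass_g  = c(3750, NA, 2700, 5000)
)

# Two ways to sort a numeric column in descending order
by_minus <- toy %>% arrange(-mass_g)
by_desc  <- toy %>% arrange(desc(mass_g))
identical(by_minus, by_desc)

# The row with the missing mass (p2) ends up last,
# both in ascending and in descending order
toy %>% arrange(mass_g)
by_desc

# desc() also works on a character column, where a minus sign would fail
by_species <- toy %>% arrange(desc(species))
by_species
```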

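We can also sort by more than one variable: arrange() sorts by the first variable, and uses the next variable to break ties. Below, we sort by bill length followed by bill depth, and then pipe the sorted data into the filter() function covered last week to keep only penguins weighing more than 5000 g: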

penguins_new <-     # assign the results to a dataframe `penguins_new`
  penguins %>%                               # take the penguins data
  arrange(bill_length_mm, bill_depth_mm) %>% # sort by bill length followed by bill depth
  filter(body_mass_g > 5000)                 # keep only penguins with a body mass > 5000 g
head(penguins_new, 5)     # look at the first 5 rows
#> # A tibble: 5 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Gentoo  Biscoe           44.4          17.3              219        5250 male 
#> 2 Gentoo  Biscoe           44.9          13.3              213        5100 fema…
#> 3 Gentoo  Biscoe           45            15.4              220        5050 male 
#> 4 Gentoo  Biscoe           45.1          14.5              207        5050 fema…
#> 5 Gentoo  Biscoe           45.2          14.8              212        5200 fema…
#> # … with 1 more variable: year <int>

Let’s check the counts of the different species and islands in our new dataset:

penguins_new %>%
  count(species)
#> # A tibble: 1 × 2
#>   species     n
#>   <fct>   <int>
#> 1 Gentoo     61
penguins_new %>%
  count(island)
#> # A tibble: 1 × 2
#>   island     n
#>   <fct>  <int>
#> 1 Biscoe    61

You can see that we have only retained Gentoo penguins from Biscoe island.
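
When sorting by more than one variable, the order in which the variables are given to arrange() matters. Here is a minimal sketch on a made-up tibble (toy and its column names are hypothetical), in which swapping the two sorting variables changes the row order:

```r
library(dplyr)

# A small made-up tibble with ties in length_mm (hypothetical data)
toy <- tibble(
  id        = c("p1", "p2", "p3", "p4"),
  length_mm = c(45, 40, 45, 40),
  depth_mm  = c(15, 18, 14, 17)
)

# Sort by length first; depth only breaks ties within equal lengths
by_length_depth <- toy %>% arrange(length_mm, depth_mm)
by_length_depth$id   # "p4" "p2" "p3" "p1"

# Sort by depth first; there are no ties in depth, so length has no effect
by_depth_length <- toy %>% arrange(depth_mm, length_mm)
by_depth_length$id   # "p3" "p1" "p4" "p2"
```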



Breakout session 1 - arrange()

Exercise 1

With the penguins dataset, answer the following questions:

  • Create a new dataset called penguins_shortflippers from the penguins dataset with the 20 penguins with the shortest flippers.

  • How many penguins of each species are found in penguins_shortflippers?

  • Which islands do they come from?

Hints
  • To create penguins_shortflippers, first use arrange() to sort by flipper lengths, and pipe the results into the head() function to get the top 20.

  • To get the species and island composition, use the count() function.

Solution
  • To create a dataframe with the 20 penguins with the shortest flippers:
penguins_shortflippers <-        # assign the results
  penguins %>%                   # take the penguins data
  arrange(flipper_length_mm) %>% # sort the data by flipper length
  head(20)                       # take the top 20

  • To see the species composition in penguins_shortflippers:
penguins_shortflippers %>%
  count(species)
#> # A tibble: 2 × 2
#>   species       n
#>   <fct>     <int>
#> 1 Adelie       17
#> 2 Chinstrap     3

  • To see the island composition in penguins_shortflippers:
penguins_shortflippers %>%
  count(island)
#> # A tibble: 3 × 2
#>   island        n
#>   <fct>     <int>
#> 1 Biscoe        7
#> 2 Dream         9
#> 3 Torgersen     4


3 - Using mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are derived from existing columns. The mutate() function creates new variables, usually by manipulating existing variables.

By default, mutate() adds new columns at the end of the dataframe. When you use mutate(), you typically need to specify 3 things:

  • the name of the dataframe you want to modify
  • the name of the new column that you’ll create
  • the values to be inserted in the new column

We will be working with the penguins dataset to learn the mutate() function. We will create a new dataframe called mutate_penguins, with a new column called size.

The first argument, which we supply through the pipe, is the dataframe we’re going to modify, penguins. After that, we have the name-value pair for our new variable.

Here, the name of the new variable is size and the values are body_mass_g multiplied by flipper_length_mm:

mutate_penguins <-  # assign the results to a dataframe `mutate_penguins`
  penguins %>%      # take the penguins data
  mutate(size = body_mass_g * flipper_length_mm) # create a new column
head(mutate_penguins) %>%
  select(6:9)       # show the first rows of columns 6-9
#> # A tibble: 6 × 4
#>   body_mass_g sex     year   size
#>         <int> <fct>  <int>  <int>
#> 1        3750 male    2007 678750
#> 2        3800 female  2007 706800
#> 3        3250 female  2007 633750
#> 4          NA NA      2007     NA
#> 5        3450 female  2007 665850
#> 6        3650 male    2007 693500

You can see that we created a dataframe with a new column called size.
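
A single mutate() call can also create several columns at once, and later columns can use columns created earlier in the same call. The sketch below uses a small made-up tibble (toy, mass_kg, and size_kg are hypothetical names), and also shows the .after argument of mutate(), available in dplyr 1.0.0 and later, for placing a new column somewhere other than at the end.

```r
library(dplyr)

# A small made-up tibble (hypothetical data)
toy <- tibble(
  id         = c("p1", "p2", "p3"),
  mass_g     = c(3750, 3800, 3250),
  flipper_mm = c(181, 186, 195)
)

# Create two columns in one call; size_kg uses mass_kg created just above it
toy_new <- toy %>%
  mutate(
    mass_kg = mass_g / 1000,
    size_kg = mass_kg * flipper_mm
  )
names(toy_new)    # new columns are added at the end

# Place the new column directly after `id` instead of at the end
toy_front <- toy %>%
  mutate(mass_kg = mass_g / 1000, .after = id)
names(toy_front)
```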



Breakout session 2 - mutate()

Exercise 2

  • Create a new dataframe called penguins_bills with a new column called bill_shape by dividing bill length by bill depth.

  • What is the species composition of the 20 penguins with the largest values for bill_shape?

Hints

To get the species composition of the top 20, first use arrange() (think about the direction you need to sort in!), then head(), and then count().

Solution
  • New dataframe with a bill shape variable:
penguins_bills <-
  penguins %>%      # take the penguins data
  mutate(bill_shape = bill_length_mm / bill_depth_mm) # create a new column `bill_shape`

  • Species composition of the 20 penguins with the largest bill_shape values:
penguins_bills %>%
  arrange(-bill_shape) %>%  # sort by bill_shape in descending order
  head(20) %>%              # take the top 20
  count(species)            # create a frequency table
#> # A tibble: 1 × 2
#>   species     n
#>   <fct>   <int>
#> 1 Gentoo     20

They are all Gentoo penguins!

Exercise 3

Create a new dataframe called penguins_year:

  • with only penguins sampled after 2007,
  • with a new column called year_nr which has a year number that starts counting from 2008 (i.e., 2008 = year 1, 2009 = year 2, etc.)
  • sorted by year_nr.

Hints

Not all values you pass to mutate() need to be variables! You can subtract a fixed number from year.

Solution
penguins_year <-
  penguins %>%
  filter(year > 2007) %>%
  mutate(year_nr = year - 2007) %>%
  arrange(year_nr)

penguins_year
#> # A tibble: 234 × 9
#>    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Biscoe           39.6          17.7               186        3500
#>  2 Adelie  Biscoe           40.1          18.9               188        4300
#>  3 Adelie  Biscoe           35            17.9               190        3450
#>  4 Adelie  Biscoe           42            19.5               200        4050
#>  5 Adelie  Biscoe           34.5          18.1               187        2900
#>  6 Adelie  Biscoe           41.4          18.6               191        3700
#>  7 Adelie  Biscoe           39            17.5               186        3550
#>  8 Adelie  Biscoe           40.6          18.8               193        3800
#>  9 Adelie  Biscoe           36.5          16.6               181        2850
#> 10 Adelie  Biscoe           37.6          19.1               194        3750
#> # … with 224 more rows, and 3 more variables: sex <fct>, year <int>,
#> #   year_nr <dbl>



Jelmer Poelstra
Bioinformatician at MCIC