Code Club S02E05: Intro to the Tidyverse (Part 2)
Prep homework
Basic computer setup
- If you didn’t already do this, please follow the Code Club Computer Setup instructions, which also have pointers for those who are new to R or RStudio.
- If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll troubleshoot.
Introduction
What will we go over today
- We will continue using the dplyr package, which is part of the tidyverse and was introduced last week.
- Learn to use arrange() - orders the rows of a data frame by the values of selected columns.
- Learn to use mutate() - adds new variables and preserves existing ones.
1 - What is the dplyr package?
dplyr is one of the tidyverse packages that are designed for data science. dplyr provides functions for data manipulation.
Functions for row-wise operations include:
- filter() - chooses rows based on column values.
- slice() - chooses rows based on location.
- arrange() - orders the rows of a data frame by the values of selected columns.
Functions for column-wise operations include:
- select() - changes whether or not a column is included.
- rename() - changes the name of columns.
- mutate() - changes the values of columns and creates new columns.
- relocate() - changes the order of the columns.
Functions for groups of rows include:
- summarise() - collapses a group into a single row.
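To make the list above more concrete, here is a minimal sketch that applies each verb to a small made-up data frame. The data frame toy and its columns are hypothetical, invented only for illustration. group_by() is a dplyr function that is not covered in this session; it is used here only so that summarise() has groups to collapse.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  id     = c("a", "b", "c", "d"),
  group  = c("x", "x", "y", "y"),
  weight = c(10, 30, 20, 40)
)

# Row-wise verbs
toy %>% filter(weight > 15)   # keep rows where weight exceeds 15
toy %>% slice(1:2)            # keep the first two rows, by position
toy %>% arrange(weight)       # order rows from lightest to heaviest

# Column-wise verbs
toy %>% select(id, weight)                 # keep only these two columns
toy %>% rename(mass = weight)              # new_name = old_name
toy %>% mutate(weight_kg = weight / 1000)  # add a derived column
toy %>% relocate(weight)                   # move weight to the front

# Groups of rows: one summary row per value of `group`
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(mean_weight = mean(weight))
toy_summary
```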
Last week, we were introduced to the tidyverse and covered the %>% pipe, select(), and filter(). We saw that packages are basically R add-ons that contain additional functions or datasets we can use. Using the function install.packages(), we can install packages that are available at the Comprehensive R Archive Network, or CRAN.
For those who have not installed the tidyverse, let’s install it. We only need to do this once, so if you did this last week, you don’t need to now.
install.packages("tidyverse")
To use the dplyr package within the tidyverse, we need to load it using library().
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
#> ✔ tibble 3.1.4 ✔ dplyr 1.0.7
#> ✔ tidyr 1.1.3 ✔ stringr 1.4.0
#> ✔ readr 2.0.1 ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
2 - Using the arrange() function
We will learn how to use the arrange() function from dplyr to sort a dataframe in multiple ways. First, we will sort a dataframe by the values of a single variable, and then we will learn how to sort by more than one variable. By default, dplyr’s arrange() sorts in ascending order (lowest values first).
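Before turning to the real data, here is a minimal sketch of both kinds of sorting on a small made-up data frame. The data frame birds and its columns are hypothetical. When several variables are given, the later ones only matter for breaking ties in the earlier ones.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
birds <- data.frame(
  name   = c("p1", "p2", "p3", "p4"),
  mass   = c(3000, 2800, 3000, 2800),
  length = c(40, 38, 36, 39)
)

# Sort by one variable: lowest mass first
by_mass <- birds %>% arrange(mass)
by_mass

# Sort by two variables: mass first, then length to break ties in mass
by_mass_length <- birds %>% arrange(mass, length)
by_mass_length
```

In by_mass_length, the two birds with a mass of 2800 come first, and within each mass value the bird with the shorter length is listed first.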
Let’s get set up and grab some data so that we have some material to work with.
We will use the same palmerpenguins dataset we used last week. To get this data, we need to install the palmerpenguins package (again, no need to do this if you already did so last week):
install.packages("palmerpenguins")
Then, to use the package, we need to load it in R with the function library():
library(palmerpenguins)
The dataframe we will use today is called penguins. Let’s take a look at the structure of the data:
# look at the first 10 rows and all columns
head(penguins, 10)
#> # A tibble: 10 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen NA NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193 3450
#> 6 Adelie Torgersen 39.3 20.6 190 3650
#> 7 Adelie Torgersen 38.9 17.8 181 3625
#> 8 Adelie Torgersen 39.2 19.6 195 4675
#> 9 Adelie Torgersen 34.1 18.1 193 3475
#> 10 Adelie Torgersen 42 20.2 190 4250
#> # … with 2 more variables: sex <fct>, year <int>
# check the structure of penguins
# glimpse() is available through dplyr and works
# similarly to str()
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Okay, now we have a sense of what the penguins dataset is.
Now we want to sort the penguins dataframe by body mass to quickly find the lightest penguins and see how their body mass relates to the other variables. We will use the pipe operator %>% to feed the data to the arrange() function. We then specify the name of the variable that we want to sort the dataframe by.
In this example, we are sorting by the variable body_mass_g, so we will see the lightest penguins at the top of the dataframe:
penguins %>%             # take the penguins data
  arrange(body_mass_g)   # sort the dataframe in ascending order based on body mass
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Chinstrap Dream 46.9 16.6 192 2700
#> 2 Adelie Biscoe 36.5 16.6 181 2850
#> 3 Adelie Biscoe 36.4 17.1 184 2850
#> 4 Adelie Biscoe 34.5 18.1 187 2900
#> 5 Adelie Dream 33.1 16.1 178 2900
#> 6 Adelie Torgersen 38.6 17 188 2900
#> 7 Chinstrap Dream 43.2 16.6 187 2900
#> 8 Adelie Biscoe 37.9 18.6 193 2925
#> 9 Adelie Dream 37.5 18.9 179 2975
#> 10 Adelie Dream 37 16.9 185 3000
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
To sort in descending order instead, so that the heaviest penguins are in the first rows, we can add a - in front of the variable:
penguins %>%              # take the penguins data
  arrange(-body_mass_g)   # sort the dataframe in descending order based on body mass
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 49.2 15.2 221 6300
#> 2 Gentoo Biscoe 59.6 17 230 6050
#> 3 Gentoo Biscoe 51.1 16.3 220 6000
#> 4 Gentoo Biscoe 48.8 16.2 222 6000
#> 5 Gentoo Biscoe 45.2 16.4 223 5950
#> 6 Gentoo Biscoe 49.8 15.9 229 5950
#> 7 Gentoo Biscoe 48.4 14.6 213 5850
#> 8 Gentoo Biscoe 49.3 15.7 217 5850
#> 9 Gentoo Biscoe 55.1 16 230 5850
#> 10 Gentoo Biscoe 49.5 16.2 229 5800
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
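The minus sign only works for numeric columns. dplyr also provides the desc() helper, which gives the same ordering for numbers and additionally works for text columns. The sketch below uses a small made-up data frame (toy and its values are hypothetical) and also shows that rows with missing values end up at the bottom in both sort directions.

```r
library(dplyr)

# Hypothetical toy data with one missing body mass
toy <- data.frame(
  id      = c("a", "b", "c", "d"),
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  mass    = c(3200, NA, 2900, 4100)
)

# Two ways to sort a numeric column in descending order
minus_sorted <- toy %>% arrange(-mass)
desc_sorted  <- toy %>% arrange(desc(mass))
minus_sorted$id
desc_sorted$id

# Ascending sort: the row with the missing mass is still placed last
asc_sorted <- toy %>% arrange(mass)
asc_sorted$id

# desc() also works on a text column, where `-` would fail
species_sorted <- toy %>% arrange(desc(species))
species_sorted$species
```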
We can also sort by more than one variable, here bill length followed by bill depth, and pipe the results into the filter() function covered last week to select only penguins weighing more than 5000 g:
penguins_new <-                                # assign the results to a dataframe `penguins_new`
  penguins %>%                                 # take the penguins data
  arrange(bill_length_mm, bill_depth_mm) %>%   # sort by bill length followed by bill depth
  filter(body_mass_g > 5000)                   # keep only penguins with body mass > 5000 g
head(penguins_new, 5)                          # look at the top 5
#> # A tibble: 5 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#> <fct> <fct> <dbl> <dbl> <int> <int> <fct>
#> 1 Gentoo Biscoe 44.4 17.3 219 5250 male
#> 2 Gentoo Biscoe 44.9 13.3 213 5100 fema…
#> 3 Gentoo Biscoe 45 15.4 220 5050 male
#> 4 Gentoo Biscoe 45.1 14.5 207 5050 fema…
#> 5 Gentoo Biscoe 45.2 14.8 212 5200 fema…
#> # … with 1 more variable: year <int>
Let’s check the counts of the different species and islands in our new dataset:
penguins_new %>%
  count(species)
#> # A tibble: 1 × 2
#> species n
#> <fct> <int>
#> 1 Gentoo 61
penguins_new %>%
  count(island)
#> # A tibble: 1 × 2
#> island n
#> <fct> <int>
#> 1 Biscoe 61
You can see that we have only retained Gentoo penguins from the island of Biscoe.
Breakout session 1 - arrange()
Exercise 1
With the penguins dataset, answer the following questions:
- Create a new dataset called penguins_shortflippers from the penguins dataset with the 20 penguins with the shortest flippers.
- How many penguins of each species are found in penguins_shortflippers?
- Which islands do they come from?
Hints
- To create penguins_shortflippers, first use arrange() to sort by flipper length, and pipe the results into the head() function to get the top 20.
- To get the species and island composition, use the count() function.
Solution (click here)
- To create a dataframe with the 20 penguins with the shortest flippers:
penguins_shortflippers <-           # assign the results
  penguins %>%                      # take the penguins data
  arrange(flipper_length_mm) %>%    # sort the data by flipper length
  head(20)                          # take the top 20
- To see the species composition in penguins_shortflippers:
penguins_shortflippers %>%
  count(species)
#> # A tibble: 2 × 2
#> species n
#> <fct> <int>
#> 1 Adelie 17
#> 2 Chinstrap 3
- To see the island composition in penguins_shortflippers:
penguins_shortflippers %>%
  count(island)
#> # A tibble: 3 × 2
#> island n
#> <fct> <int>
#> 1 Biscoe 7
#> 2 Dream 9
#> 3 Torgersen 4
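As an aside, the arrange() plus head() combination can also be written with dplyr's slice_min() function (available since dplyr 1.0.0). This is only a sketch of an alternative, not the approach used in the solution above, and the data frame toy is made up for illustration. One difference to be aware of: by default, slice_min() keeps all rows that are tied at the cutoff, so it can return more rows than requested.

```r
library(dplyr)

# Hypothetical toy data with a tie in flipper length
toy <- data.frame(
  id      = c("a", "b", "c", "d", "e"),
  flipper = c(190, 172, 181, 181, 200)
)

# Approach from the solution: sort, then take the first rows
via_head <- toy %>%
  arrange(flipper) %>%
  head(3)

# Alternative: slice_min(), with ties at the cutoff dropped
via_slice <- toy %>%
  slice_min(flipper, n = 3, with_ties = FALSE)

# Default behaviour (with_ties = TRUE): asking for 2 rows returns 3 here,
# because two birds share the second-shortest flipper length
with_ties <- toy %>%
  slice_min(flipper, n = 2)
nrow(with_ties)
```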
3 - Using mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are derived from existing columns. The mutate() function creates new variables, usually by manipulating existing variables.
By default, mutate() adds new columns at the end of the dataframe. When you use mutate(), you typically need to specify three things:
- the name of the dataframe you want to modify
- the name of the new column that you’ll create
- the values to be inserted in the new column
We will be working with the penguins dataset to learn the mutate() function. We will create a new dataframe called mutate_penguins, with a new column called size.
The first argument, which we supply through the pipe, is the dataframe we’re going to modify, penguins. After that, we have the name-value pair for our new variable.
Here, the name of the new variable is size and the values are body_mass_g multiplied by flipper_length_mm:
mutate_penguins <-                                 # assign the results to a dataframe `mutate_penguins`
  penguins %>%                                     # take the penguins data
  mutate(size = body_mass_g * flipper_length_mm)   # create a new column
head(mutate_penguins) %>%
  select(6:9)                                      # show the first rows of columns 6-9
#> # A tibble: 6 × 4
#> body_mass_g sex year size
#> <int> <fct> <int> <int>
#> 1 3750 male 2007 678750
#> 2 3800 female 2007 706800
#> 3 3250 female 2007 633750
#> 4 NA NA 2007 NA
#> 5 3450 female 2007 665850
#> 6 3650 male 2007 693500
You can see that we created a dataframe with a new column called size.
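A single mutate() call can also create several columns at once, and a later column can use one created earlier in the same call. The sketch below shows this on a small made-up data frame; toy, body_mass_kg, heavy, and the 3.78 kg cutoff are all hypothetical. It also uses the .after argument (available since dplyr 1.0.0) to place the new columns somewhere other than the end.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  id          = c("a", "b"),
  body_mass_g = c(3750, 3800),
  year        = c(2007, 2008)
)

toy_new <- toy %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,   # derived from an existing column
    heavy        = body_mass_kg > 3.78,  # uses the column created just above
    .after       = body_mass_g           # put the new columns right after body_mass_g
  )

names(toy_new)
toy_new
```

Without .after, the two new columns would appear after year, which is the default placement described above.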
Breakout session 2 - mutate()
Exercise 2
- Create a new dataframe called penguins_bills with a new column called bill_shape by dividing bill length by bill depth.
- What is the species composition of the 20 penguins with the largest values for bill_shape?
Hints (click here)
To get the species composition of the top 20, first use arrange() (think about the direction you need to sort in!), then head(), and then count().
Solution (click here)
- New dataframe with a bill shape variable:
penguins_bills <-
  penguins %>%                                          # take the penguins data
  mutate(bill_shape = bill_length_mm / bill_depth_mm)   # create a new column `bill_shape`
- Species composition of the 20 penguins with the largest bill_shape values:
penguins_bills %>%
  arrange(-bill_shape) %>%   # sort by bill_shape in descending order
  head(20) %>%               # take the top 20
  count(species)             # create a frequency table
#> # A tibble: 1 × 2
#> species n
#> <fct> <int>
#> 1 Gentoo 20
They are all Gentoo penguins!
Exercise 3
Create a new dataframe called penguins_year:
- with only penguins sampled after 2007,
- with a new column called year_nr which has a year number that starts counting from 2008 (i.e., 2008 = year 1, 2009 = year 2, etc.),
- sorted by year_nr.
Hints (click here)
Not all values you pass to mutate() need to be variables! You can subtract a fixed number from year.
Solution (click here)
penguins_year <-
  penguins %>%
  filter(year > 2007) %>%
  mutate(year_nr = year - 2007) %>%
  arrange(year_nr)
penguins_year
#> # A tibble: 234 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Biscoe 39.6 17.7 186 3500
#> 2 Adelie Biscoe 40.1 18.9 188 4300
#> 3 Adelie Biscoe 35 17.9 190 3450
#> 4 Adelie Biscoe 42 19.5 200 4050
#> 5 Adelie Biscoe 34.5 18.1 187 2900
#> 6 Adelie Biscoe 41.4 18.6 191 3700
#> 7 Adelie Biscoe 39 17.5 186 3550
#> 8 Adelie Biscoe 40.6 18.8 193 3800
#> 9 Adelie Biscoe 36.5 16.6 181 2850
#> 10 Adelie Biscoe 37.6 19.1 194 3750
#> # … with 224 more rows, and 3 more variables: sex <fct>, year <int>,
#> # year_nr <dbl>