S05E01: R for Data Science (2e) - Ch. 5 - Pipes

Introducing a new season of Code Club, in which we will continue to read the book R for Data Science (R4DS), and start with a short chapter on pipes.


Intro to this Code Club Season

Organizers

  • Michael Broe – Evolution, Ecology and Organismal Biology (EEOB)
  • Jessica Cooperstone – Horticulture & Crop Science (HCS) / Food Science & Technology (FST)
  • Stephen Opiyo – Molecular & Cellular Imaging Center (MCIC) - Columbus
  • Jelmer Poelstra – Molecular & Cellular Imaging Center (MCIC) - Wooster

Code Club practicalities

  • In-person (Columbus & Wooster) and Zoom hybrid

  • Mix of instruction/discussion with the entire group, and exercises in groups of 3-4 people.

  • When doing exercises in breakout groups, we encourage you:

    • To briefly introduce yourselves and to do the exercises as a group
    • On Zoom, to turn your cameras on and to have someone share their screen (use the Ask for help button in Zoom to get help from an organizer)
    • To let a less experienced person do the screen sharing and coding
  • You can ask a question at any time, by speaking or typing in the Zoom chat.

  • You can come up to 15 minutes early or stay late for troubleshooting, or perhaps to ask a question related to your research.

More general notes:

  • If you can, read or skim the relevant (part of the) chapter before each session, especially if you’re very new to the material. But we’ll always try to present it in a way that does not assume you’ve read it.

  • We try to make each session as stand-alone as possible, and don’t require you to know anything. That said, if you missed one or more sessions, you’ll get more out of the next ones if you try to catch up with the material.

  • We record the whole-group parts of the Zoom call, and share the recordings only with Code Club participants.

Before moving on to the chapter on pipes, I will start with a very brief overview of the book, the RStudio interface, and how to load R packages.



R for Data Science (R4DS)

This excellent book by Hadley Wickham (also author of many of the R packages used in the book!) and Garrett Grolemund is freely available online.

The book focuses on the so-called "tidyverse" ecosystem in R. The tidyverse can be seen as a modern dialect of R. Most of its functionality is also contained in “base R” (that which comes shipped with R by default), but it has an improved and more consistent programming interface or “syntax”.

Last year in Code Club, we worked through the material of a number of chapters of the first edition of the book, which was published in 2016.

Since 2016, R has seen quite a bit of development. A second edition has been online for a couple of months, with completely updated and restructured contents – we think it has improved a lot!

This new edition is not completely finished yet, so you’ll find a notification about that at the top of each chapter.

We decided not to restart at the beginning of the book for this semester. We hope this won’t make it too challenging for beginners to join us. Especially in the first sessions, we’ll make sure to explain all code, including things that were covered last semester.

What’s in the book

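The introductory chapter of the book has a figure that shows the data science process and what the book will cover: importing, tidying, transforming, visualizing, and modeling data, and communicating the results, with programming as the tool that ties these steps together.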

In terms of what the book does not cover, it may especially be surprising for a book about data science that it contains very little material on statistics (even less so in the second edition, now that there is a companion book “Tidy Modeling with R” on that topic).



Getting Up and Running

RStudio interface

R itself simply provides a “console” (command-line interface) where you can type your commands. RStudio, on the other hand, allows you to see the R console side-by-side with your scripts, plots, and more.

Once you have a running instance of RStudio, create a new R script by clicking File > New File > R Script. Now, you should see all 4 “panes” that the RStudio window is divided into:

  • Top-left: The Editor for your scripts and other documents (hidden when no file is open)
  • Bottom-left: The R Console to interactively run your code (+ other tabs)
  • Top-right: Your Environment with R objects you have created (+ other tabs)
  • Bottom-right: Tabs for Files, Plots, Help, and others


Your turn: Check your R version

Take a look at your version of R: this was printed in the console when you started RStudio (see the RStudio screenshot above).

At the time of writing, the most recent version of R is 4.2.2. To use all current functionality of the “base R pipe”, including the _ placeholder, you’ll need at least version 4.2.0, and to use the base R pipe at all, you need at least R version 4.1.

If you have a lower version of R, I would recommend that you update it at the end of, or after, this session by following these instructions.
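
If the startup message has scrolled out of view, you can also check the version with code. This is a minimal sketch that uses only base R; the object names has_pipe and has_placeholder are made up for this illustration:

```r
# Print the version of R that is currently running
R.version.string

# Is the base R pipe available (R 4.1.0 or higher)?
has_pipe <- getRversion() >= "4.1.0"
has_pipe

# Is the '_' placeholder available (R 4.2.0 or higher)?
has_placeholder <- getRversion() >= "4.2.0"
has_placeholder
```

If either of the last two commands prints FALSE, updating R is worthwhile before trying the corresponding examples.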


R packages

You can think of packages as “add-ons” / “extensions” to base R functionality.

Installation versus loading

To be able to use them, packages have to be installed (usually from within R, using R code). Once you have done this, you don’t need to redo it until you switch to a different version of R.

Unlike installation, loading a package is necessary again and again, in every R session that you want to use it.
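
As a rough sketch of the difference, the code below first checks whether the tidyverse is installed and only then loads it. The object name is_installed is just for illustration, and the installation line is commented out so that it is not re-run by accident:

```r
# Install once per R version (uncomment to run):
# install.packages("tidyverse")

# Check whether the package is installed, without attaching it
is_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (is_installed) {
  # Load (attach) the package: this is needed in every new R session
  library(tidyverse)
} else {
  message("The tidyverse is not installed yet: run install.packages(\"tidyverse\")")
}
```

In day-to-day use, you would typically just put the library() line at the top of your script.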

The tidyverse

The tidyverse is unusual in that it is a collection of packages that can still be installed and loaded with a single command. The individual core tidyverse packages are the focus of several chapters in the book, for instance:

Package         Functionality                                     Main chapter
ggplot2         Creating plots                                    Ch. 2
tidyr & dplyr   Manipulating dataframes                           Ch. 4 & 6
readr           Reading in data                                   Ch. 8
stringr         Working with “strings” (text)                     Ch. 16
forcats         Working with “factors” (categorical variables)    Ch. 18
purrr           Iteration with functions                          Ch. 28

Your turn: Load the tidyverse

To check if you can load the tidyverse, run library(tidyverse) in the console. You should get a message that lists the core tidyverse packages being attached, along with any conflicts between function names.

If instead, you got something like…

#> Error in library(tidyverse) : there is no package called ‘tidyverse’

…then you still need to install it (install.packages("tidyverse")).

The diamonds dataframe

In R, we work a lot with “dataframes”, rectangular data structures like spreadsheets – and in particular, the R4DS book and the tidyverse focus on this very heavily.

Today we’ll see some examples of using the pipe with the diamonds dataframe, which is automatically loaded along with the tidyverse. It contains information on almost 54,000 diamonds (one diamond per row):

# Simply typing the dataframe's name in the console will print the first rows:
diamonds
#> # A tibble: 53,940 × 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # … with 53,930 more rows

(If you get Error: object 'diamonds' not found, then the tidyverse isn’t loaded. Use library(tidyverse) to do so.)
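
To get a feel for what a dataframe is, it can help to build a tiny one by hand. The sketch below uses only base R, and the object gems and all its values are invented for illustration (they are not taken from diamonds):

```r
# A small dataframe with 3 rows (observations) and 3 columns (variables)
gems <- data.frame(
  carat = c(0.5, 1.0, 1.5),
  cut   = c("Good", "Ideal", "Fair"),
  price = c(1200, 5400, 9800)
)

# Print the whole dataframe
gems

# Number of rows and columns
dim(gems)
#> [1] 3 3

# Pull out a single column as a vector with '$'
gems$price
#> [1] 1200 5400 9800
```

Just like in diamonds, every row is one observation and every column is one variable of a single type.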



Chapter 5: Pipes

What is a pipe?

A pipe is a programming tool that takes the output of one command (in R, a function), and passes it on to be used as the input for another command.

Pipes prevent you from having to save intermediate output to a file or object. They also make your code shorter and easier to understand.

To give a very minimal example – without a pipe, we can print the number of rows in the diamonds dataframe as follows:

# Use the `nrow()` function with `diamonds` as the sole argument:
nrow(diamonds)
#> [1] 53940

Instead, we could also take the diamonds dataframe, and then pipe (|>) it into the nrow() function:

diamonds |> nrow()
#> [1] 53940

Notice above that we no longer type the input argument to nrow() inside the parentheses: nrow() recognizes that data came in through the pipe.
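
One way to convince yourself that the two forms do the same thing is to compare them directly. This sketch uses the built-in mtcars dataframe instead of diamonds, so that it runs without the tidyverse (it does need R 4.1 or higher); the object names are just for illustration:

```r
# Without and with the pipe
n_plain <- nrow(mtcars)
n_piped <- mtcars |> nrow()

# Both give the same result
identical(n_plain, n_piped)
#> [1] TRUE

# R rewrites the piped code into a regular function call before running it:
quote(mtcars |> nrow())
#> nrow(mtcars)
```

So the pipe is simply a different way of writing the same function call.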

A more practical example

Let’s say we want to subset the diamonds dataframe to only show the columns color, depth, and price for diamonds with a depth smaller than 50. Without using pipes, we could start by selecting the columns of interest with the select() function, and saving the output in a new dataframe called diamonds_simple:

# The first argument is the input dataframe, the others are the columns we want
diamonds_simple <- select(diamonds, color, depth, price)

Next, we can use the filter() function on diamonds_simple to only return the diamonds (rows) that we want:

# The first argument is the input dataframe, the next is an expression to filter by
filter(diamonds_simple, depth < 50)
#> # A tibble: 3 × 3
#>   color depth price
#>   <ord> <dbl> <int>
#> 1 G        43  3634
#> 2 G        44  4032
#> 3 J        43  4778

But using the pipe, we can do this more elegantly, and without wasting computer memory on an intermediate object:

diamonds |>                       # Take 'diamonds' and push it through the pipe
  select(color, depth, price) |>  # No input is specified, and the output is piped
  filter(depth < 50)              # Again, no input is specified
#> # A tibble: 3 × 3
#>   color depth price
#>   <ord> <dbl> <int>
#> 1 G        43  3634
#> 2 G        44  4032
#> 3 J        43  4778

We took the diamonds dataset and piped it into the select() function, and then we piped the output of select() into the filter() function. Using the pipe before select() is not necessary and adds a line, but also makes it even easier to see what’s being done!

Like in the earlier example, when we use the pipe, we don’t type the corresponding input argument in the receiving function: it knows to use the piped data. This is not completely “automagical” and foolproof though: what actually happens is that the piped data becomes the first argument to the receiving function.

If you ever need to use the pipe with a function where the piped data is not the first argument, see the Bonus section below about using the _ placeholder.
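
Besides saving an intermediate object, another way to avoid the pipe is to nest one function call inside the other, which then has to be read from the inside out. The sketch below contrasts the two styles. As stand-ins for filter() and diamonds, it uses base R's subset() function and the built-in mtcars dataframe, so that it runs without any packages:

```r
# Nested: subset() runs first, even though nrow() is written first
n_nested <- nrow(subset(mtcars, mpg > 30))

# Piped: the steps are written in the order in which they happen.
# In each step, the piped data becomes the first argument of the function.
n_piped <- mtcars |>
  subset(mpg > 30) |>
  nrow()

identical(n_nested, n_piped)
#> [1] TRUE
```

With only two steps the nested version is still readable, but the piped version stays readable as more steps are added.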

Two Unix & R examples

Pipes originate in Unix terminals, and are ubiquitous there. So for those of you who are curious, I’ve included two examples of using the Unix pipe, and the corresponding commands in R, below.

(If you’re trying to follow along yourself: the Unix/terminal examples will only work natively on Mac and Linux, where you can simply click the Terminal tab in the bottom-left RStudio panel next to Console, and issue Unix commands.)

Counting files

You might want to count the number of files in a folder, which involves two distinct steps: obtaining a list of files, and counting them.

Using Unix commands, we can get a list of files in the current folder with ls, perform the counting with wc -l (“word count”, where the -l option makes it count lines), and connect these processes with the pipe |:

# The output of 'ls' is piped (with '|') to 'wc -l':
ls | wc -l


#> 4

So, there happen to be 4 files in the folder this code is run in. We can do the same in R, where the function dir() lists files, while the function length() counts the number of elements:

dir() |> length()
#> [1] 4


Counting word frequencies

As another example, let’s say we have a file words.txt that contains one word per line:

table
chair
desk
chair
desk
table
chair

In a terminal, we can get a list of unique words and their number of occurrences using:

# 'cat' prints the contents of the file
# 'sort' sorts alphabetically
# 'uniq -c' counts the number of occurrences for each entry 
cat words.txt | sort | uniq -c


#>       3 chair
#>       2 desk
#>       2 table

And to do the same thing in R:

# 'readLines()' reads the contents of a file into R
# 'table()' counts the number of occurrences for each entry
readLines("words.txt") |> table()
#> 
#> chair  desk table 
#>     3     2     2
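
If you don't have a words.txt file at hand, the following sketch first creates one in a temporary location, so that the R example can be run as-is. The object names words, word_file, and counts are only for illustration, and the extra sort() step (which is not part of the example above) puts the most frequent word first:

```r
# Write the example words to a temporary file, one word per line
words <- c("table", "chair", "desk", "chair", "desk", "table", "chair")
word_file <- tempfile(fileext = ".txt")
writeLines(words, word_file)

# Read the file, count the occurrences of each word, and sort by frequency
counts <- readLines(word_file) |>
  table() |>
  sort(decreasing = TRUE)
counts
```

This mirrors a three-step Unix pipeline in which each command hands its output to the next.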

The %>% pipe and a keyboard shortcut

Those of you who’ve worked with R for a bit are likely familiar with another pipe operator: %>%.

This pipe is loaded as part of the tidyverse, and until recently was very widely used, including in the previous edition of R4DS. There has been a gradual switch to the base R pipe since that was introduced in May 2021, mainly because it does not rely on a package. In addition, it’s convenient that the base R pipe |> is more similar to the Unix pipe |, and is one fewer character to type than %>%.

The number of characters shouldn’t make much of a difference, though, because it remains even quicker to use the RStudio keyboard shortcut for the pipe, which is Ctrl+Shift+M (Cmd+Shift+M on a Mac).

There are some differences in the behavior of the |> and %>% pipes in more advanced use cases, which the book chapter goes into (check that out if you have used %>% a lot).
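
For simple pipelines, the two pipes are interchangeable, as the sketch below shows. It only runs the %>% part if the magrittr package is available (this package provides %>% and is installed along with the tidyverse); the object names are made up for illustration:

```r
x <- c(4, 9, 16)

# Base R pipe: take the square roots, then sum them (2 + 3 + 4)
total_native <- x |> sqrt() |> sum()
total_native
#> [1] 9

# The same pipeline with the magrittr pipe, if that package is installed
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  total_magrittr <- x %>% sqrt() %>% sum()
  print(identical(total_native, total_magrittr))
}
```

One small difference that you may run into: %>% lets you leave out the parentheses after a function name (x %>% sqrt), whereas |> always requires them (x |> sqrt()).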

Your turn: Set the |> pipe as default

To make that keyboard shortcut map to the base R pipe (instead of to %>%), go to Tools in the top menu bar, click Global Options, click Code in the left menu, and check the box Use native pipe operator, |> (requires R 4.1+).


Your turn: Use the pipe

With one single “pipeline” (operations connected by a pipe |>), manipulate the diamonds dataframe such that you:

  • Print only the columns carat, cut, depth, and price
  • … for diamonds (rows) with a price of more than $1,000.

Bonus: How many diamonds cost more than $1,000? And could you get this number directly, by expanding your “pipeline”?

Hints

This is quite similar to the example given above: use the select() function to select certain columns, and the filter() function to select certain rows.

To answer the bonus question: each diamond is on one row, so you are counting rows. And to answer it by expanding your pipeline, recall from the very first pipe example that the nrow() function will print the number of rows.


Solution
diamonds |>
  select(carat, cut, depth, price) |>
  filter(price > 1000)
#> # A tibble: 39,416 × 4
#>    carat cut       depth price
#>    <dbl> <ord>     <dbl> <int>
#>  1  0.7  Ideal      62.5  2757
#>  2  0.86 Fair       55.1  2757
#>  3  0.7  Ideal      61.6  2757
#>  4  0.71 Very Good  62.4  2759
#>  5  0.78 Very Good  63.8  2759
#>  6  0.7  Good       57.5  2759
#>  7  0.7  Good       59.4  2759
#>  8  0.96 Fair       66.3  2759
#>  9  0.73 Very Good  61.6  2760
#> 10  0.8  Premium    61.5  2760
#> # … with 39,406 more rows

10 rows are printed to screen, and it says … with 39,406 more rows at the bottom: therefore, there are 39,416 diamonds that cost more than $1,000.

You can also get this number with code – for instance by adding the nrow() function to the pipeline, which will count the number of rows (excluding the header line) in a dataframe:

diamonds |>
  select(carat, cut, depth, price) |>
  filter(price > 1000) |> 
  nrow()
#> [1] 39416


Bonus: Using the _ placeholder

By default, the R pipe passes its contents to the first argument of a function. What if we need our piped data to go to another argument than the function’s first one?

Let’s see an example with the gsub() function, which can be used to replace characters in text strings as follows:

# This will replace 'N's with '-' in the string 'ACCGNNT': 
gsub(pattern = "N", replacement = "-", x = "ACCGNNT")
#> [1] "ACCG--T"

(For clarity, I named gsub()’s arguments above. Without naming the arguments, it would be: gsub("N", "-", "ACCGNNT").)

As you could see above, what we would usually think of as the input data, the string passed to the argument x, is not the first but the third argument to gsub().

To make the pipe work with gsub(), use an underscore (_) as a placeholder that indicates where the piped data goes:

"ACCGNNT" |> gsub(pattern = "N", replacement = "-", x = _)
#> [1] "ACCG--T"

As an aside, if you’re wondering how you’d know a function’s argument order, watch the pop-up box when you type a function’s name and the opening parenthesis (see the screenshot below), or check the help e.g. by typing ?gsub in the Console.


Above, I mentioned that the pipe passes its contents to the first argument of a function. But to be more precise, the pipe passes the object to the first argument that you didn’t mention by name. Therefore, the following also works:

# The piped data is being passed to the 3rd argument, 'x',
# which is the first of the function's arguments that we don't refer to below: 
"ACCGNNT" |> gsub(pattern = "N", replacement = "-")
#> [1] "ACCG--T"

Additionally, when you do use the _ placeholder, make sure you always name the argument that you pass it to:

# Using '_' without the argument name ('x=') doesn't work:
"ACCGNNT" |> gsub(pattern = "N", replacement = "-", _)

#> Error: pipe placeholder can only be used as a named argument
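
A common situation where the placeholder is useful is model fitting, since functions like lm() take the model formula as the first argument and the data as the second. Below is a minimal sketch with the built-in mtcars dataframe (it requires R 4.2 or higher); the model itself is arbitrary and only meant as an illustration:

```r
# The piped dataframe is passed to lm()'s 'data' argument
fit <- mtcars |> lm(mpg ~ wt, data = _)
coef(fit)

# In R 4.1, which lacks the placeholder, an anonymous function is one alternative:
fit_alt <- mtcars |> (\(d) lm(mpg ~ wt, data = d))()

# Both approaches fit the same model
all.equal(coef(fit), coef(fit_alt))
#> [1] TRUE
```

Without the placeholder, mtcars would be passed as the first unnamed argument, which for lm() is the formula, and that would result in an error.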




Jelmer Poelstra
Bioinformatician at MCIC