Session 1: Backyard Birds

Starting with RStudio Projects and reading in our data

Red-breasted Nuthatch, Sitta canadensis





Prep homework

Basic computer setup

If you didn’t already do this, please follow the Code Club Computer Setup instructions.

Test if it works

Please open RStudio locally or start an OSC RStudio Server session.

Nov 19 addition: If you’re working locally, test if you can load the tidyverse package with library("tidyverse") inside R. (If you haven’t installed the tidyverse yet, please go to the Code Club Computer Setup instructions.)

If you have not used RStudio before, take a moment to explore what’s in the panels and tabs. (It may help to check out Mike Sovic’s 1-minute intro to the RStudio interface or RStudio’s 3-minute intro.)

If you’re able to do so, please open RStudio again a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll troubleshoot.

New to R?

If you’re completely new to R, it will be useful to have a look at some of the resources listed on our New to R? page prior to Code Club.



Slides

On Friday, we started with a couple of introductory slides.




1 - Create an RStudio Project

Projects are an RStudio-specific concept that create a special file (.Rproj), primarily to designate a directory as the working directory for everything within it. We recommend creating exactly one separate Project for each research project with an R component – and for things like Code Club.


Why use Projects?

In brief, Projects help you to organize your work and to make it more portable.

  • They record which scripts (and R Markdown files) are open in RStudio, and will reopen all of those when you reopen the project. This becomes quite handy, say, when you work on three different projects, each of which uses a number of scripts.

  • When using Projects, you generally don’t have to manually set your working directory, and can use relative file paths to refer to files within the project. This way, even if you move the project directory, or copy it to a different computer, the same paths will still work. (This would not be the case if you used setwd() which will generally require you to use an absolute path, e.g. setwd("C:/Users/Jelmer/Documents/").)

  • Projects encourage you to organize research projects inside self-contained directories, rather than with files spread around your computer. This can save you a lot of headaches and increases reproducibility. And because R will restart whenever you switch Projects, there is no risk of unwanted cross-talk between your projects.


Let’s create an RStudio Project for Code Club:

  • Open RStudio locally or start an OSC RStudio Server session.
    (If you’re at OSC, you should see a file 0_CODECLUB.md that’s open in your top-left panel. You can ignore/close this file.)

  • If you’re working locally, create a directory wherever you like on your computer for all things Code Club. You can do this in R using dir.create("path/to/your/dir"), or outside of R.
    (If you’re at OSC, skip this step because you’re automatically inside a Code Club-specific, personal directory.)

  • Click File (top menu bar) > New Project, and then select Existing Directory.

    • If you’re working locally, select the Code Club directory that you created in the previous step.

    • If you’re working at OSC, keep the default choice “~” (i.e., home), which is the directory you started in when entering the RStudio Server session.

  • After RStudio automatically reloads, you should see the file ending in .Rproj in the RStudio Files tab in the lower right pane, and you will have the Project open. All done for now!

(For future Code Club sessions: RStudio will by default reopen the most recently used Project, and therefore, OSC users will have the Project automatically opened. If you’re working locally and are also using other Projects, you can open this Project with File > Open Project inside RStudio, or by clicking the .Rproj file in your file browser, which will open RStudio and the Project.)



2 - Orienting ourselves

Where are we?

We don’t need to set our working directory, because our newly created Project is open, and therefore, our working directory is the directory that contains the .Rproj file.

To see where you are, type or copy into the console (bottom left):

# Print the working directory:
getwd()

# List the files in your current directory:
dir()          # This should print at least the `.RProj` file.

Create directories

Create two new directories – one for this session, and one for a dataset that we will download shortly (and will be reusing across sessions):

# Dir for Code Club Session 1:
dir.create("S01")

# Dir for our bird data:
# ("recursive" to create two levels at once.)
dir.create("data/birds/", recursive = TRUE)

Create a script

To keep a record of what we are doing, and to easily modify and rerun earlier commands, we’ll want to save our commands in a script and execute them from there, rather than typing our commands directly in the console.

  • Click File (top menu bar) > New File > R script.

  • Save the script (File > Save) as S01.R inside your S01 directory.

First line of the script

We will now load the core set of 8 tidyverse packages all at once. To do so, type/copy the command below on the first line of the script, and then execute it by clicking Run (top right of script pane) or by pressing Ctrl Enter (Windows/Linux, this should also work in your browser) or ⌘ Enter (Mac).

# If you're working locally, and did not install it yet:
# install.packages("tidyverse")

# Load the tidyverse (meta)package:
library(tidyverse)

#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

#>  ggplot2 3.3.2      purrr   0.3.4
#>  tibble  3.0.4      dplyr   1.0.2
#>  tidyr   1.1.2      stringr 1.4.0
#>  readr   1.3.1      forcats 0.5.0

#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()

If this worked, you should get the same output as shown in the code block above: it attached 8 packages, and it warns that some of its functions are now “masking” base R functions.

The tidyverse is a very popular and useful ecosystem of R packages for data analysis, which we will be using a lot in Code Club.

When we refer to “base R” as opposed to the tidyverse, we mean functions that are loaded in R by default (without loading a package), and that can perform similar operations in a different way.



3 - Getting our dataset

We downloaded a Great Backyard Bird Count (GBBC) dataset from the Global Biodiversity Information Facility (GBIF). Because the file was 3.1 GB large, we selected only the records from Ohio and removed some uninformative columns. We also added columns with English names and the breeding range for each species. We’ll download the resulting much smaller file (41.5 MB) from our Github repo.

The Great Backyard Bird Count

The GBBC is an annual citizen science event where everyone is encouraged to to identify and count birds in their backyard – or anywhere else – for at least 15 minutes, and report their sightings online. Since 2013, it is a global event, but it has been organized in the US and Canada since 1998.

Download the data

Let’s download the dataset using the download.file() function:

# The URL to our file:
birds_file_url <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/birds/backyard-birds_Ohio.tsv"

# The path to the file we want to download to:
birds_file <- "data/birds/backyard-birds_Ohio.tsv"

# Download:
download.file(url = birds_file_url, destfile = birds_file)

Read the data

Now, let’s read the file into R. The .tsv extension (“tab-separated values”) tells us this is a plain text file in which columns are separated by tabs, so we will use a convenience function from the readr package (which is loaded as part of the core set tidyverse packages) for exactly this type of file:

# Read the data:
birds <- read_tsv(file = birds_file)

#> Parsed with column specification:
#> cols(
#>   class = col_character(),
#>   order = col_character(),
#>   family = col_character(),
#>   genus = col_character(),
#>   species = col_character(),
#>   locality = col_character(),
#>   stateProvince = col_character(),
#>   decimalLatitude = col_double(),
#>   decimalLongitude = col_double(),
#>   eventDate = col_datetime(format = ""),
#>   species_en = col_character(),
#>   range = col_character()
#> )

Done! We have now read our data into a tibble, which is a type of data frame (formally a data.frame): R’s object class to deal with tabular data wherein each column can contain a different type of data (numeric, characters/strings, etc).



4 - Exploring backyard birds

Exercise 1

What’s in the dataset?

  • Explore the dataset using some functions and methods you may know to get a quick overview of data(frames), and try to understand what you see. What does a single row represent, and what is in each column? (Be sure to check out the hints below at some point, especially if you’re stuck.)

  • Pay attention to the data types (e.g., “character” or chr) of the different columns, which several of these functions print. The output of our read_tsv() command also printed this information – this function parsed our columns as the types we see now. Were all the columns parsed correctly?

  • How many rows and how many columns does the dataset have?

  • What are some questions you would like to explore with this dataset? We’ll collect some of these and try to answer them in later sessions. If your group has sufficient R skills already, you are also welcome to go ahead and try to answer one or more of these questions.

Hints (click here)
# Type an object's name to print it to screen:
birds
# Same as above, but explicitly calling print():
print(birds)   

# For column-wise information (short for "structure"):
str(birds)
# tidyverse version of str():
glimpse(birds)

# In RStudio, open object in a separate tab:
View(birds)
  • Note that in R, dbl (for “double”) and num (for “numeric”) are both used, and almost interchangeably so, for floating point numbers. (Integers are a separate type that are simply called “integers” and abbreviated as int, but we have no integer columns in this dataset.)

  • read_tsv() parsed our date as a “date-time” (dttm or POSIXct for short), which contains both a date and a time. In our case, it looks like the time is always “00:00:00” and thus doesn’t provide any information.


Solutions (click here)
# Just printing the glimpse() output,
# which will show the number of rows and columns:
glimpse(birds)

#> Rows: 311,441
#> Columns: 12
#> $ class            <chr> "Aves", "Aves", "Aves", "Aves", "Aves", "Aves", "Ave…
#> $ order            <chr> "Passeriformes", "Passeriformes", "Passeriformes", "…
#> $ family           <chr> "Corvidae", "Corvidae", "Corvidae", "Corvidae", "Cor…
#> $ genus            <chr> "Cyanocitta", "Cyanocitta", "Cyanocitta", "Cyanocitt…
#> $ species          <chr> "Cyanocitta cristata", "Cyanocitta cristata", "Cyano…
#> $ locality         <chr> "44805 Ashland", "45244 Cincinnati", "44132 Euclid",…
#> $ stateProvince    <chr> "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohio", "Ohi…
#> $ decimalLatitude  <dbl> 40.86166, 39.10666, 41.60768, 39.24236, 39.28207, 41…
#> $ decimalLongitude <dbl> -82.31558, -84.32972, -81.50085, -84.35545, -84.4688…
#> $ eventDate        <dttm> 2007-02-16, 2007-02-17, 2007-02-17, 2007-02-19, 200…
#> $ species_en       <chr> "Blue Jay", "Blue Jay", "Blue Jay", "Blue Jay", "Blu…
#> $ range            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# You can also check the number of rows and columns directly using:
dim(birds)          # Will return the number of rows and columns

#> [1] 311441     12


nrow(birds)         # Will return the number of rows

#> [1] 311441

ncol(birds)         # Will return the number of columns

#> [1] 12





Bonus material

If your breakout group is done with Exercise 1, you can have a look at the bonus material below which includes another exercise. You can also have a look at this as homework. Or not at all!


readr options for challenging files

Earlier, we successfully read in our file without specifying any arguments other than the file name to the read_tsv() function, i.e. with all the default options. It is not always this easy!

Some options for more complex cases:

  • The more general counterpart of this function is read_delim(), which allows you to specify the delimiter using the sep argument, e.g. delim="\t" for tabs.

  • There are also arguments to these functions for when you need to skip lines, when you don’t have column headers, when you need to specify the column types of some or all the columns, and so forth – see this example:

    my_df <- read_delim(
      file = "file.txt",
      delim = "\t",             # Specify tab as delimiter
      col_names = FALSE,        # First line is not a header
      skip = 3,                 # Skip the first three lines
      comment = "#",            # Skip any line beginning with a "#"
      col_types = cols(         # Specify column types
        col1 = col_character(), # ..We only need to specify columns for 
        col2 = col_double()     # ..which we need non-automatic typing
        )
      )
    

Exercise 2 (Optional)

Read this file!

Try to read the following file into R, which is a modified and much smaller version of the bird dataset.

Make the function parse the “order” column as a factor, and the “year”, “month”, and “day” columns as whatever you think is sensible.

# Download and read the file:
birds2_file_url <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/birds/backyard-birds_read-challenge.txt"
birds2_file <- "data/birds/backyard-birds_read-challenge.txt"
download.file(url = birds2_file_url, destfile = birds2_file)
# Your turn!
birds2 <- read_    # Complete the command
Hints (click here)
  • The file is saved as .txt, so the delimiter is not obvious – first have a look at it (open it in RStudio, a text editor, or the terminal) to determine the delimiter. Then, use read_delim() with manual specification of the delimiter using the delim argument, or use a specialized convenience function.

  • Besides a leading line with no data, there is another problematic line further down. You will need both the skip and comment arguments to circumvent these.

  • Note that readr erroneously parses month as a character column if you don’t manually specify its type.

  • Note that you can also use a succinct column type specification like col_types = "fc", which would parse, for a two-column file, the first column as a factor and the second as a character – type e.g. ?read_tsv for details.


Bare solution (click here)
# With succint column type specification:
birds2 <- read_csv(
  file = birds2_file,
  skip = 1,
  comment = "$",
  col_types = "fcdiii"
  )

# With long column type specification:
birds2 <- read_csv(
  file = birds2_file,
  skip = 1,
  comment = "$",
  col_types = cols(
    order = col_factor(),
    year =  col_integer(),
    month = col_integer(),
    day = col_integer()
    )
  )

Solution with explanations (click here)
# With succinct column type specification:
birds2 <- read_csv(     # `read_csv()`: file is comma-delimited
  file = birds2_file,
  skip = 1,             # First line is not part of the dataframe
  comment = "$",        # Line 228 is a comment that starts with `$`
  col_types = "fcdiii"  # "f" for factor, "c" for character,
  )                     # .."d" for double (=numeric),
                        # .."i" for integer.

# With long column type specification:
birds2 <- read_csv(
  file = birds2_file,
  skip = 1,
  comment = "$",
  col_types = cols(        # We can omit columns for which we
    order = col_factor(),  # ..accept the automatic parsing,
    year =  col_integer(), # ..when using the long specification. 
    month = col_integer(),
    day = col_integer()
    )
  )


Other options for reading tabular data

There are also functions in base R that read tabular data, such as read.table() and read.delim().

These are generally slower than the readr functions, and have less sensible default options to their arguments. Particularly relevant is how columns with characters (strings) are parsed – until R 4.0, which was released earlier this year, base R’s default behavior was to parse them as factors, and this is generally not desirable1. readr functions will never convert columns with strings to factors.

If speed is important, such as when reading in very large files (~ 100s of MBs or larger), you should consider using the fread() function from the data.table package.

Finally, some examples of reading other types of files:

  • Read excel files directly using the readxl package.
  • Read Google Sheets directly from the web using the googlesheets4 package.
  • Read non-tabular data using the base R readLines() function.






  1. You can check which version of R you are running by typing sessionInfo(). You can also check directly how strings are read by default with default.stringsAsFactors(). To avoid conversion to factors, specify stringsAsFactors = FALSE in your read.table() / read.delim() function call. ↩︎


Jelmer Poelstra
Jelmer Poelstra
Bioinformatician at MCIC