Code Club S02E02: An introduction to R (Part 2)


Learning objectives

  • Create objects in R
  • Recognize and use R functions
  • Differentiate between some common object classes and data structures in R
  • Read in data from a file
  • Install and load R packages


1 – Intro

Nearly everything you do in R will involve objects, functions, or (often) both. In this session, we’ll take a quick look at each of these fundamental components for working in R. In addition, we’ll get introduced to R packages (since they’ll provide many of the functions you’ll use), and also practice reading some data in to R.

2 – Objects

Objects are things in R to which a name can be assigned. They’re created using the assignment operator “<-”, which can be thought of as an arrow (it’s actually two separate characters - the less than symbol and dash) that points whatever is on the right side to a name provided on the left side. For example, running the following three lines of code creates objects named “x”, “y”, and “z”, respectively…

x <- 3 + 3
y <- TRUE
z <- "cat"

If you run these lines in RStudio, you’ll see the resulting objects listed in the top right panel. This is really helpful for keeping track of object names, as you’ll often create many objects during an R session. Calling the objects returns their values…

x

#> [1] 6

y

#> [1] TRUE

z

#> [1] "cat"

3 – Functions

We’ll return to objects shortly, but first, let’s take a very basic look at functions, which make up a second really important part of R. You can think of objects as being things in R, while functions do things in R. I’ll start by writing a very simple function…

#define the function
cubed_plus10 <- function(number) {
  number^3+10
}

#apply the function
cubed_plus10(4)

#> [1] 74

No need to get caught up in details of writing the function right now. A couple important things to recognize at this point…

  1. We created a function that took some input - the ‘4’ in the example above, did something to it, and returned some output.
  2. To run the function, we used its name, followed by a set of parentheses. All R functions have that general structure. The parentheses might contain 0, 1, or more items (often referred to as options or arguments).

It’s useful to be able to create your own functions. But if that sounds a little advanced to you at this point, you can still do a lot with R even without knowing how to write your own, as there are lots that have already been written for you. A number of commonly-used functions are available as soon as you start an R session - these are often referred to as “base R” functions.

#some example base R functions
date()

#> [1] "Tue Aug 31 10:24:39 2021"

getwd()

#> [1] "/Users/sovic.1/Desktop/docs/Desktop_clear_4-28-21/S18_dev/biodash.github.io/content/codeclub/S02E02_r-intro_part2"

sqrt(25)

#> [1] 5

4 – Object Classes and Data Structures

Now that we have at least a basic idea of R functions, we’ll turn attention back to objects (and use functions along the way from here on out). The objects we’ve created so far have been pretty simple. Let’s revisit the three from above…

x

#> [1] 6

y

#> [1] TRUE

z

#> [1] "cat"

There are lots of different kinds, or classes of objects in R, and behind the scenes, each object that’s created is immediately assigned to a class (or possibly multiple classes). There’s no real limit to the number of classes that exist, as new ones can be created for specialized cases at any time. But there are a fairly small number of object classes you’ll encounter a lot in R, so we’ll take a look at some of those now. Let’s use the class() function to figure out what class R assigned each of our objects to…

class(x)

#> [1] "numeric"


class(y)

#> [1] "logical"


class(z)

#> [1] "character"

Some very common object classes you’ll encounter…

  • integer
  • double (numeric)
  • logical
  • character
  • factor

I’m going to introduce a new term here that’s closely tied to object classes, and that’s data structures. Again, there are a small number of data structures you’ll work with a lot in R. They include…

  • Vectors
  • Matrices
  • Data Frames
  • Lists
  • Arrays

Today we’ll focus in on two of these - vectors and data frames.

5 – Vectors

Vectors in R share a couple important characteristics…

  1. They’re one-dimensional. In other words, they can be defined by a length property, with length zero, one, or more.
  2. All elements of a vector must be of the same type, or class.
  3. Operations can be performed on vectors - vector recycling rules apply (we’ll see this in the breakout exercises next).

The c() function is useful for creating vectors in R. It stands for “combine”, and allows you to combine multiple items into a single vector object…

odds <- c(1,3,5,7,9)
animals <- c("dog", "cat", "cow")

#view the objects
odds

#> [1] 1 3 5 7 9

animals

#> [1] "dog" "cat" "cow"


#check their class
class(odds)

#> [1] "numeric"

class(animals)

#> [1] "character"

Breakout Rooms I (~10 min.)

Exercise 1: Create Vector Objects

Create two objects (vectors) named short_vec and long_vec. To short_vec, assign the integers 1 through 5, and to long_vec, assign the integers 1 through 10. View each of the objects and check their class.

Hint (click here)


The colon can be used to define a sequence of integers in R, for example, 1:3 represents the vector 1,2,3.

Solution (click here)
short_vec <- 1:5
long_vec <- 1:10

short_vec

#> [1] 1 2 3 4 5

long_vec

#>  [1]  1  2  3  4  5  6  7  8  9 10


class(short_vec)

#> [1] "integer"

class(long_vec)

#> [1] "integer"

Exercise 2: Vector Operations/Recycling

Now try adding the two vectors, short_vec and long_vec, together. Assign the result to a new object named vec_sum. Think about and talk through what you expect the result might look like before you execute the code.

Hint (click here)


Before R performs an operation on any vectors, the vectors involved must be the same length. If they aren’t, the shorter vector is “recycled” until it matches the length of longer vector. Then the operation is performed.

Solution (click here)

vec_sum <- short_vec + long_vec

vec_sum

#>  [1]  2  4  6  8 10  7  9 11 13 15

Exercise 3: Vector Operations/Recycling II

Just to drive this point on vector recycling home a little more, let’s do more operation - this time, subtract 3 (itself a vector of length 1) from short_vec. Again, try to predict what will happen before you run it.

Solution (click here)
short_vec - 3

#> [1] -2 -1  0  1  2

Keep in mind that all elements of a vector in R have to be of the same class. This means you typically won’t see a vector that looks like…

mixed_vector <- c(1, "cat", 4, "dog", TRUE)

You can create such a vector, but in this case, R will view it as a character vector, meaning, for example, the 1 and 4 won’t be treated as numbers, but as characters. Such forcing of an element or object into a specific class is often referred to as coercion. Let’s go back to our object ‘x’…

x

#> [1] 6

class(x)

#> [1] "numeric"

Notice it’s assigned as numeric. Let’s say we wanted R to see it as an integer (just slightly different in R than numeric)…

x <- as.integer(x)

class(x)

#> [1] "integer"

We have used the as.integer() function to coerce x into class integer.


6 – Data Frames

Data frames are another data structure in R you’ll likely use a lot. Some characteristics of data frames…

  1. They have two dimensions (rows, columns)
  2. Each variable (column) needs to have the same number of entries.
  3. All elements of any one column have to be of the same type/class, but different columns can be of different classes.

A good way to think about a data frame is as being analogous to an Excel spreadsheet, with the caveat that all the columns have the same number of entries, which isn’t a requirement in an Excel sheet.

You can create a data frame by hand with the data.frame() function, but in many cases, you’ll create a data frame in R by reading data in from a file with one of a number of functions that are available for that purpose. Let’s try it. Here’s a small example dataset I generated in Excel…

example data set

It’s available at the following address… https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/data_frame/example_df.tsv

(In this case, the data set comes from online, but often it will be a file on your computer, and the path to the file works the same way as how you’ll use the web address below.)


Breakout Rooms II (~10 min.)

Exercise 4: Reading In A Data Frame

Create an object named “data_address” that stores the web address for the dataset.

Hint (click here)


Use the assignment operator and make sure to put the address in quotes to define it as a character string.

Solution (click here)
data_address <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/data/data_frame/example_df.tsv"

Now use the read.table() function to read the dataset in as a data frame. Assign it as an object named ‘exp_data’ and view it in R.

Hint (click here)


Use the data_address object as the first argument in the read.table() function.

Solution (click here)
exp_data <- read.table(data_address)

exp_data

#>    V1        V2        V3        V4
#> 1 Age Height_cm Eye_Color Graduated
#> 2  22    175.26      Blue      TRUE
#> 3  67    170.18     Green      TRUE
#> 4  53     165.1      Blue     FALSE
#> 5  13    134.62     Brown     FALSE
#> 6  19    147.32     Green      TRUE
#> 7  27    185.42      Gray      TRUE
#> 8  30     190.5      Blue      TRUE
#> 9  11    144.78     Brown     FALSE

Exercise 5: Getting Info About Functions

The column names (header) were read in as the first observation, and default column names (i.e. V1, V2, etc) were added. Take a look at the help page for read.table() by typing ?read.table in your R console, or by searching for “read.table” in the search box in the Help tab of the lower right RStudio panel. Look through some of the options/arguments and make an adjustment to fix the column names/header, then view the data frame again.

Hint (click here)


Rerun the command, setting “header” to TRUE, instead of the default FALSE.

Solution (click here)
exp_data <- read.table(data_address, header = TRUE)

exp_data

#>   Age Height_cm Eye_Color Graduated
#> 1  22    175.26      Blue      TRUE
#> 2  67    170.18     Green      TRUE
#> 3  53    165.10      Blue     FALSE
#> 4  13    134.62     Brown     FALSE
#> 5  19    147.32     Green      TRUE
#> 6  27    185.42      Gray      TRUE
#> 7  30    190.50      Blue      TRUE
#> 8  11    144.78     Brown     FALSE

Exercise 6: Practicing With Some Functions

Spend a few minutes playing around with some of the following functions and try to figure out what they do by applying them to the exp_data object. If it’s not clear, use the help…

  • head()
  • dim()
  • nrow()
  • ncol()
  • names()
  • str()
  • summary()
Solution (click here)
head(exp_data)

#>   Age Height_cm Eye_Color Graduated
#> 1  22    175.26      Blue      TRUE
#> 2  67    170.18     Green      TRUE
#> 3  53    165.10      Blue     FALSE
#> 4  13    134.62     Brown     FALSE
#> 5  19    147.32     Green      TRUE
#> 6  27    185.42      Gray      TRUE

#gives preview of object

dim(exp_data)

#> [1] 8 4

#returns dimensions (number of row, number of columns) of the object

nrow(exp_data)

#> [1] 8

#returns number of rows in data frame

ncol(exp_data)

#> [1] 4

#returns number of columns in data frame

names(exp_data)

#> [1] "Age"       "Height_cm" "Eye_Color" "Graduated"

#returns vector of column names

str(exp_data)

#> 'data.frame':  8 obs. of  4 variables:
#>  $ Age      : int  22 67 53 13 19 27 30 11
#>  $ Height_cm: num  175 170 165 135 147 ...
#>  $ Eye_Color: Factor w/ 4 levels "Blue","Brown",..: 1 4 1 2 4 3 1 2
#>  $ Graduated: logi  TRUE TRUE FALSE FALSE TRUE TRUE ...

#summarizes the structure of an object

summary(exp_data)

#>       Age          Height_cm     Eye_Color Graduated      
#>  Min.   :11.00   Min.   :134.6   Blue :3   Mode :logical  
#>  1st Qu.:17.50   1st Qu.:146.7   Brown:2   FALSE:3        
#>  Median :24.50   Median :167.6   Gray :1   TRUE :5        
#>  Mean   :30.25   Mean   :164.1   Green:2                  
#>  3rd Qu.:35.75   3rd Qu.:177.8                            
#>  Max.   :67.00   Max.   :190.5

#provides a summary/summary statistics for individual variables/columns 

7 – R Packages

I mentioned above when talking about functions that many have already been written for you, and some are available as soon as you open up R - those that are considered part of “base R”. All the functions we’ve used up to this point are included in that set. But there are lots of other functions available as part of additional packages you can install and load. The two most common places to get packages are the CRAN and Bioconductor repositories - I did a couple short videos on these as part of this Intro To R Playlist.

As one example, we used the function read.table() above to read in the example dataset. A similar function, read_tsv() is available as part of the readr package, which is available from CRAN, and so can be installed with the install.packages() function…

install.packages("readr")

The installation should only have to be done once. Then we use the library() function to load the library in each R session where we want to use it…

library(readr)

#> Warning: package 'readr' was built under R version 3.6.2

Now we should have access to the readr package and all of the functions contained in it.

exp_data2 <- read_tsv(data_address)

#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   Age = col_double(),
#>   Height_cm = col_double(),
#>   Eye_Color = col_character(),
#>   Graduated = col_logical()
#> )


exp_data2

#> # A tibble: 8 x 4
#>     Age Height_cm Eye_Color Graduated
#>   <dbl>     <dbl> <chr>     <lgl>    
#> 1    22      175. Blue      TRUE     
#> 2    67      170. Green     TRUE     
#> 3    53      165. Blue      FALSE    
#> 4    13      135. Brown     FALSE    
#> 5    19      147. Green     TRUE     
#> 6    27      185. Gray      TRUE     
#> 7    30      190. Blue      TRUE     
#> 8    11      145. Brown     FALSE

You might notice that exp_data2 is a tibble, while exp_data is a data frame (try the class() function on each). This small difference in the types of objects that are returned is one of the differences in the functions read.table() and read_tsv(). While the class of the objects is different, the contents of the objects are the same.

In addition to functions like install.packages() and library() that help you manage packages in R, RStudio also provides some point-and-click ways to do these same things. Check out the packages tab in the bottom-right RStudio panel.


Bonus: Breakout Rooms III (~10 min.)

Exercise 7: Reading In Compressed Data

Let’s look at one more example for a bit more practice with packages and reading data in to R. This time, we’ll try reading in a compressed (gzipped) version of the same example dataset. This one’s available from…

https://github.com/biodash/biodash.github.io/raw/master/assets/data/data_frame/example_df.tsv.gz

Let’s read this dataset in as an object named zip_data. First try using the read.table() function just like before.

Hint (click here)

Try replacing the previous address with the updated address for the compressed file. Remember to set the header argument to TRUE.

Solution (click here)
#create an object storing the web address
zip_address <- "https://github.com/biodash/biodash.github.io/raw/master/assets/data/data_frame/example_df.tsv.gz"
zip_data <- read.table(zip_address, header = TRUE)

read.table() isn’t able to uncompress a file directly from online, so you probably got an error message. However, it can automatically uncompress a file when reading it in locally (from your computer). So, see if you can use the download.file() function to get the compressed file and then read it in in a second step.

Hint (click here)

download.file() requires that two arguments are defined: url and destfile. Check ?download.file for details.

Solution (click here)
download.file(zip_address, destfile = "example_zip_file.tsv.gz")

zip_data <- read.table("example_zip_file.tsv.gz", header = TRUE)

Alternatively, you could install the data.table package and try its fread() function, which is able to download and automatically uncompress a file from online all in one step (though doing so requires another package, R.utils, so you’ll also have to get that one first if you don’t already have it)…

install.packages("R.utils")
install.packages("data.table")
library(R.utils)
library(data.table)

data.table::fread(zip_address)

#>    Age Height_cm Eye_Color Graduated
#> 1:  22    175.26      Blue      TRUE
#> 2:  67    170.18     Green      TRUE
#> 3:  53    165.10      Blue     FALSE
#> 4:  13    134.62     Brown     FALSE
#> 5:  19    147.32     Green      TRUE
#> 6:  27    185.42      Gray      TRUE
#> 7:  30    190.50      Blue      TRUE
#> 8:  11    144.78     Brown     FALSE

Not only does this offer a little more practice with objects, functions, and reading in data, but it also provides one small example of the value in having multiple functions available that do similar things. In this specific case, when reading in data directly from online, you might find read.table() to be a little easier to work with if the data are uncompressed, but fread() might make things a bit easier (one less step) if the file is compressed. And we actually saw a third function, read_tsv() earlier (from the readr package), which offers yet another option for reading in data-frame-like data. This kind of thing is common in R - there are typically multiple ways of doing things, and as you work in R, you’ll continue to pick up more and more efficient ways of doing what you want to do.





Mike Sovic
Mike Sovic
Bioinformatician at CAPS