S03E06: Data structures and subsetting
Overview of data structures and how to access them
New To Code Club?
-
First, check out the Code Club Computer Setup instructions, which also has some pointers that might be helpful if you’re new to R or RStudio.
-
Please open RStudio before Code Club to test things out – if you run into issues, join the Zoom call early and we’ll troubleshoot.
Session Goals
- Learn the uses of base-R’s three subsetting operators:
[ ]
,[[ ]]
, and$
. - Learn how the behavior of these operators varies depending on the data structure you are subsetting (vector, list, or data frame).
- Learn the value of the
str()
command. - Learn how these base-R operators relate to tidyverse commands.
In our previous set of Code Clubs on stats packages, we’ve encountered data structures of various kinds. Here we put the data structure material all in one place, and put it in a wider R context. We’ll move below and beyond the tidyverse to get an overview of accessing various kinds of data structure in R.
Intro: What is ‘subsetting’ anyway?
Subsetting (also known as indexing) is simply a way of using base-R syntax to extract specific pieces of a data structure.
We’ve already seen two dplyr verbs that perform this kind of operation: filter
(to extract specific rows) and select
(to extract specific columns).
But these are tidyverse commands, and only work with data frames.
R has two more basic data structures, from which everything else is built, vectors and lists, and for these we need different subsetting operators (dplyr functions won’t work). We’ll also see that data frames are just a special kind of list, and that base-R subsetting operators also work for these, which can often be useful and efficient.
In your R experience you will almost certainly come across both code and output which does not adhere to tidyverse conventions. We have already come across outputs which are not data frames in SO3E01: t-tests and S03E02: ANOVA which we will revisit below.
Since the behavior of these operators depends on the actual data structure you are working with, it’s useful when experimenting to use them in conjunction with the str()
function, which compactly displays the internal structure of an any R object. A knowledge of the make-up of these data structures is also important when you come to write your own loops, iterations, and functions.
The most important distinction between vectors and lists is within vectors every value must be of the same type: for example, all characters, or all integers, etc. Inside lists, you can mix values of any type.
In addition, a list is best thought of as a general purpose container, which can contain not just mixed values, but also entire vectors of any type.
Since we’ll be comparing base-R with tidyverse functions, we need to load the tidyverse, and we’ll also be using the palmerpenguins
package to reproduce our previous ANOVA results:
Vectors
A vector is absolutely the most basic data structure in R. Every value in a vector must be of the same type.
There are four basic types of vector: integer, double, character, and logical. Vectors are created by hand with the c()
(combine, concatenate) function. We can check the type with the typeof()
operator. This is totally redundant if you just created the vector yourself, but when you are debugging code, or creating a vector using an expression, you might want to check exactly what type of vector is being used:
vec_which <- 1:5
vec_which
#> [1] 1 2 3 4 5
typeof(vec_which)
#> [1] "integer"
vec_chr <- c("a", "b", "c", "d", "e")
vec_chr
#> [1] "a" "b" "c" "d" "e"
typeof(vec_chr)
#> [1] "character"
Vectors have an insanely simple structure:
str(vec_dbl)
#> num [1:5] 1 2 3 4 5
str()
also displays the type, and RStudio displays the result of str()
in the Values pane. (Note that ‘double’ and ‘num(eric)’ mean exactly the same thing in R.)
For such a simple structure, there are a surprisingly large number of ways to subset a vector. We’ll just look a a small sample here, and use the following example:
(Notice for this example we are using a pedagocial trick, where the number after the decimal point indicates the position (index) of the value before the decimal point).
Positive integers return elements at the specified positions. Any expression that evaluates to a vector of positions can be used as the index. The index operator is [ ]
:
x[3]
#> [1] 3.3
x[c(3, 1)]
#> [1] 3.3 2.1
x[2:4]
#> [1] 4.2 3.3 5.4
Negative integers exclude elements at the specified positions:
x[-3]
#> [1] 2.1 4.2 5.4
x[c(-3, -1)]
#> [1] 4.2 5.4
The bottom line here is that each value in a vector has an implicit index (position), and we can use that index to pull out values of interest. This can be extremely useful when writing for-loops that move through a vector accessing one value at a time.
Attributes. One of the unusual features of R as opposed to other programming languages is that you can assign metadata of various kinds to the elements of vectors (and lists). For example, we can assign a name to each element, and then use a character vector as the index expression:
y <- c("a" = 2.1, "b" = 4.2, "c" = 3.3, "d" = 5.4)
str(y)
#> Named num [1:4] 2.1 4.2 3.3 5.4
#> - attr(*, "names")= chr [1:4] "a" "b" "c" "d"
y[c("d", "c", "a")]
#> d c a
#> 5.4 3.3 2.1
The str()
command now shows us that we now have a ‘Named’ numeric vector, and that we have a “names” attribute, which is itself a character vector.
Exercise 1a
Consider the words “yellow”, “red”, and “green”.
Create a numeric vector called “lengths” which simply shows the length of the words (in that order).
Look at its structure using str()
.
Extract the first and last elements of this vector, indexing by position.
Exercise 1b
Now create a second vector called “named_lengths”, with the same word-lengths, but now also using a corresponding names attribute: “yellow”, “red” and “green”.
Look at its structure using str()
.
Again, extract the first and last elements, but now using a character vector as the index.
Lists
There are two main differences between vectors and lists: (i) lists can contain elements of different types; and (ii) lists can contain entire vectors as elements (and even other lists: which is why lists are sometimes referred to as recursive: it can be lists of lists of lists, ‘all the way down’. This is a topic for another day!).
Let’s directly compare the structure of a list of numbers to a vector of numbers. Just as we create vectors by hand with the c()
function, we create lists with the list()
function.
l <- list(2.1, 4.2, 3.3, 5.4)
l
#> [[1]]
#> [1] 2.1
#>
#> [[2]]
#> [1] 4.2
#>
#> [[3]]
#> [1] 3.3
#>
#> [[4]]
#> [1] 5.4
str(l)
#> List of 4
#> $ : num 2.1
#> $ : num 4.2
#> $ : num 3.3
#> $ : num 5.4
Notice the difference between printing a list (all those brackets!!) and using the str()
command, which is much more compact and readable.
What if we mix values of different types?
l_mixed <- list(2.1, 2L, T, "a")
l_mixed
#> [[1]]
#> [1] 2.1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] TRUE
#>
#> [[4]]
#> [1] "a"
str(l_mixed)
#> List of 4
#> $ : num 2.1
#> $ : int 2
#> $ : logi TRUE
#> $ : chr "a"
Things get more interesting when we create a list of vectors:
mixed_vectors <- list(c("kim", "sandy", "lee"), c(23, 21, 26))
mixed_vectors
#> [[1]]
#> [1] "kim" "sandy" "lee"
#>
#> [[2]]
#> [1] 23 21 26
str(mixed_vectors)
#> List of 2
#> $ : chr [1:3] "kim" "sandy" "lee"
#> $ : num [1:3] 23 21 26
In these examples we see the appearance of a new subsetting operator [[ ]]
, in addition to [ ]
. How do they differ? Let’s experiment, focussing on the second element of the list c(23, 21, 26)
. Let’s try to pull out that vector using the [2]
notation (it is the second element after all).
mixed_vectors_subset_a <- mixed_vectors[2]
mixed_vectors_subset_a
#> [[1]]
#> [1] 23 21 26
str(mixed_vectors_subset_a)
#> List of 1
#> $ : num [1:3] 23 21 26
This does not pull out the vector!! Instead, it returns a sublist which contains that vector as the only element (we’ll see why R does this below…).
So how to we get our hands on the actual vector? This is where the [[ ]]
operator comes in:
mixed_vectors_subset_b <- mixed_vectors[[2]]
mixed_vectors_subset_b
#> [1] 23 21 26
str(mixed_vectors_subset_b)
#> num [1:3] 23 21 26
We see that the behavior of the [ ]
operator is very different for lists: it selects the element(s) you request, but always still wrapped inside a list. It ‘shrinks’ the original list. The [[ ]]
operator on the other hand ‘drills-down’ and just returns the ‘un-listed’ vector in that position.
Data frames
The reason R does things this way is because data frames are so central to the language. Let’s build a data frame from the ground up to see how it works.
Basically the “inside” of a data frame is just a list with name attributes:
attr_list <- list(name = c("kim", "sandy", "lee"), age = c(23, 21, 26))
attr_list
#> $name
#> [1] "kim" "sandy" "lee"
#>
#> $age
#> [1] 23 21 26
str(attr_list)
#> List of 2
#> $ name: chr [1:3] "kim" "sandy" "lee"
#> $ age : num [1:3] 23 21 26
Notice that instead of str()
displaying $ : ...
for each entry, we now see attributes $ name: ...
, $ age: ...
for each entry. Also note that all those double [[ ]]
notations have disappeared when we print. This should give you a clue that $age
, for example, is a kind of alias for [[2]]
.
Finally, we can ‘wrap’ this into an official data frame structure:
my_df <- as.data.frame(attr_list)
my_df
#> name age
#> 1 kim 23
#> 2 sandy 21
#> 3 lee 26
str(my_df)
#> 'data.frame': 3 obs. of 2 variables:
#> $ name: chr "kim" "sandy" "lee"
#> $ age : num 23 21 26
Or wrap it into a tidyverse tibble:
my_tibble <- as_tibble(attr_list)
my_tibble
#> # A tibble: 3 × 2
#> name age
#> <chr> <dbl>
#> 1 kim 23
#> 2 sandy 21
#> 3 lee 26
str(my_tibble)
#> tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:3] "kim" "sandy" "lee"
#> $ age : num [1:3] 23 21 26
So a data frame is basically a list (intepreted as columns), with a names attribute for the columns (interpreted as headers). And with the extra condition that all the columns are of the same length, so it’s rectangular. So we should be able to use our standard list subsetting operators on it:
col_2 <- my_df[2]
str(col_2)
#> 'data.frame': 3 obs. of 1 variable:
#> $ age: num 23 21 26
Since a data frame is a list, subsetting using [ ]
returns the specified column still inside a data frame. What about [[ ]]
?
vec_2 <- my_df[[2]]
str(vec_2)
#> num [1:3] 23 21 26
Using [[ ]]
pulls out the data vector from the column.
Just like vectors, we can also subset a data frame by the name attribute, instead of by position:
col_2_by_name <- my_df["age"]
str(col_2_by_name)
#> 'data.frame': 3 obs. of 1 variable:
#> $ age: num 23 21 26
vec_2_by_name <- my_df[["age"]]
str(vec_2_by_name)
#> num [1:3] 23 21 26
Finally my_df$age
is simply a shorthand for my_df[["age"]]
without the [[ ]]
and the " "
:
vec_2_by_dollar_name <- my_df$age
str(vec_2_by_dollar_name)
#> num [1:3] 23 21 26
Direct comparison with tidyverse functions
The dplyr command select()
over a data frame is exactly analogous to the single bracket operator my_df["age"]
. It returns a data frame with a single column:
col_2_select <- select(my_df, "age")
col_2_select
#> age
#> 1 23
#> 2 21
#> 3 26
str(col_2_select)
#> 'data.frame': 3 obs. of 1 variable:
#> $ age: num 23 21 26
The dplyr command pull()
over a data frame is exactly analogous to the double bracket operator my_df[["age"]]
. It returns the data vector inside that column:
The ‘problem’ with these dplyr functions is that they require a data frame as input, and we recently saw in S03E01: t-tests that the statistical t-test output was not a data frame:
pop1 <- rnorm(n = 20, mean = 10, sd = 3)
pop2 <- rnorm(n = 20, mean = 10, sd = 3)
tresult <- t.test(x = pop1, y = pop2)
str(tresult)
#> List of 10
#> $ statistic : Named num 0.32
#> ..- attr(*, "names")= chr "t"
#> $ parameter : Named num 37.9
#> ..- attr(*, "names")= chr "df"
#> $ p.value : num 0.751
#> $ conf.int : num [1:2] -1.77 2.44
#> ..- attr(*, "conf.level")= num 0.95
#> $ estimate : Named num [1:2] 10.17 9.84
#> ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
#> $ null.value : Named num 0
#> ..- attr(*, "names")= chr "difference in means"
#> $ stderr : num 1.04
#> $ alternative: chr "two.sided"
#> $ method : chr "Welch Two Sample t-test"
#> $ data.name : chr "pop1 and pop2"
#> - attr(*, "class")= chr "htest"
This is not a data frame, but an ‘htest’ class object. Further, it cannot be converted to a data frame in the usual way:
as.data.frame(tresult)
Error in as.data.frame.default(tresult): cannot coerce class ‘"htest"’ to a data.frame
This is precisely why the tidyverse developed the broom::tidy()
function, which works with legacy base-R outputs, and converts them to data frames. But if you have lots of t-tests, the overhead of converting all the outputs using broom, then using dplyr functions to access data, can be inefficient and overkill.
The t-test output is not a data frame, but it is a named list, so we can subset it directly. For example, to pull out the p.value
we can do either:
tresult[[3]]
#> [1] 0.7507211
or
tresult$p.value
#> [1] 0.7507211
which is really much simpler than going through broom. In addition, we can get extra granularity very quickly using this notation. Say we want the lower bound of the confidence interval. We can ‘stack’ indexes:
tresult[[4]][[1]]
#> [1] -1.774624
or
tresult$conf.int[[1]]
#> [1] -1.774624
This is saying ‘give me the 4th element (or the conf.int element), and then give me the 1st element of that’.
Exercise 2
Reuse the t.test() code above, run str
on the output, and extract the stderr
value using both the $
and [[ ]]
indexing approaches.
Solution (click here)
tresult$stderr
#> [1] 1.041051
tresult[[7]]
#> [1] 1.041051
Exercise 3
When we ran our first ANOVA, we never actually looked at the data structure that was produced.
Run the following code and inspect the output.
bill_length_anova <-
aov(data = penguins %>% drop_na(),
bill_length_mm ~ species + sex + species*sex)
str(bill_length_anova)
Aieee!
What happens when you try to turn this into a data frame, using as.data.frame(bill_length_anova)
?
Now you can see why broom:tidy()
is so useful! To remind yourselves what the tidied version looks like, run the code:
tidy_anova <- broom::tidy(bill_length_anova)
tidy_anova
But we can still extract values from this data structure directly: you just have to work out where to look…
See if you can extract the total residual df from this data structure using the $ notation.
Solution (click here)
bill_length_anova <-
aov(data = penguins %>% drop_na(),
bill_length_mm ~ species + sex + species*sex)
str(bill_length_anova)
#> List of 13
#> $ coefficients : Named num [1:6] 37.26 9.32 8.31 3.13 1.39 ...
#> ..- attr(*, "names")= chr [1:6] "(Intercept)" "speciesChinstrap" "speciesGentoo" "sexmale" ...
#> $ residuals : Named num [1:333] -1.29 2.242 3.042 -0.558 -1.09 ...
#> ..- attr(*, "names")= chr [1:333] "1" "2" "3" "4" ...
#> $ effects : Named num [1:333] -802.79 44.75 70.8 33.7 3.82 ...
#> ..- attr(*, "names")= chr [1:333] "(Intercept)" "speciesChinstrap" "speciesGentoo" "sexmale" ...
#> $ rank : int 6
#> $ fitted.values: Named num [1:333] 40.4 37.3 37.3 37.3 40.4 ...
#> ..- attr(*, "names")= chr [1:333] "1" "2" "3" "4" ...
#> $ assign : int [1:6] 0 1 1 2 3 3
#> $ qr :List of 5
#> ..$ qr : num [1:333, 1:6] -18.2483 0.0548 0.0548 0.0548 0.0548 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:333] "1" "2" "3" "4" ...
#> .. .. ..$ : chr [1:6] "(Intercept)" "speciesChinstrap" "speciesGentoo" "sexmale" ...
#> .. ..- attr(*, "assign")= int [1:6] 0 1 1 2 3 3
#> .. ..- attr(*, "contrasts")=List of 2
#> .. .. ..$ species: chr "contr.treatment"
#> .. .. ..$ sex : chr "contr.treatment"
#> ..$ qraux: num [1:6] 1.05 1.03 1.05 1.05 1.03 ...
#> ..$ pivot: int [1:6] 1 2 3 4 5 6
#> ..$ tol : num 1e-07
#> ..$ rank : int 6
#> ..- attr(*, "class")= chr "qr"
#> $ df.residual : int 327
#> $ contrasts :List of 2
#> ..$ species: chr "contr.treatment"
#> ..$ sex : chr "contr.treatment"
#> $ xlevels :List of 2
#> ..$ species: chr [1:3] "Adelie" "Chinstrap" "Gentoo"
#> ..$ sex : chr [1:2] "female" "male"
#> $ call : language aov(formula = bill_length_mm ~ species + sex + species * sex, data = penguins %>% drop_na())
#> $ terms :Classes 'terms', 'formula' language bill_length_mm ~ species + sex + species * sex
#> .. ..- attr(*, "variables")= language list(bill_length_mm, species, sex)
#> .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:3] "bill_length_mm" "species" "sex"
#> .. .. .. ..$ : chr [1:3] "species" "sex" "species:sex"
#> .. ..- attr(*, "term.labels")= chr [1:3] "species" "sex" "species:sex"
#> .. ..- attr(*, "order")= int [1:3] 1 1 2
#> .. ..- attr(*, "intercept")= int 1
#> .. ..- attr(*, "response")= int 1
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. ..- attr(*, "predvars")= language list(bill_length_mm, species, sex)
#> .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "factor"
#> .. .. ..- attr(*, "names")= chr [1:3] "bill_length_mm" "species" "sex"
#> $ model :'data.frame': 333 obs. of 3 variables:
#> ..$ bill_length_mm: num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
#> ..$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
#> ..- attr(*, "terms")=Classes 'terms', 'formula' language bill_length_mm ~ species + sex + species * sex
#> .. .. ..- attr(*, "variables")= language list(bill_length_mm, species, sex)
#> .. .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : chr [1:3] "bill_length_mm" "species" "sex"
#> .. .. .. .. ..$ : chr [1:3] "species" "sex" "species:sex"
#> .. .. ..- attr(*, "term.labels")= chr [1:3] "species" "sex" "species:sex"
#> .. .. ..- attr(*, "order")= int [1:3] 1 1 2
#> .. .. ..- attr(*, "intercept")= int 1
#> .. .. ..- attr(*, "response")= int 1
#> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> .. .. ..- attr(*, "predvars")= language list(bill_length_mm, species, sex)
#> .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "factor" "factor"
#> .. .. .. ..- attr(*, "names")= chr [1:3] "bill_length_mm" "species" "sex"
#> - attr(*, "class")= chr [1:2] "aov" "lm"
as.data.frame(bill_length_anova)
Error in as.data.frame.default(bill_length_anova) :
cannot coerce class ‘c("aov", "lm")’ to a data.frame
tidy_anova <- broom::tidy(bill_length_anova)
tidy_anova
#> # A tibble: 4 × 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 species 2 7015. 3508. 654. 5.03e-115
#> 2 sex 1 1136. 1136. 212. 2.42e- 37
#> 3 species:sex 2 24.5 12.2 2.28 1.03e- 1
#> 4 Residuals 327 1753. 5.36 NA NA
bill_length_anova$df.residual
#> [1] 327