S04E10 and S04E11: R for Data Science - Chapters 5.3 and 5.4

Data Transformation with dplyr, parts II and III: arrange() and filter().

Michael Broe

Oct 27, 2022

Introduction

Use this to download the R Markdown:

rmd_file <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/scripts/Arrange%2C%20Select.Rmd"
download.file(rmd_file, destfile = "arrange_select.Rmd")

Today we are going to finish off the material on the dplyr function arrange() that we didn’t get to last week, and introduce a new dplyr function select().

The way I am presenting this is through an RMarkdown document, to interactively explore the material in these two sections of Chapter 5, Data transformation

Recall that for all dplyr data manipulation functions there is a common template:

the first argument of the function is the input data frame
the next arguments say what you want to do with that data frame, using variable names (no quotes)
the result is a new dataframe

The fact that all these functions have a common template makes it possible to chain steps together, to make complex code chunks out of simple steps, one step at a time, as we will see below.

We will again be using the nycflights13 and tidyverse packages, so we first need make sure these packages are installed, and then load them for the current session by doing library() commands:

library(tidyverse)
library(nycflights13)

`arrange()`

Refer to the biodash page to see where we left off:

S04E08: R for Data Science - Chapter 5.1 - 5.3

arrange() is the equivalent of the Excel sort command.

So what happens if just ‘arrange’ flights (the data frame) with no other arguments?

arrange(flights)#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013     1     1      517        515       2     830     819      11 UA     
#>  2  2013     1     1      533        529       4     850     830      20 UA     
#>  3  2013     1     1      542        540       2     923     850      33 AA     
#>  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
#>  5  2013     1     1      554        600      -6     812     837     -25 DL     
#>  6  2013     1     1      554        558      -4     740     728      12 UA     
#>  7  2013     1     1      555        600      -5     913     854      19 B6     
#>  8  2013     1     1      557        600      -3     709     723     -14 EV     
#>  9  2013     1     1      557        600      -3     838     846      -8 B6     
#> 10  2013     1     1      558        600      -2     753     745       8 AA     
#> # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Nothing changes. This is identical to the original data frame:

head(flights)#> # A tibble: 6 × 19
#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517         515       2     830     819      11 UA     
#> 2  2013     1     1      533         529       4     850     830      20 UA     
#> 3  2013     1     1      542         540       2     923     850      33 AA     
#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
#> 6  2013     1     1      554         558      -4     740     728      12 UA     
#> # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

But there a whole bunch of variables we can sort by. An easy way to see them is using the glimpse() function, which basically puts the ‘columns into rows’, so you can see them more easily, without scrolling off the screen. This is a great way to see just what the columns are in a complex data frame that you inherit from someone or other.

glimpse(flights)#> Rows: 336,776
#> Columns: 19
#> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
#> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
#> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
#> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
#> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
#> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
#> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
#> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
#> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
#> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
#> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
#> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
#> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
#> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
#> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

(Notice there are 19 columns).

Say we wanted to sort by month, and pull all the ‘Christmassy’ flights to the top. We can arrange by month, and sort descending:

arrange(flights, desc(month))#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013    12     1       13       2359      14     446     445       1 B6     
#>  2  2013    12     1       17       2359      18     443     437       6 B6     
#>  3  2013    12     1      453        500      -7     636     651     -15 US     
#>  4  2013    12     1      520        515       5     749     808     -19 UA     
#>  5  2013    12     1      536        540      -4     845     850      -5 AA     
#>  6  2013    12     1      540        550     -10    1005    1027     -22 B6     
#>  7  2013    12     1      541        545      -4     734     755     -21 EV     
#>  8  2013    12     1      546        545       1     826     835      -9 UA     
#>  9  2013    12     1      549        600     -11     648     659     -11 US     
#> 10  2013    12     1      550        600     -10     825     854     -29 B6     
#> # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

And now say we want to just order the Christmas flights coming into Columbus. We can first filter() on CMH, and then chain (pipe) that filter statement into arrange, using the pipe notation.

First filter on destination CMH:

filter(flights, dest == "CMH")#> # A tibble: 3,524 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013     1     1      805        815     -10    1006    1010      -4 MQ     
#>  2  2013     1     1     1107       1115      -8    1305    1310      -5 MQ     
#>  3  2013     1     1     1153       1159      -6    1350    1341       9 EV     
#>  4  2013     1     1     1257       1300      -3    1454    1450       4 MQ     
#>  5  2013     1     1     1458       1500      -2    1658    1655       3 MQ     
#>  6  2013     1     1     1459       1501      -2    1651    1651       0 EV     
#>  7  2013     1     1     1522       1530      -8    1731    1725       6 MQ     
#>  8  2013     1     1     1759       1759       0    1957    1949       8 EV     
#>  9  2013     1     1     1816       1805      11    2013    1955      18 MQ     
#> 10  2013     1     1     2008       2015      -7    2206    2210      -4 MQ     
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

(Notice we now have only 3,524 rows, as opposed to 336,776 in the full data frame).

We can pass that output data frame on to arrange() using the ‘pipe symbol’ %>%:

filter(flights, dest == "CMH") %>% 
  arrange(desc(month))#> # A tibble: 3,524 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013    12     1      644        614      30     836     805      31 EV     
#>  2  2013    12     1     1129       1135      -6    1316    1330     -14 MQ     
#>  3  2013    12     1     1253       1259      -6    1435    1454     -19 MQ     
#>  4  2013    12     1     1328       1326       2    1507    1524     -17 EV     
#>  5  2013    12     1     1513       1515      -2    1709    1720     -11 MQ     
#>  6  2013    12     1     1629       1455      94    1822    1650      92 MQ     
#>  7  2013    12     1     1728       1730      -2    1918    1925      -7 MQ     
#>  8  2013    12     1     1922       1930      -8    2102    2130     -28 MQ     
#>  9  2013    12     1     1951       1930      21    2121    2115       6 MQ     
#> 10  2013    12     2      622        610      12     759     801      -2 EV     
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

So this is a first example of chaining together simple steps, to get a more complex result.

Finally, with arrange, you can add other variables to ‘break ties’. In the following example, we first sort on month (descending), and then sort on dep_delay:

filter(flights, dest == "CMH") %>% 
  arrange(desc(month), dep_delay)#> # A tibble: 3,524 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013    12     4     1910       1930     -20    2101    2130     -29 MQ     
#>  2  2013    12     7     1243       1259     -16    1434    1454     -20 MQ     
#>  3  2013    12    18     1715       1730     -15    1919    1925      -6 MQ     
#>  4  2013    12    26     1918       1930     -12    2116    2130     -14 MQ     
#>  5  2013    12    27     1918       1930     -12    2057    2130     -33 MQ     
#>  6  2013    12    10     1504       1515     -11    1731    1720      11 MQ     
#>  7  2013    12    12     1719       1730     -11    1937    1925      12 MQ     
#>  8  2013    12    16     1919       1930     -11    2108    2130     -22 MQ     
#>  9  2013    12    31     1444       1455     -11    1637    1650     -13 MQ     
#> 10  2013    12    12     1125       1135     -10    1321    1330      -9 MQ     
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Break out exercises:

Again, see here: S04E08: R for Data Science - Chapter 5.1 - 5.3

at the bottom of the page, under III – Chapter 5.3: arrange().

`select()`

The previous data manipulation functions we’ve looked at, filter() and arrange(), work on observations (i.e. rows).

The next function we’ll look at, select(), works on directly on variables (i.e. columns).

If you have hundreds of columns in a data frame (many of which you may not be interested in for the current analysis) you can subset the columns. We saw above that flights has 19 columns. This is a serious example in terms of rows (336,776), but pretty trivial in terms of columns. But it’s still a good toy example.

It will be useful to import this example data frame into our local Environment so we can visualize it in RStudio (at the moment it is just ‘floating’ out there in the system: we can access it, but we can’t see it).

So after:

library(nycflights13)

We want to do:

my_flights <- flights

(We’ve just created a personal ‘local object’ data frame in our current session, pulling in data from the nycflights13 package). And it should appear in the Environment tab in RStudio.

So, how can we focus on a subset of variables in the flights data frame? One way is to explicitly name the columns you want to keep using the select() function (remember, the first argument is the data frame; the following arguments are the column names):

select(flights, year, month, day)#> # A tibble: 336,776 × 3
#>     year month   day
#>    <int> <int> <int>
#>  1  2013     1     1
#>  2  2013     1     1
#>  3  2013     1     1
#>  4  2013     1     1
#>  5  2013     1     1
#>  6  2013     1     1
#>  7  2013     1     1
#>  8  2013     1     1
#>  9  2013     1     1
#> 10  2013     1     1
#> # … with 336,766 more rows

Often this is fine, but as you move on to larger data sets, you might want to use various selection features instead of just explictly listing want you want to keep.

These are listed if you do:

?select

These use the same syntax and semantics as filter() for rows, using the same logical combinations.

Selection features

`:` ‘range’

You can select a range of columns using the : range operator. This is really only efficient if the original data frame is organized in a way that is useful for your purposes. The good news is that don’t have to select all varibles explicitly, one by one. The range operator selects consecutive variables in the data frame.

select(flights, year:day)#> # A tibble: 336,776 × 3
#>     year month   day
#>    <int> <int> <int>
#>  1  2013     1     1
#>  2  2013     1     1
#>  3  2013     1     1
#>  4  2013     1     1
#>  5  2013     1     1
#>  6  2013     1     1
#>  7  2013     1     1
#>  8  2013     1     1
#>  9  2013     1     1
#> 10  2013     1     1
#> # … with 336,766 more rows

Notice that the month variable is automatically included, even though it’s not mentioned in the select statement.

But there is more organization in this data frame. Say we wanted to drill down just into the departure and arrival times:

select(flights, dep_time:arr_delay)#> # A tibble: 336,776 × 6
#>    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#>       <int>          <int>     <dbl>    <int>          <int>     <dbl>
#>  1      517            515         2      830            819        11
#>  2      533            529         4      850            830        20
#>  3      542            540         2      923            850        33
#>  4      544            545        -1     1004           1022       -18
#>  5      554            600        -6      812            837       -25
#>  6      554            558        -4      740            728        12
#>  7      555            600        -5      913            854        19
#>  8      557            600        -3      709            723       -14
#>  9      557            600        -3      838            846        -8
#> 10      558            600        -2      753            745         8
#> # … with 336,766 more rows

`!` ‘complement’

You can also drop columns (and ranges of columns) using the logical complement sign:

select(flights, !(year:day))#> # A tibble: 336,776 × 16
#>    dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
#>      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
#>  1     517     515       2     830     819      11 UA        1545 N14228  EWR   
#>  2     533     529       4     850     830      20 UA        1714 N24211  LGA   
#>  3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
#>  4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
#>  5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
#>  6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
#>  7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
#>  8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
#>  9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
#> 10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
#> # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> #   ⁵sched_arr_time, ⁶arr_delay

Now we just have 16 columns, as opposed to the original 19.

Just a note. In the text we are using, the syntax is:

select(flights, -(year:day))

But based on this:

select.R

I get the sense that ! is the current recommendation for taking the complement in select() statements and that - is deprecated. And FYI: documentation always lags behind implementation.

We can do exactly the same thing by explicitly listing the columns we want to drop, but, there is a gotcha here.

This (which was my first guess) does not work:

select(flights, !(year, month, day)

Instead, we need to wrap the dropped columns in a c() vector:

select(flights, !c(year, month, day)) #> # A tibble: 336,776 × 16
#>    dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
#>      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
#>  1     517     515       2     830     819      11 UA        1545 N14228  EWR   
#>  2     533     529       4     850     830      20 UA        1714 N24211  LGA   
#>  3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
#>  4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
#>  5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
#>  6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
#>  7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
#>  8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
#>  9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
#> 10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
#> # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> #   ⁵sched_arr_time, ⁶arr_delay

This is the syntax for combining selections in the current dplyr package.

Let’s check if it works on positive selections:

select(flights, c(year, month, day))#> # A tibble: 336,776 × 3
#>     year month   day
#>    <int> <int> <int>
#>  1  2013     1     1
#>  2  2013     1     1
#>  3  2013     1     1
#>  4  2013     1     1
#>  5  2013     1     1
#>  6  2013     1     1
#>  7  2013     1     1
#>  8  2013     1     1
#>  9  2013     1     1
#> 10  2013     1     1
#> # … with 336,766 more rows

Yep, all the same.

So it seems it’s optional for positive selections, but necessary for negative selections.

Selection helpers

There are other ways to select columns efficiently without explicitly naming them.

Here are some examples of helpers which select variables by pattern-matching over the names. Note that the search term must be wrapped in quotes (since we are searching on text in the column name):

flights %>% 
  select(ends_with("time"))#> # A tibble: 336,776 × 5
#>    dep_time sched_dep_time arr_time sched_arr_time air_time
#>       <int>          <int>    <int>          <int>    <dbl>
#>  1      517            515      830            819      227
#>  2      533            529      850            830      227
#>  3      542            540      923            850      160
#>  4      544            545     1004           1022      183
#>  5      554            600      812            837      116
#>  6      554            558      740            728      150
#>  7      555            600      913            854      158
#>  8      557            600      709            723       53
#>  9      557            600      838            846      140
#> 10      558            600      753            745      138
#> # … with 336,766 more rows

flights %>% 
  select(starts_with("time"))#> # A tibble: 336,776 × 1
#>    time_hour          
#>    <dttm>             
#>  1 2013-01-01 05:00:00
#>  2 2013-01-01 05:00:00
#>  3 2013-01-01 05:00:00
#>  4 2013-01-01 05:00:00
#>  5 2013-01-01 06:00:00
#>  6 2013-01-01 05:00:00
#>  7 2013-01-01 06:00:00
#>  8 2013-01-01 06:00:00
#>  9 2013-01-01 06:00:00
#> 10 2013-01-01 06:00:00
#> # … with 336,766 more rows

We can combine these two helpers (you can also think of them as ‘filters’ or ‘constraints’ to make contact with other programming languages) using logical operators. Here is an ‘OR’ statement using the | syntax. It means the selection returns all selections that match either columns that end with “time” or start with “time”.

flights %>% 
  select(ends_with("time") | starts_with("time"))#> # A tibble: 336,776 × 6
#>    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
#>       <int>          <int>    <int>          <int>    <dbl> <dttm>             
#>  1      517            515      830            819      227 2013-01-01 05:00:00
#>  2      533            529      850            830      227 2013-01-01 05:00:00
#>  3      542            540      923            850      160 2013-01-01 05:00:00
#>  4      544            545     1004           1022      183 2013-01-01 05:00:00
#>  5      554            600      812            837      116 2013-01-01 06:00:00
#>  6      554            558      740            728      150 2013-01-01 05:00:00
#>  7      555            600      913            854      158 2013-01-01 06:00:00
#>  8      557            600      709            723       53 2013-01-01 06:00:00
#>  9      557            600      838            846      140 2013-01-01 06:00:00
#> 10      558            600      753            745      138 2013-01-01 06:00:00
#> # … with 336,766 more rows

In this particular case there is a more compact way to get the same result, using the contains() helper. But this solution has lower resolution, since “time” could be anywhere in the column name.

flights %>% 
  select(contains("time"))#> # A tibble: 336,776 × 6
#>    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
#>       <int>          <int>    <int>          <int>    <dbl> <dttm>             
#>  1      517            515      830            819      227 2013-01-01 05:00:00
#>  2      533            529      850            830      227 2013-01-01 05:00:00
#>  3      542            540      923            850      160 2013-01-01 05:00:00
#>  4      544            545     1004           1022      183 2013-01-01 05:00:00
#>  5      554            600      812            837      116 2013-01-01 06:00:00
#>  6      554            558      740            728      150 2013-01-01 05:00:00
#>  7      555            600      913            854      158 2013-01-01 06:00:00
#>  8      557            600      709            723       53 2013-01-01 06:00:00
#>  9      557            600      838            846      140 2013-01-01 06:00:00
#> 10      558            600      753            745      138 2013-01-01 06:00:00
#> # … with 336,766 more rows

But you get the idea. The usefulness of these ‘selection helpers’ depends on the column naming conventions you create (or most likely inherit) from a colleague or online.

Break out exercises:

See 5.4.1 in the text.

But just state with these:

Create a couple of select queries which pull out dep_time, dep_delay, arr_time, and arr_delay from flights, just using what we learned above.

One should be explicit, and others (at least slightly!) more efficient.

If you get through these with no issues, feel free to explore more of the execises in 5.4.1.

codeclub r4ds

← S04E09: R for Data Science - Chapter 5.1 - 5.2 S04E12: R for Data Science - Chapter 5.5: mutate →

S04E10 and S04E11: R for Data Science - Chapters 5.3 and 5.4

Introduction

arrange()

Break out exercises:

select()

Selection features

: ‘range’

! ‘complement’

Selection helpers

Break out exercises:

Michael Broe

Bioinformatician at EEOB

`arrange()`

`select()`

`:` ‘range’

`!` ‘complement’