S04E10 and S04E11: R for Data Science - Chapters 5.3 and 5.4
Data Transformation with dplyr, parts II and III: arrange()
and filter()
.
Introduction
Use this to download the R Markdown:
rmd_file <- "https://raw.githubusercontent.com/biodash/biodash.github.io/master/assets/scripts/Arrange%2C%20Select.Rmd"
download.file(rmd_file, destfile = "arrange_select.Rmd")
Today we are going to finish off the material on the dplyr function arrange()
that we didn’t get to last week, and introduce a new dplyr function select()
.
The way I am presenting this is through an RMarkdown document, to interactively explore the material in these two sections of Chapter 5, Data transformation
Recall that for all dplyr data manipulation functions there is a common template:
-
the first argument of the function is the input data frame
-
the next arguments say what you want to do with that data frame, using variable names (no quotes)
-
the result is a new dataframe
The fact that all these functions have a common template makes it possible to chain steps together, to make complex code chunks out of simple steps, one step at a time, as we will see below.
We will again be using the nycflights13
and tidyverse
packages, so we first need make sure these packages are installed, and then load them for the current session by doing library()
commands:
arrange()
Refer to the biodash page to see where we left off:
S04E08: R for Data Science - Chapter 5.1 - 5.3
arrange()
is the equivalent of the Excel sort
command.
So what happens if just ‘arrange’ flights (the data frame) with no other arguments?
arrange(flights)#> # A tibble: 336,776 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> 7 2013 1 1 555 600 -5 913 854 19 B6
#> 8 2013 1 1 557 600 -3 709 723 -14 EV
#> 9 2013 1 1 557 600 -3 838 846 -8 B6
#> 10 2013 1 1 558 600 -2 753 745 8 AA
#> # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Nothing changes. This is identical to the original data frame:
head(flights)#> # A tibble: 6 × 19
#> year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
But there a whole bunch of variables we can sort by. An easy way to see them is using the glimpse()
function, which basically puts the ‘columns into rows’, so you can see them more easily, without scrolling off the screen. This is a great way to see just what the columns are in a complex data frame that you inherit from someone or other.
glimpse(flights)#> Rows: 336,776
#> Columns: 19
#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
#> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
#> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
#> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
#> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
(Notice there are 19 columns).
Say we wanted to sort by month, and pull all the ‘Christmassy’ flights to the top. We can arrange by month, and sort descending:
arrange(flights, desc(month))#> # A tibble: 336,776 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 12 1 13 2359 14 446 445 1 B6
#> 2 2013 12 1 17 2359 18 443 437 6 B6
#> 3 2013 12 1 453 500 -7 636 651 -15 US
#> 4 2013 12 1 520 515 5 749 808 -19 UA
#> 5 2013 12 1 536 540 -4 845 850 -5 AA
#> 6 2013 12 1 540 550 -10 1005 1027 -22 B6
#> 7 2013 12 1 541 545 -4 734 755 -21 EV
#> 8 2013 12 1 546 545 1 826 835 -9 UA
#> 9 2013 12 1 549 600 -11 648 659 -11 US
#> 10 2013 12 1 550 600 -10 825 854 -29 B6
#> # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
And now say we want to just order the Christmas flights coming into Columbus. We can first filter()
on CMH
, and then chain (pipe) that filter statement into arrange, using the pipe notation.
First filter on destination CMH
:
filter(flights, dest == "CMH")#> # A tibble: 3,524 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 805 815 -10 1006 1010 -4 MQ
#> 2 2013 1 1 1107 1115 -8 1305 1310 -5 MQ
#> 3 2013 1 1 1153 1159 -6 1350 1341 9 EV
#> 4 2013 1 1 1257 1300 -3 1454 1450 4 MQ
#> 5 2013 1 1 1458 1500 -2 1658 1655 3 MQ
#> 6 2013 1 1 1459 1501 -2 1651 1651 0 EV
#> 7 2013 1 1 1522 1530 -8 1731 1725 6 MQ
#> 8 2013 1 1 1759 1759 0 1957 1949 8 EV
#> 9 2013 1 1 1816 1805 11 2013 1955 18 MQ
#> 10 2013 1 1 2008 2015 -7 2206 2210 -4 MQ
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
(Notice we now have only 3,524 rows, as opposed to 336,776 in the full data frame).
We can pass that output data frame on to arrange()
using the ‘pipe symbol’ %>%
:
filter(flights, dest == "CMH") %>%
arrange(desc(month))#> # A tibble: 3,524 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 12 1 644 614 30 836 805 31 EV
#> 2 2013 12 1 1129 1135 -6 1316 1330 -14 MQ
#> 3 2013 12 1 1253 1259 -6 1435 1454 -19 MQ
#> 4 2013 12 1 1328 1326 2 1507 1524 -17 EV
#> 5 2013 12 1 1513 1515 -2 1709 1720 -11 MQ
#> 6 2013 12 1 1629 1455 94 1822 1650 92 MQ
#> 7 2013 12 1 1728 1730 -2 1918 1925 -7 MQ
#> 8 2013 12 1 1922 1930 -8 2102 2130 -28 MQ
#> 9 2013 12 1 1951 1930 21 2121 2115 6 MQ
#> 10 2013 12 2 622 610 12 759 801 -2 EV
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
So this is a first example of chaining together simple steps, to get a more complex result.
Finally, with arrange, you can add other variables to ‘break ties’. In the following example, we first sort on month
(descending), and then sort on dep_delay
:
filter(flights, dest == "CMH") %>%
arrange(desc(month), dep_delay)#> # A tibble: 3,524 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 12 4 1910 1930 -20 2101 2130 -29 MQ
#> 2 2013 12 7 1243 1259 -16 1434 1454 -20 MQ
#> 3 2013 12 18 1715 1730 -15 1919 1925 -6 MQ
#> 4 2013 12 26 1918 1930 -12 2116 2130 -14 MQ
#> 5 2013 12 27 1918 1930 -12 2057 2130 -33 MQ
#> 6 2013 12 10 1504 1515 -11 1731 1720 11 MQ
#> 7 2013 12 12 1719 1730 -11 1937 1925 12 MQ
#> 8 2013 12 16 1919 1930 -11 2108 2130 -22 MQ
#> 9 2013 12 31 1444 1455 -11 1637 1650 -13 MQ
#> 10 2013 12 12 1125 1135 -10 1321 1330 -9 MQ
#> # … with 3,514 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Break out exercises:
Again, see here: S04E08: R for Data Science - Chapter 5.1 - 5.3
at the bottom of the page, under III – Chapter 5.3: arrange().
select()
The previous data manipulation functions we’ve looked at, filter()
and arrange()
, work on observations (i.e. rows).
The next function we’ll look at, select()
, works on directly on variables (i.e. columns).
If you have hundreds of columns in a data frame (many of which you may not be interested in for the current analysis) you can subset the columns. We saw above that flights
has 19 columns. This is a serious example in terms of rows (336,776), but pretty trivial in terms of columns. But it’s still a good toy example.
It will be useful to import this example data frame into our local Environment so we can visualize it in RStudio (at the moment it is just ‘floating’ out there in the system: we can access it, but we can’t see it).
So after:
We want to do:
my_flights <- flights
(We’ve just created a personal ‘local object’ data frame in our current session, pulling in data from the nycflights13
package). And it should appear in the Environment tab in RStudio.
So, how can we focus on a subset of variables in the flights data frame? One way is to explicitly name the columns you want to keep using the select()
function (remember, the first argument is the data frame; the following arguments are the column names):
select(flights, year, month, day)#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> 7 2013 1 1
#> 8 2013 1 1
#> 9 2013 1 1
#> 10 2013 1 1
#> # … with 336,766 more rows
Often this is fine, but as you move on to larger data sets, you might want to use various selection features instead of just explictly listing want you want to keep.
These are listed if you do:
?select
These use the same syntax and semantics as filter()
for rows, using the same logical combinations.
Selection features
:
‘range’
You can select a range of columns using the :
range operator. This is really only efficient if the original data frame is organized in a way that is useful for your purposes. The good news is that don’t have to select all varibles explicitly, one by one. The range operator selects consecutive variables in the data frame.
select(flights, year:day)#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> 7 2013 1 1
#> 8 2013 1 1
#> 9 2013 1 1
#> 10 2013 1 1
#> # … with 336,766 more rows
Notice that the month
variable is automatically included, even though it’s not mentioned in the select statement.
But there is more organization in this data frame. Say we wanted to drill down just into the departure and arrival times:
select(flights, dep_time:arr_delay)#> # A tibble: 336,776 × 6
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> <int> <int> <dbl> <int> <int> <dbl>
#> 1 517 515 2 830 819 11
#> 2 533 529 4 850 830 20
#> 3 542 540 2 923 850 33
#> 4 544 545 -1 1004 1022 -18
#> 5 554 600 -6 812 837 -25
#> 6 554 558 -4 740 728 12
#> 7 555 600 -5 913 854 19
#> 8 557 600 -3 709 723 -14
#> 9 557 600 -3 838 846 -8
#> 10 558 600 -2 753 745 8
#> # … with 336,766 more rows
!
‘complement’
You can also drop columns (and ranges of columns) using the logical complement
sign:
select(flights, !(year:day))#> # A tibble: 336,776 × 16
#> dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
#> 1 517 515 2 830 819 11 UA 1545 N14228 EWR
#> 2 533 529 4 850 830 20 UA 1714 N24211 LGA
#> 3 542 540 2 923 850 33 AA 1141 N619AA JFK
#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
#> 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
#> 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
#> 7 555 600 -5 913 854 19 B6 507 N516JB EWR
#> 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
#> 9 557 600 -3 838 846 -8 B6 79 N593JB JFK
#> 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
#> # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> # ⁵sched_arr_time, ⁶arr_delay
Now we just have 16 columns, as opposed to the original 19.
Just a note. In the text we are using, the syntax is:
select(flights, -(year:day))
But based on this:
I get the sense that !
is the current recommendation for taking the complement in select()
statements and that -
is deprecated. And FYI: documentation always lags behind implementation.
We can do exactly the same thing by explicitly listing the columns we want to drop, but, there is a gotcha here.
This (which was my first guess) does not work:
select(flights, !(year, month, day)
Instead, we need to wrap the dropped columns in a c()
vector:
select(flights, !c(year, month, day)) #> # A tibble: 336,776 × 16
#> dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
#> 1 517 515 2 830 819 11 UA 1545 N14228 EWR
#> 2 533 529 4 850 830 20 UA 1714 N24211 LGA
#> 3 542 540 2 923 850 33 AA 1141 N619AA JFK
#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
#> 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
#> 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
#> 7 555 600 -5 913 854 19 B6 507 N516JB EWR
#> 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
#> 9 557 600 -3 838 846 -8 B6 79 N593JB JFK
#> 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
#> # … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
#> # ⁵sched_arr_time, ⁶arr_delay
This is the syntax for combining selections in the current dplyr package.
Let’s check if it works on positive selections:
select(flights, c(year, month, day))#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> 7 2013 1 1
#> 8 2013 1 1
#> 9 2013 1 1
#> 10 2013 1 1
#> # … with 336,766 more rows
Yep, all the same.
So it seems it’s optional for positive selections, but necessary for negative selections.
Selection helpers
There are other ways to select columns efficiently without explicitly naming them.
Here are some examples of helpers which select variables by pattern-matching over the names. Note that the search term must be wrapped in quotes (since we are searching on text in the column name):
flights %>%
select(ends_with("time"))#> # A tibble: 336,776 × 5
#> dep_time sched_dep_time arr_time sched_arr_time air_time
#> <int> <int> <int> <int> <dbl>
#> 1 517 515 830 819 227
#> 2 533 529 850 830 227
#> 3 542 540 923 850 160
#> 4 544 545 1004 1022 183
#> 5 554 600 812 837 116
#> 6 554 558 740 728 150
#> 7 555 600 913 854 158
#> 8 557 600 709 723 53
#> 9 557 600 838 846 140
#> 10 558 600 753 745 138
#> # … with 336,766 more rows
flights %>%
select(starts_with("time"))#> # A tibble: 336,776 × 1
#> time_hour
#> <dttm>
#> 1 2013-01-01 05:00:00
#> 2 2013-01-01 05:00:00
#> 3 2013-01-01 05:00:00
#> 4 2013-01-01 05:00:00
#> 5 2013-01-01 06:00:00
#> 6 2013-01-01 05:00:00
#> 7 2013-01-01 06:00:00
#> 8 2013-01-01 06:00:00
#> 9 2013-01-01 06:00:00
#> 10 2013-01-01 06:00:00
#> # … with 336,766 more rows
We can combine these two helpers (you can also think of them as ‘filters’ or ‘constraints’ to make contact with other programming languages) using logical operators. Here is an ‘OR’ statement using the |
syntax. It means the selection returns all selections that match either columns that end with “time” or start with “time”.
flights %>%
select(ends_with("time") | starts_with("time"))#> # A tibble: 336,776 × 6
#> dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
#> <int> <int> <int> <int> <dbl> <dttm>
#> 1 517 515 830 819 227 2013-01-01 05:00:00
#> 2 533 529 850 830 227 2013-01-01 05:00:00
#> 3 542 540 923 850 160 2013-01-01 05:00:00
#> 4 544 545 1004 1022 183 2013-01-01 05:00:00
#> 5 554 600 812 837 116 2013-01-01 06:00:00
#> 6 554 558 740 728 150 2013-01-01 05:00:00
#> 7 555 600 913 854 158 2013-01-01 06:00:00
#> 8 557 600 709 723 53 2013-01-01 06:00:00
#> 9 557 600 838 846 140 2013-01-01 06:00:00
#> 10 558 600 753 745 138 2013-01-01 06:00:00
#> # … with 336,766 more rows
In this particular case there is a more compact way to get the same result, using the contains()
helper. But this solution has lower resolution, since “time” could be anywhere in the column name.
flights %>%
select(contains("time"))#> # A tibble: 336,776 × 6
#> dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
#> <int> <int> <int> <int> <dbl> <dttm>
#> 1 517 515 830 819 227 2013-01-01 05:00:00
#> 2 533 529 850 830 227 2013-01-01 05:00:00
#> 3 542 540 923 850 160 2013-01-01 05:00:00
#> 4 544 545 1004 1022 183 2013-01-01 05:00:00
#> 5 554 600 812 837 116 2013-01-01 06:00:00
#> 6 554 558 740 728 150 2013-01-01 05:00:00
#> 7 555 600 913 854 158 2013-01-01 06:00:00
#> 8 557 600 709 723 53 2013-01-01 06:00:00
#> 9 557 600 838 846 140 2013-01-01 06:00:00
#> 10 558 600 753 745 138 2013-01-01 06:00:00
#> # … with 336,766 more rows
But you get the idea. The usefulness of these ‘selection helpers’ depends on the column naming conventions you create (or most likely inherit) from a colleague or online.
Break out exercises:
See 5.4.1 in the text.
But just state with these:
Create a couple of select queries which pull out dep_time, dep_delay, arr_time, and arr_delay from flights, just using what we learned above.
One should be explicit, and others (at least slightly!) more efficient.
If you get through these with no issues, feel free to explore more of the execises in 5.4.1.