S04E09: R for Data Science - Chapter 5.1 - 5.2
Data Transformation with dplyr, part 1: introduction, and filter()
I – Chapter 5.1: Introduction
Key points
-
Function name conflicts: The function
filter()
in the stats package (which is loaded by default in R) will be “masked” / “overwritten” by dplyr’sfilter()
function when you load the tidyverse. To still use a masked function (or a function from an installed-but-not-loaded package!), use the “full” notation, e.g.stats::filter()
. -
A data frame is rectangular data structure (with rows and columns), while a “tibble” is a tidyverse-style data frame. Tibbles mainly differ from regular data frames in how they are printed to screen by default. See the two examples below:
cars
is a regular data frame andflights
is a tibble. -
The most common R data types are integers (tibble abbreviation:
int
), doubles (dbl
), character strings (chr
), logicals (lgl
), and factors (fctr
). -
The dplyr package is designed to work with dataframes: both the input and the output is a dataframe.
# 'mtcars' is a regular dataframe
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# 'flights' is a tibble, which affects its printing behavior
library(nycflights13)
head(flights)
#> # A tibble: 6 × 19
#> year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
II – Chapter 5.2: filter()
Key points
-
The
filter()
function removes rows (observations) from a dataframe based on certain conditions. You specify those conditions for one or more columns. -
When you refer to a column, don’t quote its name! (e.g. in
filter(flights, month == 1)
, wheremonth
is the column name.) -
Use “comparison operators” like
>
(greater than) to specify conditions. Note that two equals signs==
(and not a single,=
) signifies equality, and that!=
means “does not equal”. -
To combine multiple conditions, use logical (Boolean) operators:
&
(and),|
(or), and!
(not). Separating conditions by a comma also means “and” in dplyr, e.g. infilter(flights, month == 1, day == 1)
. -
The
%in%
operator tests if the value(s) on the left-hand side are contained in the values on the right hand side, e.g.4 %in% 1:5
asks whether 4 is contained in the sequence of numbers from 1 to 5, which will returnTRUE
. -
Missing values are denoted by
NA
, and almost any operation with anNA
will return anotherNA
. To test ifx
is or containsNA
s, don’t usex == NA
butis.na(x)
. When you filter based on a column, rows withNA
s in that column will by default be removed byfilter()
.
Exercise 1
Find all flights that…
- Had an arrival delay of two or more hours
- Flew to Houston (
IAH
orHOU
) - Were operated by United (
UA
), American (AA
), or Delta (DL
) - Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
Before you start, load the necessary packages:
Hints (click here)
-
Delays are given in minutes.
-
Times of day are numbered from
0001
(1 minute past midnight) to2400
(midnight).
Solution (click here)
In the solutions below, I am piping the output to nrow()
, so you can check if you got the same number of rows.
\1. Had an arrival delay (=> arr_delay
) of two or more hours (=> >= 120
):
#> [1] 10200
\2. Flew to Houston (IAH
or HOU
) – destination is the dest
column:
#> [1] 9313
\3. Were operated by United (UA
), American (AA
), or Delta (DL
) — this information is in the carrier
column:
#> [1] 139504
\4. Departed in summer (July, August, and September) — use the month
column:
#> [1] 86326
This would also work:
\5. Arrived more than two hours late, but didn’t leave late — use the arr_delay
(arrival delay) and dep_delay
(departure delay) columns:
#> [1] 29
\6. Were delayed by at least an hour, but made up over 30 minutes in flight — use the dep_delay
and arr_delay
columns, and note that “making up over 30 miniutes” implies that the arrival delay was more than 30 minutes smaller than the departure delay:
#> [1] 1844
\7. Departed between midnight and 6am (inclusive) — use the dep_time
column and note that 2400
is midnight:
#> [1] 9373
Exercise 3
How many flights have a missing dep_time
? What other variables are missing for these flights? What might these rows represent?
Hints (click here)
- A “missing”
dep_time
means that this cell contains the valueNA
. - Recall that you can test if something is
NA
with theis.na()
function! - To count the number of flights, you can look at the information printed along with the dataframe (
... with X more rows
), or pipe (%>%
) the dataframe into thenrow()
function, which counts the number of rows.