S05E05: R for Data Science (2e) - Ch. 8 - Data Import, Part II
Today, we’ll continue with the R4DS chapter on data import (and export).
Introduction
Today we will continue with the R for Data Science chapter 8 on importing data.
We will cover a few more tricks to import data with readr, and will also cover exporting data. We will talk about:
- Controlling column types when reading data,
- Reading data from multiple files, and
- Writing to a file.
We will again be using the tidyverse and janitor packages, so we first need to make sure these packages are installed, and then load them for the current session using library() commands:
# If you don't have these installed:
#install.packages("tidyverse")
#install.packages("janitor")
# To load the packages
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.0 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package to force all conflicts to become errors
library(janitor)
#>
#> Attaching package: 'janitor'
#>
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
Book Chapter 8.3 and 8.4
Let’s switch to the book for this part.
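As a quick reference for the multiple-files part of these book sections, here is a minimal sketch. It assumes readr 2.0 or later, and the two small files and their contents are made up for illustration (they are not the book's files).

```r
library(readr)

# Create two small CSV files in a temporary directory (made-up data)
tmp_dir <- tempdir()
file_jan <- file.path(tmp_dir, "demo-01-sales.csv")
file_feb <- file.path(tmp_dir, "demo-02-sales.csv")
write_csv(data.frame(month = "January", n = c(3, 1)), file_jan)
write_csv(data.frame(month = "February", n = c(2, 5, 4)), file_feb)

# Pass a vector of paths to read_csv() to stack the files;
# the id argument adds a column recording which file each row came from
sales_all <- read_csv(c(file_jan, file_feb), id = "file", show_col_types = FALSE)
sales_all
```

The files need to have the same columns for the rows to stack cleanly.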
Breakout Rooms I
We’ll use the 03-sales.csv file from the examples in the book. You can download it as follows to your current R working directory:
url_csv <- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E06/03-sales.csv"
download.file(url = url_csv, destfile = "03-sales.csv")
Exercise 1
Read the 03-sales.csv file, but with the following twist: read all columns as factors.
Hints
Use the col_types argument and within that, call the cols() function and specify a .default.
Solution
read_csv("03-sales.csv",
         col_types = cols(.default = col_factor()))
#> # A tibble: 6 × 5
#> month year brand item n
#> <fct> <fct> <fct> <fct> <fct>
#> 1 March 2019 1 1234 3
#> 2 March 2019 1 3627 1
#> 3 March 2019 1 8820 3
#> 4 March 2019 2 7253 1
#> 5 March 2019 2 8766 3
#> 6 March 2019 2 8288 6
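The same cols() call can also set types column by column, with .default covering whatever is left. The sketch below is only an illustration: it uses a few rows typed inline (wrapped in I(), which assumes readr 2.0 or later) instead of the downloaded file, and the choice of types per column is an assumption, not something the exercise asks for.

```r
library(readr)

# A few rows in the same layout as 03-sales.csv, typed inline for illustration
csv_text <- "month,year,brand,item,n
March,2019,1,1234,3
March,2019,1,3627,1
March,2019,2,7253,1"

sales_typed <- read_csv(
  I(csv_text),                   # I() marks the string as literal data, not a path
  col_types = cols(
    month = col_factor(),        # categorical
    brand = col_factor(),        # brand IDs are labels, not quantities
    item = col_character(),      # item codes kept as text
    .default = col_integer()     # everything else (year, n) as integers
  )
)
sales_typed
```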
Book Chapter 8.5 and 8.6
Let’s switch back to the book for this part.
Breakout Rooms II
Exercise 2
Write the dataframe that you read in Exercise 1 to a CSV file. Recall that all columns in the dataframe are stored as factors.
Then, read the CSV file you just created back in, without specifying any additional arguments to read_csv().
Check whether the columns are still factors, and explain why.
Hints
- Assign the initial read_csv() output to a dataframe, then use write_csv() to write it to a CSV file.
- Recall that a CSV file is a plain text file. Can a plain text file store “metadata” about column types?
Solution
sales <- read_csv("03-sales.csv",
                  col_types = cols(.default = col_factor()))
write_csv(sales, "sales.csv")
read_csv("sales.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): month
#> dbl (4): year, brand, item, n
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 5
#> month year brand item n
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 March 2019 1 1234 3
#> 2 March 2019 1 3627 1
#> 3 March 2019 1 8820 3
#> 4 March 2019 2 7253 1
#> 5 March 2019 2 8766 3
#> 6 March 2019 2 8288 6
When we read the file back in, the columns are no longer factors: month is now a character column and the other four columns are doubles. A CSV file is plain text and cannot store column types, so read_csv() has to guess them again from the values.
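If you do need factors after a CSV round trip, one option is to state the types again when reading. The following is a minimal sketch with a small made-up data frame (sales_demo is hypothetical) written to a temporary file:

```r
library(readr)

# Small made-up data frame with a factor column
sales_demo <- data.frame(
  month = factor(c("March", "March", "April")),
  n = c(3, 1, 6)
)
csv_path <- tempfile(fileext = ".csv")
write_csv(sales_demo, csv_path)

# Without col_types, readr guesses: month comes back as character
guessed <- read_csv(csv_path, show_col_types = FALSE)
class(guessed$month)

# With col_types, month is read as a factor again
restored <- read_csv(
  csv_path,
  col_types = cols(month = col_factor(), n = col_double())
)
class(restored$month)

# Note: col_factor() without explicit levels takes them in order of appearance,
# so the level order can differ from the original factor
levels(restored$month)
```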
Exercise 3
Repeat what you did in Exercise 2, but now write to and read from an RDS file.
Again, check whether the columns are still factors, and explain why.
Solution
sales <- read_csv("03-sales.csv",
                  col_types = cols(.default = col_factor()))
write_rds(sales, "sales.rds")
read_rds("sales.rds")
#> # A tibble: 6 × 5
#> month year brand item n
#> <fct> <fct> <fct> <fct> <fct>
#> 1 March 2019 1 1234 3
#> 2 March 2019 1 3627 1
#> 3 March 2019 1 8820 3
#> 4 March 2019 2 7253 1
#> 5 March 2019 2 8766 3
#> 6 March 2019 2 8288 6
The columns are still factors because RDS files preserve all information about R objects, including column type information.
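To check this programmatically rather than by eye, you could compare the column before and after the round trip. This is a sketch using a small made-up data frame (sales_demo is hypothetical) and a temporary file:

```r
library(readr)

# Small made-up data frame with a factor column and a custom level order
sales_demo <- data.frame(
  month = factor(c("March", "March", "April"), levels = c("March", "April")),
  n = c(3, 1, 6)
)
rds_path <- tempfile(fileext = ".rds")

# Write to RDS and read it back
write_rds(sales_demo, rds_path)
sales_back <- read_rds(rds_path)

# The column is still a factor, and its level order is preserved too
is.factor(sales_back$month)
identical(levels(sales_back$month), levels(sales_demo$month))
```

The trade-off is that RDS is an R-specific binary format, so a CSV is usually the better choice when the file needs to be opened outside of R.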