S05E05: R for Data Science (2e) - Ch. 8 - Data Import, Part II

Today, we’ll continue with the R4DS chapter on data import (and export).



Introduction

Today we will continue with the R for Data Science chapter 8 on importing data.

We will cover a few more tricks to import data with readr, and will also cover exporting data. We will talk about:

  1. Controlling column types when reading data,

  2. Reading data from multiple files, and

  3. Writing to a file.

We will again be using the tidyverse and janitor packages, so we first need to make sure these packages are installed, and then load them for the current session using library() commands:

# If you don't have these installed:
#install.packages("tidyverse")
#install.packages("janitor")

# To load the packages
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.0      readr     2.1.4
#>  forcats   1.0.0      stringr   1.5.0
#>  ggplot2   3.4.1      tibble    3.1.8
#>  lubridate 1.9.2      tidyr     1.3.0
#>  purrr     1.0.1     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package to force all conflicts to become errors
library(janitor)
#> 
#> Attaching package: 'janitor'
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test


Book Chapter 8.3 and 8.4

Let’s switch to the book for this part.
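
As a quick preview of those two sections, here is a minimal sketch that sets column types explicitly while reading several files in one call. The file names and values are made up for illustration and written to a temporary directory; they are not the book's sales files.

```r
library(readr)

# Hypothetical example files, written to a temporary directory
tmp_dir <- tempfile("sales_demo")
dir.create(tmp_dir)
jan_file <- file.path(tmp_dir, "jan.csv")
feb_file <- file.path(tmp_dir, "feb.csv")
writeLines(c("item,n", "1234,3", "8820,1"), jan_file)
writeLines(c("item,n", "7253,6"), feb_file)

# Passing a vector of paths to read_csv() stacks the files row-wise;
# `id` adds a column that records which file each row came from
sales_demo <- read_csv(
  c(jan_file, feb_file),
  id = "file",
  col_types = cols(item = col_character(), n = col_integer())
)
sales_demo
```

Reading item as a character column is a choice made for this sketch, on the assumption that item codes are labels rather than quantities.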



Breakout Rooms I

We’ll use the 03-sales.csv file from the examples in the book. You can download it as follows to your current R working directory:

url_csv <- "https://github.com/biodash/biodash.github.io/raw/master/content/codeclub/S05E06/03-sales.csv"
download.file(url = url_csv, destfile = "03-sales.csv")

Exercise 1

Read the 03-sales.csv file, but with the following twist: read all columns as factors.

Hints

Use the col_types argument and within that, call the cols function and specify a .default.


Solution
read_csv("03-sales.csv",
         col_types = cols(.default = col_factor()))
#> # A tibble: 6 × 5
#>   month year  brand item  n    
#>   <fct> <fct> <fct> <fct> <fct>
#> 1 March 2019  1     1234  3    
#> 2 March 2019  1     3627  1    
#> 3 March 2019  1     8820  3    
#> 4 March 2019  2     7253  1    
#> 5 March 2019  2     8766  3    
#> 6 March 2019  2     8288  6
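
The .default argument can also be combined with per-column overrides. The sketch below is one way to do that, using a small inline CSV with made-up rows in the same layout as 03-sales.csv (wrapped in I() so that readr treats the string as data rather than as a file path). Every column is read as a factor except n, which is read as an integer.

```r
library(readr)

# Made-up rows mimicking the layout of 03-sales.csv
csv_text <- "month,year,brand,item,n
March,2019,1,1234,3
March,2019,2,7253,1"

# Everything becomes a factor, except `n`, which is parsed as an integer
sales_mixed <- read_csv(
  I(csv_text),
  col_types = cols(.default = col_factor(), n = col_integer())
)
sales_mixed
```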


Book Chapter 8.5 and 8.6

Let’s switch back to the book for this part.



Breakout Rooms II

Exercise 2

Write the dataframe that you read in Exercise 1 to a CSV file. Recall that all columns in the dataframe are stored as factors.

Then, read the CSV file you just created back in, without specifying any additional arguments to read_csv().

Check whether the columns are still factors, and explain why.

Hints
  • Assign the initial read_csv() output to a dataframe, then use write_csv() to write it to a CSV file.

  • Recall that a CSV file is a plain text file. Can a plain text file store “metadata” about column types?


Solution
sales <- read_csv("03-sales.csv",
                  col_types = cols(.default = col_factor()))

write_csv(sales, "sales.csv")

read_csv("sales.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): month
#> dbl (4): year, brand, item, n
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 5
#>   month  year brand  item     n
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 March  2019     1  1234     3
#> 2 March  2019     1  3627     1
#> 3 March  2019     1  8820     3
#> 4 March  2019     2  7253     1
#> 5 March  2019     2  8766     3
#> 6 March  2019     2  8288     6

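When we read the file back in, the columns are no longer factors: month is a character column and the other four are numeric (double). A CSV file is plain text that stores only the values, not the column types, so read_csv() has to guess the types again from the file contents.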
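
If you need to stick with CSV, the column types have to be supplied again when reading. A minimal sketch, using a small made-up tibble and a temporary file rather than the sales data:

```r
library(readr)
library(tibble)

# Made-up data frame in which every column is a factor
demo <- tibble(
  month = factor(c("March", "April")),
  brand = factor(c("1", "2"))
)

csv_file <- tempfile(fileext = ".csv")
write_csv(demo, csv_file)

# Without col_types, readr guesses the types from the text in the file
guessed <- read_csv(csv_file, show_col_types = FALSE)

# Supplying col_types again turns the columns back into factors
restored <- read_csv(csv_file, col_types = cols(.default = col_factor()))

sapply(guessed, class)
sapply(restored, class)
```

Note that col_factor() without a levels argument takes the levels in the order in which they appear in the file, so the level order may differ from that of the original factor.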


Exercise 3

Repeat what you did in Exercise 2, but now write to and read from an RDS file.

Again, check whether the columns are still factors, and explain why.

Hints

Use the write_rds() and read_rds() functions.


Solution
sales <- read_csv("03-sales.csv",
                  col_types = cols(.default = col_factor()))

write_rds(sales, "sales.rds")

read_rds("sales.rds")
#> # A tibble: 6 × 5
#>   month year  brand item  n    
#>   <fct> <fct> <fct> <fct> <fct>
#> 1 March 2019  1     1234  3    
#> 2 March 2019  1     3627  1    
#> 3 March 2019  1     8820  3    
#> 4 March 2019  2     7253  1    
#> 5 March 2019  2     8766  3    
#> 6 March 2019  2     8288  6

The columns are still factors because RDS files preserve all information about R objects, including column type information.
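
One way to check this kind of round trip programmatically is to compare the object before and after with identical(). The sketch below uses a small made-up tibble and a temporary file; the custom level order is only there to show that factor levels survive as well.

```r
library(readr)
library(tibble)

# Made-up data frame with a factor column whose levels are in a custom order
demo <- tibble(
  month = factor(c("March", "April"), levels = c("March", "April")),
  n = c(3L, 1L)
)

rds_file <- tempfile(fileext = ".rds")
write_rds(demo, rds_file)
demo_back <- read_rds(rds_file)

# TRUE if column types, factor levels, and values all survived the round trip
identical(demo, demo_back)

# The custom level order is kept as well
levels(demo_back$month)
```

write_rds() and read_rds() are readr's wrappers around base R's saveRDS() and readRDS(), so the resulting file can only be read back into R.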


Stephen Opiyo
Biostatistician at MCIC