Session 5: ggplot2, round 2

Getting up close and personal with our data

Artwork by @allison_horst





Prep homework

Basic computer setup

  • If you didn’t already do this, please follow the Code Club Computer Setup instructions, which also has pointers for if you’re new to R or RStudio.

  • If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll troubleshoot.

New to ggplot?

Check out the last Code Club Session 4 on Visualizing Data.

If you’ve never used ggplot2 before (or even if you have), you may find this cheat sheet useful.



Getting Started

Script for today’s session

# directory for Code Club Session 2:
dir.create("S05")

# directory for our script
# ("recursive" to create two levels at once.)
dir.create("S05/scripts/")

# save the url location for today's script
todays_R_script <- 'https://raw.githubusercontent.com/biodash/biodash.github.io/master/content/codeclub/05_ggplot-round-2/Session5_ggplot2.R'

# indicate the name of the new script file
Session5_script <- "S05/scripts/Session5_script.R"

# go get that file! 
download.file(url = todays_R_script,
              destfile = Session5_script)


1 - Why visualize our data?

ggplot2 exploratory cartoon

Artwork by Allison Horst

We make data visualizations for two main reasons:

  1. To explore our data
  2. To share our data with others

Often, we think about figure generation as the last part of the scientic process, something you do as you prepare a manuscript for publication. I hope to convince you that exploring your data, and making exploratory plots is a critical part of the data analysis and interpretation process.

Today we will be using ggplot2 to make a series of plots that help us better understand the underlying structure in our dataset.

When summary statistics don’t cut it

This “Datasaurus Dozen” shows the value of looking at your data beyond means and standard deviations. In the gif above, created by Alberto Cairo, each of these 13 datasets have identical means, standard eviations, and correlations to two decimal places. And one of the datasets is a dinosaur!


What will we go over today

These geoms will help you to get better acquainted with your data.

  • geom_col() - makes bar plots. I will show you how to do this and then recommend that you don’t.
  • geom_boxplot() - makes infinitely useful boxplots.
  • geom_violin() - makes violin plots, a hybrid between a boxplot and a density plot. Very musical.
  • geom_density_ridges() - a density plot giving you the impression of a side view of a mountain range. Requires the package ggridges
  • geom_jitter() - adds all datapoints to your plot, and jitters them to handle overplotting.

I will also go over a few tricks along the way, including coord_flip(), adding labels using labs(), and changing the overall look of the plot with theme(), or pre-set themes like theme_classic() which is my go-to.



2 - Accessing our data

Let’s get set up and grab some data so that we can learn more about penguins (and ggplot2)

  • You can do this locally, or at OSC. You can find instructions if you are having trouble here.

First load your libraries.

Then let’s access the wintry palmerpenguins dataset. We will then look at penguins, the dataset we will be using for the first part of today’s Code Club. This data is collected on penguins from the Palmer Station Antarctica Long-Term Ecological Research study area.

cartoon of the palmer penguins

Artwork by Allison Horst

install.packages("palmerpenguins")

Let’s look at the data in penguins.

# look at the first 6 rows, all columns
head(penguins)

#> # A tibble: 6 x 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
#>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
#> 1 Adelie  Torge…           39.1          18.7              181        3750 male 
#> 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
#> 3 Adelie  Torge…           40.3          18                195        3250 fema…
#> 4 Adelie  Torge…           NA            NA                 NA          NA NA   
#> 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
#> 6 Adelie  Torge…           39.3          20.6              190        3650 male 
#> # … with 1 more variable: year <int>


# check the structure
# this tell us what is contained within our df
glimpse(penguins)

#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
#> $ sex               <fct> male, female, female, NA, female, male, female, mal…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

This dataset contains the following measurements about penguins at Palmer Station in Antarctica:

  • species
  • island
  • bill_length_mm
  • bill_depth_mm
  • flipper_length_mm
  • body_mass_g
  • sex
  • year

We are going to be plotting to get an understanding of bill_length_mm which is the length of the bill from the penguins face, protruding outwards (and more easily understood in the image below).

depiction of bill length protruding from the penguins face, and bill depth, the height of the bill parallel to the ground

Artwork by Allison Horst



3 - Removing NAs

Sometimes you will have NAs (or missing data). That might be informative to you, but here we are going to remove missing data using drop_na(), and assign it to a new dataframe called penguins_noNA.

# check dimensions of penguins
dim(penguins)

#> [1] 344   8


# remove NAs
penguins_noNA <- penguins %>%
  drop_na()

dim(penguins_noNA) # we have removed 11 observations

#> [1] 333   8

Note - by removing NAs, we have gotten rid of 11 observations



4 - Bar charts with geom_col() and stat_summary()

Often, people use bar charts, representing the height or the length of the bar as proportional to the average value that it represents. These charts are sometimes called dynamite plots because they resemble (when they have an error bar with whisker) those cartoon style dynamite sticks. Pow!

However, these bar charts, even if you add a standard deviation/error, really can hide the true distribution of your data, and for this reason, I and others hope you don’t select to make them.

I hope after today, you see that there is always a better chart type to make than a bar chart. But I will show you how to make them anyway.

Before we plot, let’s calculate some summary statistics so we know what we should expect.

# calculating mean bill_length_mm by species
penguins_noNA %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm))

#> `summarise()` ungrouping output (override with `.groups` argument)

#> # A tibble: 3 x 2
#>   species   mean_bill_length
#>   <fct>                <dbl>
#> 1 Adelie                38.8
#> 2 Chinstrap             48.8
#> 3 Gentoo                47.6

Just calling geom_col() does not give us what we want. Look at the y-axis scale and how out of line this is with our summary statistics.

# bar plot with geom_col()
# this is wrong!
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_col()

Using geom_col() the right way.

# bar plot, the right way with geom_col()
penguins_noNA %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm)) %>%
  ggplot(aes(x = species, y = mean_bill_length)) +
  geom_col()

#> `summarise()` ungrouping output (override with `.groups` argument)

# or you could do this in a less bulky way with stat_summary()
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  stat_summary(fun = "mean", geom = "bar")



5 - Boxplots with geom_boxplot()

A boxplot has the benefit of showing you more than the median and the standard deviation, so you can better see the true distribution of your data. In geom_boxplot():

  • lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR
  • lower hinge/bottom line of box part of boxplot = 25% quantile
  • middle = median, 50% quantile
  • upper hinge/top line of box part of boxplot = 75% quantile
  • upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR
# vertical boxplot
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_boxplot()

Adding coord_flip() makes your vertical boxplot horizontal. You could do the same thing by flipping the variables on the x and y mappings.

# horizontal boxplot
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_boxplot() +
  coord_flip()

Look at how much more information we have here than in our bar plots!



5 - Violin plots with geom_violin()

A violin plot is boxplot-esque, but shows a mirrored density distribution. This type of plot is useful when you are trying to particularly show data distribution.

Note here I have also mapped species to color, within the aes statement so it will apply globally to this plot.

# violin plot
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, color = species)) +
  geom_violin()

Adding geom_point() lets you add another layer of all the actual data points, on top of your violin plot. Remember that this is inherent in the design of ggplot2, that you can layer your plots, of different types, on top of each other.

# violin plot with data points overlaid
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_violin() +
  geom_point()

Note, I am now mapping species to fill instead of color. See the difference?

This doesn’t look too good because of overplotting, i.e., the smear of datapoints that doesn’t give you much information about distribution.

We can add geom_jitter() to introduce some small amount of randomness to our points to make us able to see them better. Seeing all your data points also lets the reader easily get a sense of your sample size.

# violin plot with data points jittered
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_violin() +
  geom_jitter()

geom_jitter() is a specialized version of geom_point(), but you could replace the geom_jitter() call with geom_point(position = "jitter) and get the same result. You can also use geom_point(position = position_jitterdodge()) if you only want jitter in the x, and don’t want any jitter in the y.

Wow, we now have so much more information about our data!



6 - Dot plots with geom_dotplot()

A dot plot plots each individual datapoint, and can stack how you like. These look a lot like the SigmaPlot plots to me.

  • binaxis can be set to “x” or “y”
  • stackdir indicates how to stack the dots: “up” (default), “down”, “center”, “centerwhole” (centered, but with dots aligned)
  • dotsize indicates the size of the dots, with 1 as default
# dotplot
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5)

#> `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.



7 - Density ridge plots with geom_density_ridges()

A density ridge plots with geom_density_ridges() requires the packages ggridges, and make multiple density plots in a staggered orientation.

You can adjust scale within geom_density_ridges() to adjust the size of each density plot, though I have left it on the default. Adding alpha sets transparency.

# install.packages("ggridges")
library(ggridges)
penguins_noNA %>%
  ggplot(aes(x = bill_length_mm, y = species, fill = species)) +
  geom_density_ridges(alpha = 0.8)

#> Picking joint bandwidth of 1.08



8 - ggplot is made for layering!

I have shown you a bunch of different plot types, and you can combine many of them together. Here is an example of combining geom_violin() and geom_jitter(), while mapping new variables to aesthetics.

penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, 
             color = sex, 
             shape = island, 
             group = species)) +
  geom_violin() +
  geom_jitter(position = position_jitterdodge(jitter.width = 2))



9 - Increase clarity and visual appeal

We can quickly make our plot:

  • prettier by setting a theme
  • more clear by setting plot labels (eg., axes, titles, legend) with labs
penguins_noNA %>%
  ggplot(aes(x = species, y = bill_length_mm, 
             color = sex, 
             shape = island, 
             group = species)) +
  geom_violin() +
  geom_jitter(position = position_jitterdodge(jitter.width = 2),
              alpha = 0.7) +
  theme_classic() +
  labs(title = "Penguin Bill Length by Species, Sex and Location",
       subtitle = "Collected at Palmer Station, Antarctica",
       x = "Penguin Species", # x axis label
       y = "Bill length (mm)", #  y axis label
       color = "Sex", # legend title
       shape = "Island") # legend title



10 - Breakout rooms!

Main exercises

Get data

We are going to use the NHANES dataset we used in Session 3 on joining. What was that data about again? Let’s refresh our memory.

library(NHANES)
knitr::kable(head(NHANES)) 

ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work Weight Length HeadCirc Height BMI BMICatUnder20yrs BMI_WHO Pulse BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 Diabetes DiabetesAge HealthGen DaysPhysHlthBad DaysMentHlthBad LittleInterest Depressed nPregnancies nBabies Age1stBaby SleepHrsNight SleepTrouble PhysActive PhysActiveDays TVHrsDay CompHrsDay TVHrsDayChild CompHrsDayChild Alcohol12PlusYr AlcoholDay AlcoholYear SmokeNow Smoke100 Smoke100n SmokeAge Marijuana AgeFirstMarij RegularMarij AgeRegMarij HardDrugs SexEver SexAge SexNumPartnLife SexNumPartYear SameSex SexOrientation PregnantNow
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51625 2009_10 male 4 0-9 49 Other NA NA NA 20000-24999 22500 1.07 9 Own NA 17.0 NA NA 105.4 15.30 NA 12.0_18.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
51630 2009_10 female 49 40-49 596 White NA Some College LivePartner 35000-44999 40000 1.91 5 Rent NotWorking 86.7 NA NA 168.4 30.57 NA 30.0_plus 86 112 75 118 82 108 74 116 76 NA 1.16 6.70 77 0.094 NA NA No NA Good 0 10 Several Several 2 2 27 8 Yes No NA NA NA NA NA Yes 2 20 Yes Yes Smoker 38 Yes 18 No NA Yes Yes 12 10 1 Yes Heterosexual NA
51638 2009_10 male 9 0-9 115 White NA NA NA 75000-99999 87500 1.84 6 Rent NA 29.8 NA NA 133.1 16.82 NA 12.0_18.5 82 86 47 84 50 84 50 88 44 NA 1.34 4.86 123 1.538 NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# kable just formats as a scrollable table for this website
# you can just use head(NHANES) or glimpse(NHANES)

Exercise 1

Create a new data frame includes the NHANES data only from individuals that are 20 years of age or older, and removes observations where there are NAs for either age subdivided by decade (AgeDecade) or total cholesterol (TotChol).

Hints (click here)

Try using a series of filter() statements. Remember, you can tell filter what you want, or what you don’t want. You can filter for if specific variables have NAs by using is.na() on your variable of interest. Also remember that ! means “not.” You will notice that if you want to use drop_NA() you need to specific which specific variables you want to use, or you will inadvertably drop a lot of observations which have missing data for variables other than those we are plotting..


Solutions (click here)
# here are a few ways to do this
NHANES_over20_noNA <- NHANES %>%
  filter(Age >20) %>%
  drop_na(AgeDecade, TotChol)

dim(NHANES_over20_noNA)

#> [1] 6408   76


NHANES_over20_noNA <- NHANES %>%
  filter(Age >20,
         !is.na(AgeDecade),
         !is.na(TotChol))

dim(NHANES_over20_noNA)

#> [1] 6408   76


Exercise 2

Create a boxplot to show the relationship between total cholesterol (TotChol) and age (AgeDecade).

Hints (click here)

Try geom_boxplot(). Map your variables of interest to the x and y aesthetics. Which you variable you put on x and y will determine if your boxplot is vertical or horizontal.


Solutions (click here)
NHANES_over20_noNA %>%
  ggplot(aes(x = AgeDecade, y = TotChol)) +
  geom_boxplot()



Exercise 3

Take your plot from Exercise 2 and make it a violin plot instead of a boxplot. Then color by age.

Hints (click here)

The geom for a violin plot is geom_violin(). You can change color by mapping to color or to fill.


Solutions (click here)

Note the difference between mapping to color vs. fill.

NHANES_over20_noNA %>%
  ggplot(aes(x = AgeDecade, y = TotChol, color = AgeDecade)) +
  geom_violin()

NHANES_over20_noNA %>%
  ggplot(aes(x = AgeDecade, y = TotChol, fill = AgeDecade)) +
  geom_violin()



Exercise 4

Make add a boxplot to your violin plot from Exercise 3. Adjust the parameters so you the plot looks good to you.

Hints (click here)

In geom_boxplot(), you can adjust the width of the boxplot by setting width = X. A width of 1 is the default.


Solutions (click here)
NHANES_over20_noNA %>%
  ggplot(aes(x = AgeDecade, y = TotChol, color = AgeDecade)) +
  geom_violin() +
  geom_boxplot(width = 0.2)



Exercise 5

Add all of the data points on top of your boxplot from Exercise 2 of total cholesterol by age. Adjust the parameters so you the plot looks good to you. While you are at it, clean up your plot labels and give your plot a title.

Hints (click here)

Remember that ggplot layers your plots, so layers that are further down in your code, will be applied on top of those that come earlier.


Solutions (click here)

geom_boxplot(outlier.shape = NA) removes the outliers from geom_boxplot(), since we are plotting all of the points, we do not want the outliers appearing twice.

NHANES_over20_noNA %>%
  ggplot(aes(x = AgeDecade, y = TotChol, color = AgeDecade)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(width = 0.3, alpha = 0.1) +
  labs(title = "Total Cholesterol by Age",
       subtitle = "Data from the National Health and Nutrition Examination Survey (NHANES)",
       x = "Age, by Decade",
       y = "Total Cholesterol, mmol/L",
       color = "Age (years)")



Bonus exercises

Bonus 1

Make a density ridge plot for age by total cholesterol.

Hints (click here)

Try geom_density_ridges(), and remember, this is not a part of ggplot2, so be sure to call library(ggridges).


Solutions (click here)
# install.packages("ggridges")
library(ggridges)
NHANES_over20_noNA %>%
  ggplot(aes(x = TotChol, y = AgeDecade, fill = AgeDecade)) +
  geom_density_ridges(alpha = 0.7) 

#> Picking joint bandwidth of 0.224



Bonus 2

Take your density ridge plot from Bonus 1, and try applying a theme from hrbrthemes to it.

Hints (click here)

hrbrthemes is not part of ggplot2 so remember to install the package, and then call library(hrbrthemes). You can google the package to see what all your theme options are. I like theme_ipsum_rc(), try that one if you like!


Solutions (click here)
# install.packages("hrbrthemes")
library(hrbrthemes)

#> NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

#>       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

#>       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

NHANES_over20_noNA %>%
  ggplot(aes(x = TotChol, y = AgeDecade, fill = AgeDecade)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  theme_ipsum_rc() 

#> Picking joint bandwidth of 0.224



Bonus 3

Tidy up your plot from Bonus 2 by giving it a title, axis labels, and try adding the median total cholesterol to each density ridge plot.

Hints (click here)

Using stat_summary() will help you add the median.


Solutions (click here)
  • theme(axis.title.x = element_text(hjust = 0.5)) makes the x-axis title center justified.
  • you can change shape within stat_summary() to be anything you like, either an R shape, a specific keyboard key, or even a pasted emoji. The default is a point.
  • when you set a theme(), anything that comes below will override what code comes previous, so for this reason, if you are going to amend a pre-made theme, first call the pre-made theme, and then make any changes you like below.
NHANES_over20_noNA %>%
  ggplot(aes(x = TotChol, y = AgeDecade, fill = AgeDecade)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  stat_summary(fun = median) +
  theme_ipsum_rc() +
  theme(axis.title.x = element_text(hjust = 0.5),
        axis.title.y = element_text(hjust = 0.5)) +
  labs(title = "Total Cholesterol by Age",
       subtitle = "Data from the National Health and Nutrition Examination Survey (NHANES)",
       x = "Total Cholesterol, mmol/L",
       y = "Age, by Decade",
       fill = "Age (years)")

#> Picking joint bandwidth of 0.224

#> Warning: Removed 6 rows containing missing values (geom_segment).



Bonus 4

Commonly used cutoffs for cholesterol are: < 5.2 mmol/L is normal, 5.2-6.2 mmol/L is borderline high and > 6.2 mmol is high. Add a vertical cutoff line showing the level below which cholesterol would be considered normal.

Hints (click here)

Using geom_vline() will let you add a vertical line with an xintercept that is appropriate.


Solutions (click here)
NHANES_over20_noNA %>%
  ggplot(aes(x = TotChol, y = AgeDecade, fill = AgeDecade)) +
  geom_density_ridges(alpha = 0.7, scale = 0.9) +
  stat_summary(fun = median) +
  geom_vline(aes(xintercept = 5.2)) +
  theme_ipsum_rc() +
  theme(axis.title.x = element_text(hjust = 0.5),
        axis.title.y = element_text(hjust = 0.5)) +
  labs(title = "Total Cholesterol by Age",
       subtitle = "Data from the National Health and Nutrition Examination Survey (NHANES)",
       caption = "Vertical line indicates upper limit of normal cholesterol",
       x = "Total Cholesterol, mmol/L",
       y = "Age, by Decade",
       fill = "Age (years)")

#> Picking joint bandwidth of 0.224

#> Warning: Removed 6 rows containing missing values (geom_segment).




Jessica Cooperstone
Jessica Cooperstone
Assistant Professor at HCS