Session 11: Spatial data visualization with ggplot2
Prep and setup
New to Code Club?
-
If you didn’t already do this, please follow the Code Club Computer Setup instructions, which also has pointers for if you’re new to R or RStudio.
-
If you’re able to do so, please open RStudio a bit before Code Club starts – and in case you run into issues, please join the Zoom call early and we’ll help you troubleshoot.
New to ggplot2?
Check out the two Code Club pages for Session 4 and Session 5.
If you’ve never used ggplot2
before (or even if you have), you may find this cheat sheet useful.
Download the CSV file on Ohio
ohio_csv <-
'https://raw.githubusercontent.com/biodash/biodash.github.io/master/content/codeclub/11_ggplot-maps/Ohio.csv'
download.file(url = ohio_csv, destfile = 'Ohio.csv')
Creating maps in R
Today, we will cover the visualization of spatial data in R using the layered grammar of graphics implementation of ggplot2 in conjunction with the contextual information of static maps from world maps in the maps package.
Before we look at mapping using ggplot2, let us define some terms.
Areal data
Areal data is data which corresponds to geographical extents with polygonal boundaries.
The layered grammar of graphics
By definition, the layered grammar demands that every plot consist of five components:
-
a default dataset with aesthetic mappings,
-
one or more layers, each with either a geometric object (“geom”), a statistical transformation (“stat”), etc.
-
a scale for each aesthetic mapping (which can be automatically generated),
-
a coordinate system, and
-
a facet specification.
Since ggplot2 is an implementation of the layered grammar of graphics, every plot made with ggplot2 has each of the above elements. Consequently, map plots also have these elements, but certain elements are fixed to map components: the x
aesthetic is fixed to longitude, the y
aesthetic is fixed to latitude.
Drawing a map
Drawing a map in R requires two things. First, we have to draw the map using data that directs R to draw the polygon shapes that constitute the map. Then we add information to our map to plot color and marks. It’s the same basic logic that we have used in ggplot figures. The key thing is to have datasets that link that geographic data with the information that we want to put on the plot.
The maps package in R
The “maps” package in R contains a set of maps of the United States and the world drawn using longitude and latitude data. With world map, the USA map with the individual states you can accomplish a lot of the mapping tasks using the maps package. The maps package contains a lot of outlines of continents, countries, states, and counties
Making dataframes from map outlines by ggplot2
Recall that ggplot2 operates on dataframes. Therefore, we need some way to translate the maps data into a data frame format the ggplot can use. The package ggplot2 provides the map_data()
function. The function turns a series of points along an outline into a data frame of those points. The package ggplot2 uses the following syntax: map_data("name")
where “name” is a quoted string of the name of a map in the maps package.
Our first maps
Let us start by drawing maps of the World, USA, states, Ohio, Ohio and Indiana, and part of Europe using the maps package.
library(tidyverse)
library(maps)
library(scales)
library(stringr)
# Let us get a world map using the "map_data" function
world <- map_data("world")
## Let us get a US map:
usa <- map_data("usa")
# Let us get the states:
states <- map_data("state")
# Select Ohio using the filter function:
ohio <- states %>%
filter(region == "ohio")
- Let us plot a world map:
ggplot(data = world,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black")
- Let us plot a map of the US:
ggplot(data = usa,
mapping = aes(x = long, y= lat, group = group)) +
geom_polygon(fill = "white", color = "black")
- Let us plot a map of the US with states:
ggplot(data = states,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "blue", color = "black")
- Let us plot a map of Ohio:
ggplot(data = ohio,
mapping=aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "green")
- We can also plot a map for an arbitrary selection of states:
# We can select data for two states, for example Ohio and Indiana:
ohio_indiana <- states %>%
filter(region == "ohio" | region == "indiana")
# Plot the map of Ohio and Indiana:
ggplot(data = ohio_indiana,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "green" , color = "red")
- We can also plot only a specific region by filtering by latitude and longitude:
world <- map_data("world")
a_region <- filter(world, long >- 10 & long < 15.1 & lat > 32 & lat < 55)
ggplot(data = a_region,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
coord_fixed(1.3)
The structure of the dataframe ohio
.
head(ohio)
#> long lat group order region subregion
#> 1 -80.51776 40.64563 42 10440 ohio <NA>
#> 2 -80.55787 40.63990 42 10441 ohio <NA>
#> 3 -80.62089 40.63417 42 10442 ohio <NA>
#> 4 -80.66100 40.61698 42 10443 ohio <NA>
#> 5 -80.66673 40.60552 42 10444 ohio <NA>
#> 6 -80.67245 40.58833 42 10445 ohio <NA>
Look at the variables in ohio, note what they refer to:
-
long
= longitude. Lines of longitude, or meridians, run between the North and South Poles and measure east-west positions. While prime meridian is assigned the value of 0 degrees, and runs through Greenwich (England), meridians to the west of the prime meridian are measured in degrees west (up to 180 degrees) and those to the east of the prime meridian are measured to in degrees east (up to 180 degrees). -
lat
= latitude. Lines of latitude measure north-south position between the poles with the equator defined as 0 degrees, the North Pole defined as 90 degrees north, and the South Pole defined as 90 degrees south. -
group
= an identifier that is unique for each subregion (here the counties). Group is very important! ggplot2’s functions can take a group argument which controls (amongst other things) whether adjacent points should be connected by lines. If they are in the same group, then they get connected, but if they are in different groups then they don’t. -
order
= an identifier that indicates the order in which the boundary lines should be drawn -
region
= string indicator for regions (here the states) -
subregion
= string indicator for sub-regions (here the county names)
Add information to the maps
The second part of mapping in R, is to add information on the map created in the first part.
In drawing the map, the “maps” package creates the backbone for visualizations. Then we add additional information to show colors and shapes.
We will:
- fill a map by region,
- draw a Bubble map using city population,
- make a point for every city,
- vary size of point by city size and vary the color of the dots, and
- add external data to the map.
- Let us fill by region and make sure the the lines of state borders are white:
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group),
color = "white") +
coord_fixed(1.3) +
guides(fill = FALSE) # Do this to omit the legend
- Let us draw a “Bubble map”:
# The maps package has city data
head(maps::world.cities)
#> name country.etc pop lat long capital
#> 1 'Abasan al-Jadidah Palestine 5629 31.31 34.34 0
#> 2 'Abasan al-Kabirah Palestine 18999 31.32 34.35 0
#> 3 'Abdul Hakim Pakistan 47788 30.55 72.11 0
#> 4 'Abdullah-as-Salam Kuwait 21817 29.36 47.98 0
#> 5 'Abud Palestine 2456 32.03 35.07 0
#> 6 'Abwein Palestine 3434 32.03 35.20 0
my_cities <- maps::world.cities
usa_cities <- filter(my_cities,country.etc == "USA")
head(usa_cities)
#> name country.etc pop lat long capital
#> 1 Abilene USA 113888 32.45 -99.74 0
#> 2 Akron USA 206634 41.08 -81.52 0
#> 3 Alameda USA 70069 37.77 -122.26 0
#> 4 Albany USA 45535 44.62 -123.09 0
#> 5 Albany USA 75510 31.58 -84.18 0
#> 6 Albany USA 93576 42.67 -73.80 0
- Make a point for every city:
ggplot(data = usa,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = usa_cities, color = "red",
aes(x = long, y = lat, group = NULL))
- Let’s pick just the big cities:
usa_big_cities <- filter(my_cities, country.etc == "USA" & pop > 500000)
ggplot(data = usa, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = usa_big_cities, color = "red",
aes(x = long, y = lat, group = NULL))
- Vary size of point by city size:
ggplot(data = usa, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = usa_big_cities, color = "red",
aes(x = long, y = lat, group = NULL, size = pop))
- Now vary the color of the dots:
usa_big_cities$qual <- sample(LETTERS[1:5], nrow(usa_big_cities),
replace = TRUE)
ggplot(data = usa, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = usa_big_cities,
aes(x = long, y = lat, group = NULL, color = qual, size = pop))
- Tweak the map:
# No scientific notation in legend r ggplot:
# scales package adds the "scale_size_continuous" function, and we can set label=comma
library(scales)
# Change the column name to make the legend nicer"
usa_big_cities$Population <- usa_big_cities$pop
usa_big_cities$Qualitative <- usa_big_cities$qual
# Do some additional refining:
ggplot(data = usa, mapping = aes(x = long, y= lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = usa_big_cities,
aes(x = long, y = lat, group = NULL,
color = Qualitative, size = Population)) +
scale_size_continuous(label = comma)
- Work with Ohio counties with external data:
# Get basic map data for all USA counties:
usa_counties = map_data("county")
# Subset to counties in Ohio:
oh = subset(usa_counties, region == "ohio")
head(oh)
#> long lat group order region subregion
#> 59960 -83.66902 39.02989 2012 59960 ohio adams
#> 59961 -83.56590 39.02989 2012 59961 ohio adams
#> 59962 -83.37109 39.06426 2012 59962 ohio adams
#> 59963 -83.30806 39.06426 2012 59963 ohio adams
#> 59964 -83.30233 39.05280 2012 59964 ohio adams
#> 59965 -83.25649 39.01842 2012 59965 ohio adams
# Plot ohio counties
ggplot() +
geom_polygon(data = oh,
aes(x = long, y = lat, group = group, fill = subregion),
color = "black", alpha = 0.3) +
coord_fixed(1.3) +
guides(fill = FALSE) +
ggtitle("Ohio counties")
- Read population data for Ohio counties:
# The data of the estimated population of each county in 2021 and percent change from 2010
Ohio <- read_csv("Ohio.csv")
head(Ohio)
#> # A tibble: 6 x 6
#> county Pop Perc Town x_1 y_1
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 Adams 27706 -2.91 West Union -83.5 38.8
#> 2 Allen 101603 -4.47 Lima -84.1 40.7
#> 3 Ashland 53040 -0.53 Ashland -82.3 40.9
#> 4 Ashtabula 96549 -4.79 Jefferson -80.8 41.7
#> 5 Athens 64943 -0.35 Athens -82.1 39.3
#> 6 Auglaize 45496 -0.88 Wapakoneta -84.2 40.6
- Prepare the data for plotting:
# Create a new column called "county" so that counties start with capital letters
# using str_to_title function
oh$county <- str_to_title(oh$subregion)
# Merge population data with counties data by county variable using inner_join
# function, and named the new object "ohio_pop"
ohio_pop <- inner_join(oh, Ohio, by = "county")
# Select counties with population greater than 100000
ohio_big_pop <- filter(ohio_pop, Pop > 100000)
- Create the plot where we vary point size by population size:
### vary size of point by population size
ggplot(data = ohio_pop, mapping = aes(x = long, y = lat, group = group))+
geom_polygon(color = "black", fill = "white")+
geom_point(data = ohio_big_pop, color = "red",
aes(x = x_1, y = y_1, group = NULL, size = Pop)) +
coord_fixed(1.3) +
guides(size = FALSE) # Omit the legend
- Improve the graph by creating groups of population using quantile.
ApplyQuintiles <- function(x) {
cut(x, breaks = c(quantile(ohio_pop$Pop, probs = seq(0, 1, by = 0.2))),
labels = c("0-20", "20-40", "40-60", "60-80", "80-100"),
include.lowest = TRUE)
}
ohio_pop$grouped_pop <- sapply(ohio_pop$Pop, ApplyQuintiles)
head(ohio_pop)
#> long lat group order region subregion county Pop Perc Town
#> 1 -83.66902 39.02989 2012 59960 ohio adams Adams 27706 -2.91 West Union
#> 2 -83.56590 39.02989 2012 59961 ohio adams Adams 27706 -2.91 West Union
#> 3 -83.37109 39.06426 2012 59962 ohio adams Adams 27706 -2.91 West Union
#> 4 -83.30806 39.06426 2012 59963 ohio adams Adams 27706 -2.91 West Union
#> 5 -83.30233 39.05280 2012 59964 ohio adams Adams 27706 -2.91 West Union
#> 6 -83.25649 39.01842 2012 59965 ohio adams Adams 27706 -2.91 West Union
#> x_1 y_1 grouped_pop
#> 1 -83.54616 38.79483 0-20
#> 2 -83.54616 38.79483 0-20
#> 3 -83.54616 38.79483 0-20
#> 4 -83.54616 38.79483 0-20
#> 5 -83.54616 38.79483 0-20
#> 6 -83.54616 38.79483 0-20
- Plot the map:
ggplot() +
geom_polygon(data = ohio_pop,
aes(x = long, y = lat, group = group, fill = grouped_pop),
color = "black") +
coord_fixed(1.3) +
scale_fill_brewer(palette = "Reds", direction = 1) +
labs(fill = "Population Quantiles")
Breakout rooms
Exercise 1
Use the dataset of 2021 Ohio county’s population to plot counties with % positive population growth.
Solution (click here)
# Get basic map data for all USA counties:
usa_counties = map_data("county")
# Subset to counties in Ohio:
oh = subset(usa_counties, region == "ohio")
# Read population data:
Ohio <- read_csv("Ohio.csv")
# Create a new column called "county" so that counties start with capital
# letters using str_to_title function
oh$county = str_to_title(oh$subregion)
# Merge counties with population:
ohio_pop<-inner_join(oh, Ohio, by = "county")
# Select counties with % positive population growth:
ohio_pos_pop <- filter(ohio_pop, Perc > 0)
ggplot(data = ohio_pop,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = ohio_pos_pop, color = "red",
aes(x = x_1, y = y_1, group = NULL, size = Perc)) +
coord_fixed(1.3) +
guides(size = FALSE) # Omit the legend
Exercise 2
Use the same data to plot counties with % negative population growth with quantile of 0-20, 20-40, 40-60, 60-80, and 80-100.
Solution (click here)
ohio_neg_pop <- filter(ohio_pop, Perc<0)
ggplot(data = ohio_pop, mapping = aes(x= long, y= lat, group = group))+
geom_polygon(color="black",fill="white")+
geom_point(data = ohio_neg_pop, color = "red",
aes(x = x_1,y = y_1, group = NULL, size = Perc)) +
coord_fixed(1.3) +
guides(size = FALSE) # Omit the legend
ApplyQuintiles_n <- function(x) {
cut(x, breaks = c(quantile(ohio_neg_pop$Perc, probs = seq(0, 1, by = 0.2))),
labels = c("0-20", "20-40", "40-60", "60-80", "80-100"),
include.lowest = TRUE)
}
ohio_neg_pop$grouped_pop <- sapply(ohio_neg_pop$Perc, ApplyQuintiles_n)
# Plot the map
ggplot() +
geom_polygon(data = ohio_neg_pop,
aes(x = long, y = lat, group = group, fill = grouped_pop),
color = "black") +
coord_fixed(1.3) +
scale_fill_brewer(palette = "Set1", direction = -1) +
labs(fill = "Negative population growth counties")
Bonus exercise
Plot the cities of France with population greater than 100,000. Vary size of point by city size, and vary the color of the dots.
Solution (click here)
world <- map_data("world")
france <- filter(world,region == "France")
ggplot(data = france,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
labs(fill = "France")
# The "maps" package has city data
head(maps::world.cities)
#> name country.etc pop lat long capital
#> 1 'Abasan al-Jadidah Palestine 5629 31.31 34.34 0
#> 2 'Abasan al-Kabirah Palestine 18999 31.32 34.35 0
#> 3 'Abdul Hakim Pakistan 47788 30.55 72.11 0
#> 4 'Abdullah-as-Salam Kuwait 21817 29.36 47.98 0
#> 5 'Abud Palestine 2456 32.03 35.07 0
#> 6 'Abwein Palestine 3434 32.03 35.20 0
my_cities <-maps::world.cities
france_cities <- filter(my_cities, country.etc == "France")
head(france_cities)
#> name country.etc pop lat long capital
#> 1 Abbeville France 26656 50.12 1.83 0
#> 2 Acheres France 23219 48.97 2.06 0
#> 3 Agde France 23477 43.33 3.46 0
#> 4 Agen France 34742 44.20 0.62 0
#> 5 Aire-sur-la-Lys France 10470 50.64 2.39 0
#> 6 Aix-en-Provence France 148622 43.53 5.44 0
# Make a point for every city:
ggplot(data = france, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = france_cities, color = "red",
aes(x = long, y = lat, group = NULL))
# Let's pick just the big cities:
france_big_cities <- filter(my_cities,country.etc == "France" & pop > 100000)
ggplot(data = france,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = france_big_cities, color = "red",
aes(x = long, y = lat, group = NULL))
# vary size of point by city size
ggplot(data = france, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = france_big_cities, color = "red",
aes(x = long, y = lat, group = NULL, size = pop))
# Now vary the color of the dots
france_big_cities$qual <- sample(LETTERS[1:5],
nrow(france_big_cities),
replace = TRUE)
ggplot(data = france,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black",fill = "white") +
geom_point(data = france_big_cities,
aes(x = long, y = lat, group = NULL, color = qual, size = pop))
# Do some tweaking:
# no scientific notation in legend r ggplot
# scales package adds the "scale_size_continuous" function to our arsenal, and we can set label=comma
library(scales)
# Change the column name to make the legend nicer:
france_big_cities$Population <- france_big_cities$pop
france_big_cities$Qualitative <- france_big_cities$qual
# Do some additional refining:
ggplot(data = france,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(color = "black", fill = "white") +
geom_point(data = france_big_cities,
aes(x = long, y = lat, group = NULL,
color = Qualitative, size = Population)) +
scale_size_continuous(label = comma)