Session 19: Word Clouds via Tidytext





New To Code Club?

  • First, check out the Code Club Computer Setup instructions, which also have some pointers that might be helpful if you’re new to R or RStudio.

  • Please open RStudio before Code Club to test things out – if you run into issues, join the Zoom call early and we’ll troubleshoot.


Session Goals

  • Learn the fundamentals of text mining.
  • Learn how to do text mining in a tidyverse setting.
  • Reuse some of our dplyr and ggplot skills on text.
  • Learn how to very simply create word cloud visualizations.

Setup

This is another session in our current series on text processing. We’ll be using the following previously used packages, which you should load first (install them if you haven’t already):
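
# Previously used packages: the tidyverse (dplyr, ggplot2, stringr, tibble)
# and the bakeoff data package, which provides the bakes data set used below.
# install.packages("tidyverse")   # uncomment to install
# (install bakeoff as in the earlier bakeoff session if you don't have it yet)

library(tidyverse)
library(bakeoff)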

We’ll also be using the following packages, which you should install and load:

# Uncomment the following line to install:
# install.packages(c("tidytext", "gutenbergr", "wordcloud"))
                 
library(tidytext)
library(gutenbergr)
library(wordcloud)

Introduction

In this Code Club session we’ll see how to create word clouds (also known as tag clouds) from text, using the tidytext and wordcloud packages. A word cloud is a visualization of word frequencies, graphically highlighting the most common words.

We need to get some text from somewhere, so first let’s do it in the simplest possible way. Here we manually enter a quote, line by line, as a vector of five character strings. This is the first stanza from Robert Lowell’s Skunk Hour:

lowell <- c("Nautilus Island's hermit",
          "heiress still lives through winter in her Spartan cottage;",
          "her sheep still graze above the sea.",
          "Her son's a bishop. Her farmer is first selectman in our village;",
          "she's in her dotage.")

In textual analysis we distinguish between word types and word tokens (the individual occurrences of those types in a text). For example, there are two tokens of the word type “still” in this stanza:

heiress still lives through winter
her sheep still graze above the sea

And, slightly more abstractly, there are five tokens of “her”, modulo capitalization:

her Spartan cottage
her sheep still graze
Her son’s a bishop.
Her farmer
her dotage
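
We can spot-check these counts directly with stringr’s str_count(), using a word-boundary regex so that “her” doesn’t also match “hermit” (a quick sketch):

sum(str_count(str_to_lower(lowell), "\\bstill\\b"))
#> [1] 2

sum(str_count(str_to_lower(lowell), "\\bher\\b"))
#> [1] 5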

Formally, it’s the token frequency of each word type that we are ultimately interested in capturing. So there are two tasks: extract the word tokens, then count them. Done!

The reason this is tricky is that natural language text is messy: the task of extracting a clean set of tokens to count is termed text mining or tokenization. We would also like to get the output into a tidyverse compliant data frame, so we can use familiar dplyr and ggplot functions to analyze it.

We could imagine attacking this using stringr functions:

lowell_tokens <- lowell %>% 
  # convert upper to lower case; returns a character vector
  str_to_lower() %>%
  # extract runs of letters, implicitly dropping punctuation and spaces; returns a list
  str_extract_all("[a-z]+") %>% 
  # flatten that list into a single character vector
  unlist() %>%
  # stick it in a data frame
  as_tibble()

print(lowell_tokens, n = 38)

#> # A tibble: 38 x 1
#>    value    
#>    <chr>    
#>  1 nautilus 
#>  2 island   
#>  3 s        
#>  4 hermit   
#>  5 heiress  
#>  6 still    
#>  7 lives    
#>  8 through  
#>  9 winter   
#> 10 in       
#> 11 her      
#> 12 spartan  
#> 13 cottage  
#> 14 her      
#> 15 sheep    
#> 16 still    
#> 17 graze    
#> 18 above    
#> 19 the      
#> 20 sea      
#> 21 her      
#> 22 son      
#> 23 s        
#> 24 a        
#> 25 bishop   
#> 26 her      
#> 27 farmer   
#> 28 is       
#> 29 first    
#> 30 selectman
#> 31 in       
#> 32 our      
#> 33 village  
#> 34 she      
#> 35 s        
#> 36 in       
#> 37 her      
#> 38 dotage

This is a good start: it gets rid of the capitalization issue, and also gets rid of the punctuation. But there’s a problem. The regular expression pattern [a-z]+ doesn’t recognize possessives or contractions: it simply splits on anything that isn’t a letter, so it mangles Island's, son's, and she's. Welcome to the subtleties of processing natural language text algorithmically! Exceptions, exceptions!!
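
We could patch the regex by allowing apostrophes inside the character class, which keeps the contractions intact (a quick sketch; it would still trip over things like single quotation marks):

lowell %>%
  str_to_lower() %>%
  str_extract_all("[a-z']+") %>%
  unlist() %>%
  head(4)

#> [1] "nautilus" "island's" "hermit"   "heiress"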

We could keep fiddling with our regex, but… there’s a package for that! This kind of text mining is exactly what the tidytext package was built for. It will simultaneously strip punctuation intelligently and ‘unnest’ lines into word tokens.

tidytext functions need a data frame to operate on, so first we need to get the poem into one; here we’ll use the column name text.

lowell_df <- tibble(text = lowell)

lowell_df

#> # A tibble: 5 x 1
#>   text                                                             
#>   <chr>                                                            
#> 1 Nautilus Island's hermit                                         
#> 2 heiress still lives through winter in her Spartan cottage;       
#> 3 her sheep still graze above the sea.                             
#> 4 Her son's a bishop. Her farmer is first selectman in our village;
#> 5 she's in her dotage.

Each string in the character vector becomes a single row in the data frame.

Again, to ‘tidy’ our data we want one word token per row. This is what tidytext::unnest_tokens() does. We’re going to unnest words in this case (we can also unnest other units, like characters, sentences, regex matches, even tweets), and we need to specify the variable in the data frame we are unnesting (here just text). This will create a new word-token data frame, and we’ll name its variable word. That name is important (see the section on stop words later).

lowell_tidy <- lowell_df %>%
    unnest_tokens(word, text)

print(lowell_tidy, n = 35)

#> # A tibble: 35 x 1
#>    word     
#>    <chr>    
#>  1 nautilus 
#>  2 island's 
#>  3 hermit   
#>  4 heiress  
#>  5 still    
#>  6 lives    
#>  7 through  
#>  8 winter   
#>  9 in       
#> 10 her      
#> 11 spartan  
#> 12 cottage  
#> 13 her      
#> 14 sheep    
#> 15 still    
#> 16 graze    
#> 17 above    
#> 18 the      
#> 19 sea      
#> 20 her      
#> 21 son's    
#> 22 a        
#> 23 bishop   
#> 24 her      
#> 25 farmer   
#> 26 is       
#> 27 first    
#> 28 selectman
#> 29 in       
#> 30 our      
#> 31 village  
#> 32 she's    
#> 33 in       
#> 34 her      
#> 35 dotage

Punctuation has been stripped and all words are lower case, but possessives and contractions are preserved (fancy usage of str_ regular expression functions under the hood…).
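
As an aside, unnest_tokens() can split into units other than words, as mentioned above. For example, we could unnest by sentence instead (a quick sketch using the token = "sentences" option):

lowell_df %>%
    unnest_tokens(sentence, text, token = "sentences")
# line 4, which contains two sentences, should now be split across two rows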

Bakeoff!

Now that we have the basic idea, let’s look at a more interesting data set, from the bakeoff package.

First we’ll create a data frame with just the signature column from the bakes data set:

signature_df <- select(bakes, signature)

signature_df

#> # A tibble: 548 x 1
#>    signature                                                                    
#>    <chr>                                                                        
#>  1 "Light Jamaican Black Cakewith Strawberries and Cream"                       
#>  2 "Chocolate Orange Cake"                                                      
#>  3 "Caramel Cinnamon and Banana Cake"                                           
#>  4 "Fresh Mango and Passion Fruit Hummingbird Cake"                             
#>  5 "Carrot Cake with Lime and Cream Cheese Icing"                               
#>  6 "Cranberry and Pistachio Cakewith Orange Flower Water Icing"                 
#>  7 "Carrot and Orange Cake"                                                     
#>  8 "Sticky Marmalade Tea Loaf"                                                  
#>  9 "Triple Layered Brownie Meringue Cake\nwith Raspberry Cream"                 
#> 10 "Three Tiered Lemon Drizzle Cakewith Fresh Cream and freshly made Lemon Curd"
#> # … with 538 more rows

Next we tokenize by word on the signature column:

signature_tidy <- signature_df %>%
    unnest_tokens(word, signature)

signature_tidy

#> # A tibble: 2,762 x 1
#>    word        
#>    <chr>       
#>  1 light       
#>  2 jamaican    
#>  3 black       
#>  4 cakewith    
#>  5 strawberries
#>  6 and         
#>  7 cream       
#>  8 chocolate   
#>  9 orange      
#> 10 cake        
#> # … with 2,752 more rows

Now we want to count those tokens: i.e. we want to collapse all duplicate word tokens into a single word type, with the corresponding frequency. Since we now have tidy data, dplyr to the rescue!

dplyr’s count() lets you quickly count the unique values of one or more variables. The sort = TRUE option puts the largest groups at the top.

signature_count <- signature_tidy %>% 
    count(word, sort = TRUE)

signature_count

#> # A tibble: 806 x 2
#>    word          n
#>    <chr>     <int>
#>  1 and         321
#>  2 cake         66
#>  3 chocolate    61
#>  4 orange       42
#>  5 with         42
#>  6 pie          37
#>  7 apple        34
#>  8 ginger       30
#>  9 lemon        29
#> 10 biscuits     26
#> # … with 796 more rows

We’re way more interested in cake than and: this is an example of a stop word:

In computing, stop words are words which are filtered out before or after processing of natural language data (text). “stop words” usually refers to the most common words in a language.

One of our major performance (search) optimizations… is removing the top 10,000 most common English dictionary words (as determined by Google search). It’s shocking how little is left of most posts once you remove the top 10k English dictionary words…

The tidytext package has a database of just over a thousand of these words, including ‘and’:

print(stop_words, n = 30)

#> # A tibble: 1,149 x 2
#>    word        lexicon
#>    <chr>       <chr>  
#>  1 a           SMART  
#>  2 a's         SMART  
#>  3 able        SMART  
#>  4 about       SMART  
#>  5 above       SMART  
#>  6 according   SMART  
#>  7 accordingly SMART  
#>  8 across      SMART  
#>  9 actually    SMART  
#> 10 after       SMART  
#> 11 afterwards  SMART  
#> 12 again       SMART  
#> 13 against     SMART  
#> 14 ain't       SMART  
#> 15 all         SMART  
#> 16 allow       SMART  
#> 17 allows      SMART  
#> 18 almost      SMART  
#> 19 alone       SMART  
#> 20 along       SMART  
#> 21 already     SMART  
#> 22 also        SMART  
#> 23 although    SMART  
#> 24 always      SMART  
#> 25 am          SMART  
#> 26 among       SMART  
#> 27 amongst     SMART  
#> 28 an          SMART  
#> 29 and         SMART  
#> 30 another     SMART  
#> # … with 1,119 more rows

Note that the name of the stop-word column is word, and the name we chose for our tokenized column is also word (now you see why we used that name), so we can use dplyr’s anti_join() to filter the stop words out of the word tokens!

anti_join() returns all rows from x without a match in y (where x is the word-token data frame and y is the stop-word data frame).

signature_count <- signature_tidy %>% 
    count(word, sort = TRUE) %>% 
    anti_join(stop_words)

#> Joining, by = "word"


signature_count

#> # A tibble: 762 x 2
#>    word          n
#>    <chr>     <int>
#>  1 cake         66
#>  2 chocolate    61
#>  3 orange       42
#>  4 pie          37
#>  5 apple        34
#>  6 ginger       30
#>  7 lemon        29
#>  8 biscuits     26
#>  9 loaf         22
#> 10 walnut       22
#> # … with 752 more rows

Since we are in the tidyverse, we can pipe our results into ggplot. First we filter on counts above a certain threshold (here 12, just for visualization purposes):

signature_count %>%
    filter(n > 12) %>%
    ggplot(aes(n, word)) +
    geom_col() +
    theme_minimal() +
    labs(y = NULL)

This is ordered alphabetically by default, bottom to top; but we can reorder by count (n) using reorder() inside a dplyr mutate():

signature_count %>%
    filter(n > 12) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(n, word)) +
    geom_col() +
    theme_minimal() +
    labs(y = NULL)

We now have everything we need for a word cloud: word types and their token frequencies:

The only obligatory arguments to wordcloud() are the first two: the rest just let you tweak the graphic:

wordcloud(words = signature_count$word, 
          freq = signature_count$n, 
          min.freq = 12, 
          random.order=FALSE, 
          rot.per=0.3, 
          colors=brewer.pal(8, "Dark2"))

min.freq lets you filter on a frequency threshold; random.order = FALSE plots words in decreasing frequency (the most frequent words are most central); rot.per is the proportion of words rotated 90 degrees; and colors = brewer.pal(8, "Dark2") lets you pick an RColorBrewer color palette of your choice.

Lemmatization

If you create a count data frame of signature_tidy without the sort = TRUE option, the words are sorted alphabetically. If you look through that table you will see many pairs such as apple, apples; apricot, apricots; cake, cakes; etc. Arguably, these are the same word type (think “dictionary word”), just grammatical variants. Properly collapsing these into a single type is called lemmatization: a very difficult problem which would take us far afield into the morphology of words. In general there are many exceptions, only partly due to English having borrowed so many words from other languages: besides apple, apples there are mouse, mice; self, selves; bacillus, bacilli; basis, bases; etc. These are known as irregular plurals.
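
A rougher, rule-based cousin of lemmatization is stemming, which chops off endings rather than looking words up in a dictionary. If you want to experiment, the SnowballC package provides the classic Porter stemmer (a minimal sketch, assuming SnowballC is installed):

# install.packages("SnowballC")
library(SnowballC)

wordStem(c("apple", "apples", "cake", "cakes", "mice"))
#> [1] "appl" "appl" "cake" "cake" "mice"

Regular plurals collapse nicely (if not always onto a pretty stem), but irregular plurals like mice are left untouched.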

Verbs are worse! Perhaps you would also consider the inflectional forms run, runs, ran, running to be the same type, just as a dictionary does. How do you reduce those algorithmically? And if you treat inflectional forms as the same dictionary word, how would you tackle Ancient Greek, which has hundreds of inflected forms for the same verb, pages and pages of them…

Currently machine learning has been unleashed on this problem, with limited success. The traditional computational linguists' algorithms are still winning…

The gutenbergr package

Say we wanted to do a word cloud for a more substantive text like Darwin’s Origin of Species.

Project Gutenberg is a volunteer effort to digitize and archive cultural works and is the oldest digital library. It has over 60,000 books in the public domain (including Darwin’s works).

The gutenbergr package allows you to download any of these works directly into a data frame using just the Project Gutenberg ID. This is then perfect input for tidytext. The package provides all the metadata to search for author and work IDs inside R (you can also just find the ID by searching on the Project Gutenberg website):

darwins_works <- gutenberg_metadata %>%
    filter(author == "Darwin, Charles")

darwins_works

#> # A tibble: 40 x 8
#>    gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
#>           <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
#>  1          944 "The… Darwi…              485 en       Travel/Harvard … Publi…
#>  2         1227 "The… Darwi…              485 en       NA               Publi…
#>  3         1228 "On … Darwi…              485 en       Harvard Classic… Publi…
#>  4         2009 "The… Darwi…              485 en       Harvard Classic… Publi…
#>  5         2010 "The… Darwi…              485 en       NA               Publi…
#>  6         2087 "Lif… Darwi…              485 en       NA               Publi…
#>  7         2088 "Lif… Darwi…              485 en       NA               Publi…
#>  8         2300 "The… Darwi…              485 en       NA               Publi…
#>  9         2355 "The… Darwi…              485 en       NA               Publi…
#> 10         2485 "The… Darwi…              485 en       Botany           Publi…
#> # … with 30 more rows, and 1 more variable: has_text <lgl>

Inspecting the search results for Origin of Species on the website reveals that the latest (sixth) edition has ID 2009. Let’s grab it:

OoS <- gutenberg_download(2009)

#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest

#> Using mirror http://aleph.gutenberg.org

In the breakout rooms, we’ll work through inspecting the frequencies and creating a word cloud for this text.

The gutenbergr package is extremely useful, but as long as you can read a document into R, you can then convert it to a data frame as we did in the very first example above, and then the tidytext pipeline will work. The readtext package can import a variety of formats, including PDFs and Microsoft Word documents.
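
For example, here is a sketch of that route using readtext (the file name is hypothetical):

# install.packages("readtext")
library(readtext)

# "my_document.pdf" stands in for whatever file you want to analyze
my_doc_df <- readtext("my_document.pdf") %>%
    as_tibble()

my_doc_tidy <- my_doc_df %>%
    unnest_tokens(word, text)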

Breakout rooms

Exercise 1

Run the command:

OoS <- gutenberg_download(2009)

and inspect the data frame. Identify the name of the column you want to tokenize.

Then use the unnest_tokens() command to create a data frame of word tokens.

Hints (click here)
It's the text column you want. gutenbergr includes a gutenberg_id column in case you download multiple texts into the same data frame. Remember to name the column in the new data frame word so we can filter out stop words later on.

Solution (click here)
OoS <- gutenberg_download(2009)

OoS

#> # A tibble: 21,556 x 2
#>    gutenberg_id text                                                            
#>           <int> <chr>                                                           
#>  1         2009 "1228    1859, First Edition"                                   
#>  2         2009 "22764   1860, Second Edition"                                  
#>  3         2009 "2009    1872, Sixth Edition, considered the definitive edition…
#>  4         2009 ""                                                              
#>  5         2009 ""                                                              
#>  6         2009 ""                                                              
#>  7         2009 ""                                                              
#>  8         2009 "On the Origin of Species"                                      
#>  9         2009 ""                                                              
#> 10         2009 "BY MEANS OF NATURAL SELECTION,"                                
#> # … with 21,546 more rows
OoS_tidy <- OoS %>%
    unnest_tokens(word, text)
    
OoS_tidy

#> # A tibble: 209,048 x 2
#>    gutenberg_id word   
#>           <int> <chr>  
#>  1         2009 1228   
#>  2         2009 1859   
#>  3         2009 first  
#>  4         2009 edition
#>  5         2009 22764  
#>  6         2009 1860   
#>  7         2009 second 
#>  8         2009 edition
#>  9         2009 2009   
#> 10         2009 1872   
#> # … with 209,038 more rows


Exercise 2

Count and sort the tokens into a new data frame. Inspect the output. Are there any stop words?

Hints (click here)
Pipe the data frame into the dplyr count() function, counting the word column, with the sort = TRUE option.

Solution (click here)
OoS_count <- OoS_tidy %>% 
    count(word, sort = TRUE)

OoS_count

#> # A tibble: 9,233 x 2
#>    word      n
#>    <chr> <int>
#>  1 the   14570
#>  2 of    10438
#>  3 and    5853
#>  4 in     5414
#>  5 to     4753
#>  6 a      3368
#>  7 that   2749
#>  8 as     2230
#>  9 have   2114
#> 10 be     2099
#> # … with 9,223 more rows


Exercise 3

Remove the stop words from the output and inspect the results.

Hints (click here)
Use anti_join() with the tidytext stop_words data frame.

Solution (click here)
OoS_count <- OoS_tidy %>%
    count(word, sort = TRUE) %>% 
    anti_join(stop_words)

#> Joining, by = "word"


OoS_count

#> # A tibble: 8,678 x 2
#>    word          n
#>    <chr>     <int>
#>  1 species    1921
#>  2 forms       565
#>  3 selection   561
#>  4 natural     535
#>  5 varieties   486
#>  6 plants      471
#>  7 animals     436
#>  8 distinct    357
#>  9 life        350
#> 10 nature      325
#> # … with 8,668 more rows


Exercise 4

Visualize the counts using ggplot(), from highest frequency to lowest, using a frequency cutoff of 200. Does any one word stand out in any way?

Does the tidytext package perform lemmatization? Are there any irregular plurals in this result?

Hints (click here)
Use a dplyr filter() command on the n column and, well, just look at the examples in the presentation for the details of piping it into ggplot()!

Solution (click here)
OoS_count %>%
    filter(n > 200) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(n, word)) +
    geom_col() +
    theme_minimal() +
    labs(y = NULL)


tidytext does not lemmatize. There are many plurals in this list, so undoubtedly there are corresponding singulars of lower frequency; indeed we see both forms and form. And genera is, of course, an irregular plural (of genus).


Exercise 5

Create a word cloud of this data frame, with the same frequency cutoff as the ggplot() (200). Otherwise use the same settings as in the presentation. Then tweak those settings, especially the frequency threshold and rotation proportion, and see what happens when you set random.order = TRUE.

Hints (click here)
The option for the frequency threshold is min.freq = 200.

Solution (click here)
wordcloud(words = OoS_count$word, 
          freq = OoS_count$n, 
          min.freq = 200, 
          random.order=FALSE, 
          rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))




Michael Broe
Bioinformatician at EEOB