The tidy text format for analyzing texts

Keywords: #textmining

When we want to analyze text data, it’s highly recommended to convert the data set to a Tidy Data Structure, which has three characteristics:

  • Each variable is a column
  • Each observation (token) is a row
  • Each type of observational unit is a table

The name of this process is tokenization, and there are different approaches. To tokenize our text data, we can use the function unnest_tokens() from the package tidytext. This function has the following parameters:

  • \(tbl\) is the data.frame or tibble where we have our text data.
  • \(output\) is the name of the new variable where we will add the tokens.
  • \(input\) is the name of the variable or column of our tibble where we have the text data.
  • \(token\) is the unit for tokenizing. The most common are ‘words’, ‘ngrams’, and ‘sentences’.

Let’s see some examples with different levels of tokenization.

Firstly, we need to load the data set employees_opinion that we can find in the package HRdatsets. This data set contains a sample with 149 positive and negative employee opinions.

library(tidyverse)
library(tidytext)

devtools::install_github("vicencfernandez/HRdatasets")
library(HRdatasets)
employees_opinion
## # A tibble: 149 x 3
##    commentID comment                                                  assessment
##        <int> <chr>                                                    <fct>     
##  1         1 In my 30-year career, I’ve more never been proud and ho… positive  
##  2         2 They will have to burn the building down before I will … positive  
##  3         3 I’m surrounded by people who want to work and who love … positive  
##  4         4 This is the second love of my life.                      positive  
##  5         5 I have been an employee here for 45 years and will stay… positive  
##  6         6 For personal reasons I have been forced to seek employm… positive  
##  7         7 Happy employees don't go looking for other opportunities positive  
##  8         8 These folks walk the walk. Seriously. Truly a company o… positive  
##  9         9 Having this job has changed my life                      positive  
## 10        10 I’ve been working for this company for only two years, … positive  
## # … with 139 more rows

Now, let’s see how to tokenize by words, which means one word by row.

employees_opinion %>% unnest_tokens(output = word, input = comment, token="words")
## # A tibble: 2,583 x 3
##    commentID assessment word  
##        <int> <fct>      <chr> 
##  1         1 positive   in    
##  2         1 positive   my    
##  3         1 positive   30    
##  4         1 positive   year  
##  5         1 positive   career
##  6         1 positive   i’ve  
##  7         1 positive   more  
##  8         1 positive   never 
##  9         1 positive   been  
## 10         1 positive   proud 
## # … with 2,573 more rows

We can also tokenize by groups of two words (called bi-grams) or three words (called tri-grams). Let’s see these cases.

employees_opinion %>%
  unnest_tokens(bigram, comment, "ngrams", n = 2)
## # A tibble: 2,434 x 3
##    commentID assessment bigram     
##        <int> <fct>      <chr>      
##  1         1 positive   in my      
##  2         1 positive   my 30      
##  3         1 positive   30 year    
##  4         1 positive   year career
##  5         1 positive   career i’ve
##  6         1 positive   i’ve more  
##  7         1 positive   more never 
##  8         1 positive   never been 
##  9         1 positive   been proud 
## 10         1 positive   proud and  
## # … with 2,424 more rows
employees_opinion %>%
  unnest_tokens(trigram, comment, "ngrams", n = 3)
## # A tibble: 2,285 x 3
##    commentID assessment trigram          
##        <int> <fct>      <chr>            
##  1         1 positive   in my 30         
##  2         1 positive   my 30 year       
##  3         1 positive   30 year career   
##  4         1 positive   year career i’ve 
##  5         1 positive   career i’ve more 
##  6         1 positive   i’ve more never  
##  7         1 positive   more never been  
##  8         1 positive   never been proud 
##  9         1 positive   been proud and   
## 10         1 positive   proud and honored
## # … with 2,275 more rows

Finally, we can tokenize our text data by sentences (i.e., separated by points). You can see that the first and second opinions have just one sentence. But the third and sixth opinions have several sentences.

employees_opinion %>%
  unnest_tokens(sentence, comment, "sentences")
## # A tibble: 222 x 3
##    commentID assessment sentence                                                
##        <int> <fct>      <chr>                                                   
##  1         1 positive   in my 30-year career, i’ve more never been proud and ho…
##  2         2 positive   they will have to burn the building down before i will …
##  3         3 positive   i’m surrounded by people who want to work and who love …
##  4         3 positive   the energy that comes from that is like magic in a bott…
##  5         4 positive   this is the second love of my life.                     
##  6         5 positive   i have been an employee here for 45 years and will stay…
##  7         6 positive   for personal reasons i have been forced to seek employm…
##  8         6 positive   this makes me incredibly sad, as it really is the best …
##  9         6 positive   after many interviews, i can already see that other com…
## 10         6 positive   even my management, when i told them that i needed to s…
## # … with 212 more rows

Depending on the tokenization level, we can focus on the value and meaning of the words or the words’ context. In the following posts, we will present how to keep working with these tokens.