01_textmining.qmd

---
title: "Sentiment Analysis"

author: 
  - name: John R Little
    affiliations:
      - name: Duke University
      - department: Center for Data & Vizualization Sciences

# date: 'today'
date-modified: 'today'
date-format: long

format: 
  html:
    embed-resources: true
    footer: "[John R Little](https://JohnLittle.info) ● [Center for Data & Visualization Sciences](https://library.duke.edu/data/) ● [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)"
    logo:  images/Rfun_logo.png
    license: CC BY    
    toc: true
    toc_float: true
    df-print: paged
---

Find this repository: https://github.com/libjohn/workshop_textmining

Much of this review comes from the site: https://juliasilge.github.io/tidytext/

The primary library package `tidytext` enables all kinds of text mining. See Also this helpful free online book: [Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/) by Silge and Robinson

```{r}
library(janeaustenr)
library(tidyverse)
library(tidytext)
library(wordcloud2)
library(textdata)
```

```{r}
#| echo: false
htmltools::img(src = knitr::image_uri(here::here("images", "Rfun_logo.png")),
alt = 'Rfun',
style = 'position:absolute; bottom:15px; left:0; padding:5px; border:0px;')

htmltools::img(src = knitr::image_uri(here::here("images", "CDVS-logo_sm_Spring2020.png")),
alt = 'Rfun',
style = 'position:absolute; bottom:0; right:0; padding:5px; border:0px;')
```

## Data

We'll look at some books by [Jane Austen](https://en.wikipedia.org/wiki/Jane_Austen), an 18th century novelist. Austen explored women and marriage within the British upper class. The novelist has a unique and well earned following within literature. Her works is consistently discussed and honored. To this day, Austen's novels are the source of many adaptations, written and on-screen. Through the `janeaustenr` package we can access and mine the text of six Austen novels. We can call the collection of novels a corpra. An individual novel is a corpus.

```{r}
austen_books()
```

Austen is best know for six published works:

```{r}
austen_books() %>% 
  distinct(book)
```

## Data Cleaning

Text mining typically requires a lot of data cleaning. In this case, we start with the `janeaustenr` collection that has already been cleaned. Nonetheless, further data wrangling is required. First, identifying a line number for each line of text in each book.

## Identify line numbers

```{r}
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%         # identify line numbers
  ungroup()

original_books
```

## Tokens

To work with these data as a **tidy** dataset, we need to restructure the data through *tokenization*. In our case a token is a single word. We want **one-token-per-row**. The `unnest_tokens()` function (tidytext package) will convert a data frame with a text column into the one-token-per-row format.

**Token**\
**Tokenization**\
[defined](https://www.techopedia.com/definition/13698/tokenization)

The default tokenizing mode is "words". With the `unnest_tokens()` function, tokens can be: **words**, characters, character_shingles, **ngrams**, skip_ngrams, **sentences**, lines, paragraphs, regex, tweets, and ptb (Penn Treebank).

### Process

1.  Group by **line number** (above)
2.  Make each single word a token

```{r}
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
```

> Now that the data is in the one-word-per-row format, we can manipulate it with tidy tools like dplyr.

## Stop Words

`tidytext::get_stopwords()`

Remove stop-words from the books.

```{r}
matchwords_books <- tidy_books %>%
  anti_join(get_stopwords())

matchwords_books
```

### Join types

![](https://pbs.twimg.com/media/B6eUTTACUAAahLf.png "Dplyr Join Diagram")

### Customize your dictionaries

You can customize stop-words data frames, sentiment data frames, etc.

There are various *stop words* dictionaries. Here we add the stop word, "farfegnugen" to a custom dictionary. If Jane Austen ever used the word "farfegnugen" that would be weird, or bad. So we will take pains to not calculate the sentiment of that word - whether or not the term shows up in a sentiment dictionary. That is, we will remove the word by making it a part of a customized stop-words dictionary.

```{r}
stopwords::stopwords_getsources()
stopwords::stopwords_getlanguages("snowball")

stopwords_custom <- tribble(~word, ~lexicon,
                            "farfegnugen", "custom")

stopwords_custom

get_stopwords(source = "snowball")

bind_rows(get_stopwords(), stopwords_custom)    # The default is "snowball"

```

### Calculate word frequency

How many Austen countable words are there if we remove *snowball* stop-words? There are `r   nrow(dplyr::distinct(matchwords_books, word))` countable words.

```{r}
matchwords_books %>% 
  # distinct(word)
  count(word, sort = TRUE) 
```

## Word clouds

```{r interactive word cloud, fig.width=10}
matchwords_books %>%
  count(word, sort = TRUE) %>%
  head(100) %>% 
  wordcloud2(size = .4, shape = 'triangle-forward', 
             color = c("steelblue", "firebrick", "darkorchid"), 
             backgroundColor = "salmon")

```

### Basic word cloud

A non-interactive word cloud.

```{r basic word cloud, fig.height=8, fig.width=8}
matchwords_books %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))
```

## Your Turn: Exercise 1

Goal: Make a basic word cloud for the novel, *Pride and Predjudice*, `pride_prej_novel`

a.  Prepare

```{r}
pride_prej_novel <- tibble(text = prideprejudice) %>% 
  mutate(line = row_number())
```

b.  Tokenize `pride_prej_novel` with `unnest_tokens()`

```{r}

```

c.  Remove stop-words

```{r}

```

d.  calculate word frequency

```{r}

```

e.  make a simple wordcloud

```{r}

```

## Sentiment Analysis

`get_sentiments()`

Let's see what positive words exist in the bing dictionary. Then, count the frequency of those positive words that exist in *Emma*.

```{r}
positive <- get_sentiments("bing") %>%
  filter(sentiment == "positive")                    # get POSITIVE words

positive 

tidy_books %>%
  filter(book == "Emma") %>%                        # only the book _emma_
  semi_join(positive) %>%                           # semi_join()
  count(word, sort = TRUE)
```

### Prepare to visualize sentiment score

Match all the Austen books to the bing sentiment dictionary. Count the word frequency.

```{r}
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book)
```

### Calculate sentiment

> **Algorithm:** sentiment = positive - negative

Define a section of text.

> "Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts... -- [Text Mining with R](https://www.tidytextmining.com/sentiment.html)

```{r echo=TRUE}
bing <- get_sentiments("bing")

janeaustensentiment <- tidy_books %>% 
  inner_join(bing) %>% 
  count(book, index = line %/% 80, sentiment) %>%                          # `%/%` = int division ; 80 lines / section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%    # spread(sentiment, n, fill = 0)
  mutate(sentiment = positive - negative)                                      # ALGO!!!
  
janeaustensentiment
```

### Viz it

```{r sentiment score}
janeaustensentiment %>%
  ggplot(aes(index, sentiment, )) +
  geom_col(show.legend = FALSE, fill = "cadetblue") +
  geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
  geom_hline(yintercept = 0, color = "goldenrod") +
  facet_wrap(~ book, ncol = 2, scales = "free_x") 
```

### Preparation: Most common positive and negative words

```{r}
bing_word_counts <- tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)

bing_word_counts
```

### Viz it too

```{r positive and negative, fig.height=7, fig.width=10}
bing_word_counts %>%
  filter(n > 170) %>%
  mutate(n = if_else(sentiment == "negative", - n, n)) %>%
  ggplot(aes(fct_reorder(str_to_title(word), n), n, fill = str_to_title(sentiment))) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(type = "qual") +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Frequency of popular positive and negative words",
       subtitle = "Jane Austen novels",
       y = "Compound sentiment score", x = "",
       fill = "Sentiment", caption = "Source: library(janeaustenr)") +
  theme(plot.title.position = "plot")
```

## Dictionaries

What other dictionaries are available? How to choose?

-   [Without Dictiionaries there is no sentiment analysis](http://www.thinkingondata.com/without-dictionaries-no-sentiment-analysis/)
-   [Sentiment Analysis: Analyzing Lexicon Quality and Estimation Errors](https://paulvanderlaken.com/2017/12/27/sentiment-analysis-lexicon-quality/)
-   [Limits of the Bing, AFINN, and NRC Lexicons with the Tidytext Package in R](https://hoyeolkim.wordpress.com/2018/02/25/the-limits-of-the-bing-afinn-and-nrc-lexicons-with-the-tidytext-package-in-r/)
-   [Case Study with Harry Potter](https://afit-r.github.io/sentiment_analysis)

```{r}
head(get_sentiments("bing"))
head(get_sentiments("loughran"))
head(get_sentiments("nrc"))
head(get_sentiments("afinn"))

get_sentiments("nrc") %>% 
  count(sentiment, sort = TRUE) 

```

## Afinn

What words in *Emma* match the AFINN dictionary?

```{r}
emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>% 
  anti_join(get_stopwords()) %>% 
  inner_join(get_sentiments("afinn"))

emma_afinn
```

```{r}
emma_afinn %>% 
  count(word, sort = TRUE)
```

### Make Sections

Just as we calculated sentiment, above, make sections of 80 words then calculate sentiment.

```{r}
emma_afinn_sentiment <- emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  group_by(index) %>% 
  summarise(sentiment = sum(value))           ## ALGO sum each Afinn score in the 80 word section

emma_afinn_sentiment

```

### Viz it

```{r emma word cloud}
emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  filter(index == 104) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud::wordcloud(word, n, 
                            rot.per = .3))

emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  filter(index == 104) %>%
  count(word, sort = TRUE) %>%
  wordcloud2(size = .4, shape = 'diamond',
             backgroundColor = "darkseagreen")

```

```{r emma afinn}
emma_afinn_sentiment %>% 
  ggplot(aes(index, sentiment)) +
  geom_col(aes(fill = cut_interval(sentiment, n = 5))) +
  geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
  scale_fill_brewer(palette = "RdBu", guide = FALSE) +
  theme(panel.background = element_rect(fill = "grey"),
        plot.background = element_rect(fill = "grey"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(title = "Afinn Sentiment Analysis of _Emma_")
```

```{r emma boxplot of afinn}
emma_afinn %>%
  mutate(word_count = 1:n(),
         index = as.character(word_count %/% 80)) %>%
  filter(index == 10 | index == 104 | index == 105) %>% 
  ggplot(aes(value, index)) +
  geom_boxplot() +
  # geom_boxplot(notch = TRUE) +
  geom_jitter() +
  coord_flip() +
  labs(y = "section", x = "Afinn")
```

## Resources

-   [Tidytext package](https://juliasilge.github.io/tidytext/)
-   Book: [Text Mining with R](https://www.tidytextmining.com/) by Silge and Robinson
-   Data Wrangling with dplyr: ([video](https://juliasilge.github.io/tidytext/) \| [workshop](https://rfun.library.duke.edu/portfolio/r_flipped/))
-   Data Visualization with ggplot2: ([video](https://warpwire.duke.edu/w/80YEAA/) \| [workshop](https://rfun.library.duke.edu/portfolio/ggplot_workshop/))