Commit 9001c1e

We don't have token = "tweets" anymore, so undo e9b98a1
1 parent e01b693 commit 9001c1e

1 file changed: 07-tweet-archives.Rmd (+11 -11)

````diff
@@ -31,10 +31,10 @@ David and Julia tweet at about the same rate currently and joined Twitter about
 
 Let's use `unnest_tokens()` to make a tidy data frame of all the words in our tweets, and remove the common English stop words. There are certain conventions in how people use text on Twitter, so we will use a specialized tokenizer and do a bit more work with our text here than, for example, we did with the narrative text from Project Gutenberg.
 
-First, we will remove tweets from this dataset that are retweets so that we only have tweets that we wrote ourselves. Next, the `mutate()` line cleans out some characters that we don't want like ampersands and such.
+First, we will remove tweets from this dataset that are retweets so that we only have tweets that we wrote ourselves. Next, the `mutate()` line removes links and cleans out some characters that we don't want like ampersands and such.
 
 ```{block, type = "rmdnote"}
-In the call to `unnest_tokens()`, we unnest using the specialized `"tweets"` tokenizer that is built in to the tokenizers package [@R-tokenizers]. This tool is very useful for dealing with Twitter text or other text from online forums; it retains hashtags and mentions of usernames with the `@` symbol.
+In the call to `unnest_tokens()`, we unnest using a regex pattern, instead of just looking for single unigrams (words). This regex pattern is very useful for dealing with Twitter text or other text from online forums; it retains hashtags and mentions of usernames with the `@` symbol.
 ```
 
 Because we have kept text such as hashtags and usernames in the dataset, we can't use a simple `anti_join()` to remove stop words. Instead, we can take the approach shown in the `filter()` line that uses `str_detect()` from the stringr package.
````

````diff
@@ -43,11 +43,13 @@ Because we have kept text such as hashtags and usernames in the dataset, we can'
 library(tidytext)
 library(stringr)
 
-remove_reg <- "&amp;|&lt;|&gt;"
+replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
+unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
+
 tidy_tweets <- tweets %>%
   filter(!str_detect(text, "^RT")) %>%
-  mutate(text = str_remove_all(text, remove_reg)) %>%
-  unnest_tokens(word, text, token = "tweets") %>%
+  mutate(text = str_replace_all(text, replace_reg, "")) %>%
+  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
   filter(!word %in% stop_words$word,
          !word %in% str_remove_all(stop_words$word, "'"),
          str_detect(word, "[a-z]"))
````
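
As a quick illustration of what the two new patterns do (the sample tweet below is invented, and this snippet is not part of the commit): `replace_reg` strips links, HTML entities, and stray "RT"/"https" fragments, while `unnest_reg` splits on anything that is not a word character, `#`, `@`, or an apostrophe inside a word, so hashtags and usernames survive as single tokens.

```r
library(dplyr)
library(stringr)
library(tidytext)

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# An invented tweet, just to exercise the two patterns
tibble(text = "Loving #rstats with @juliasilge &amp; friends https://t.co/abcd1234") %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%   # drops the link and the &amp;
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg)
# tokens: loving, #rstats, with, @juliasilge, friends
```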

````diff
@@ -133,7 +135,7 @@ word_ratios %>%
   arrange(abs(logratio))
 ```
 
-We are about equally likely to tweet about words, science, ideas, and email.
+We are about equally likely to tweet about maps, email, files, and APIs.
 
 Which words are most likely to be from Julia's account or from David's account? Let's just take the top 15 most distinctive words for each account and plot them in Figure \@ref(fig:plotratios).
 
````
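
The `word_ratios` object itself is not part of this diff; for context, here is a minimal sketch of the kind of smoothed log odds ratio behind `logratio`, assuming `tidy_tweets` has a `person` column with values "David" and "Julia" (not necessarily the chapter's exact code):

```r
library(dplyr)
library(tidyr)

word_ratios <- tidy_tweets %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%                 # keep reasonably common words
  ungroup() %>%
  pivot_wider(names_from = person, values_from = n, values_fill = 0) %>%
  mutate(across(c(David, Julia), ~ (.x + 1) / (sum(.x) + 1))) %>%  # add-one smoothing
  mutate(logratio = log(David / Julia))    # near 0 = used about equally by both

word_ratios %>% arrange(abs(logratio))
```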

````diff
@@ -270,16 +272,14 @@ Now that we have this second, smaller set of only recent tweets, let's again use
 ```{r tidy_tweets2, dependson = "setup2"}
 tidy_tweets <- tweets %>%
   filter(!str_detect(text, "^(RT|@)")) %>%
-  mutate(text = str_remove_all(text, remove_reg)) %>%
-  unnest_tokens(word, text, token = "tweets", strip_url = TRUE) %>%
+  mutate(text = str_replace_all(text, replace_reg, "")) %>%
+  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
   filter(!word %in% stop_words$word,
          !word %in% str_remove_all(stop_words$word, "'"))
 
 tidy_tweets
 ```
 
-Notice that the `word` column contains tokenized emoji.
-
 To start with, let’s look at the number of times each of our tweets was retweeted. Let's find the total number of retweets for each person.
 
 ```{r rt_totals, dependson = "tidy_tweets2"}
````
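
Note that this second pass filters out replies as well as retweets via the anchored pattern `"^(RT|@)"`; a toy check (the strings are invented, not from the dataset):

```r
library(stringr)

str_detect(c("RT @friend: look at this", "@friend hello!", "just a regular tweet"),
           "^(RT|@)")
#> [1]  TRUE  TRUE FALSE
```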

````diff
@@ -328,7 +328,7 @@ word_by_rts %>%
        y = "Median # of retweets for tweets containing each word")
 ```
 
-We see lots of words about R packages, including tidytext, a package about which you are reading right now!
+We see lots of words about R packages, including tidytext, a package about which you are reading right now! The "0" for David comes from tweets where he mentions version numbers of packages, like ["broom 0.4.0"](https://twitter.com/drob/status/671430703234576384) or similar.
 
 We can follow a similar procedure to see which words led to more favorites. Are they different than the words that lead to more retweets?
 
````