07-tweet-archives.Rmd (+11 −11)
@@ -31,10 +31,10 @@ David and Julia tweet at about the same rate currently and joined Twitter about
Let's use `unnest_tokens()` to make a tidy data frame of all the words in our tweets, and remove the common English stop words. There are certain conventions in how people use text on Twitter, so we will use a specialized tokenizer and do a bit more work with our text here than, for example, we did with the narrative text from Project Gutenberg.
-First, we will remove tweets from this dataset that are retweets so that we only have tweets that we wrote ourselves. Next, the `mutate()` line cleans out some characters that we don't want like ampersands and such.
+First, we will remove tweets from this dataset that are retweets so that we only have tweets that we wrote ourselves. Next, the `mutate()` line removes links and cleans out some characters that we don't want, like ampersands and such.
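The chunk that performs this cleanup falls outside the hunk shown here. As a minimal sketch of the step just described, assuming the combined archive is a data frame called `tweets` with the tweet text in a `text` column:

```r
library(dplyr)
library(stringr)

# Drop retweets (which begin with "RT"), then strip links and
# HTML entities such as &amp; from the remaining tweet text.
tweets_clean <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", ""))
```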
```{block, type = "rmdnote"}
-In the call to `unnest_tokens()`, we unnest using the specialized `"tweets"` tokenizer that is built in to the tokenizers package [@R-tokenizers]. This tool is very useful for dealing with Twitter text or other text from online forums; it retains hashtags and mentions of usernames with the `@` symbol.
+In the call to `unnest_tokens()`, we unnest using a regex pattern, instead of just looking for single unigrams (words). This regex pattern is very useful for dealing with Twitter text or other text from online forums; it retains hashtags and mentions of usernames with the `@` symbol.
```
Because we have kept text such as hashtags and usernames in the dataset, we can't use a simple `anti_join()` to remove stop words. Instead, we can take the approach shown in the `filter()` line that uses `str_detect()` from the stringr package.
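The tokenize-and-filter chunk itself is also outside this hunk. Here is a hedged sketch of what the two paragraphs above describe, where `unnest_reg` is one plausible pattern that preserves `#` and `@` prefixes (not necessarily the book's exact regex):

```r
library(dplyr)
library(stringr)
library(tidytext)

# Split on characters that can't appear in a word, hashtag, or @mention,
# treating an apostrophe as part of a word only when a word character follows.
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

tidy_tweets <- tweets_clean %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  # Instead of a plain anti_join(), filter out stop words directly and
  # keep only tokens that contain at least one letter.
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
```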
@@ -43,11 +43,13 @@ Because we have kept text such as hashtags and usernames in the dataset, we can'
-We are about equally likely to tweet about words, science, ideas, and email.
+We are about equally likely to tweet about maps, email, files, and APIs.
Which words are most likely to be from Julia's account or from David's account? Let's just take the top 15 most distinctive words for each account and plot them in Figure \@ref(fig:plotratios).
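The ratio computation sits outside the hunk. As a sketch, assuming a `word_ratios` data frame with a `logratio` column holding David-vs-Julia log odds ratios (as the surrounding prose implies), selecting and plotting the top 15 per account could look like:

```r
library(dplyr)
library(ggplot2)

word_ratios %>%
  group_by(logratio < 0) %>%         # split into David-leaning and Julia-leaning
  top_n(15, abs(logratio)) %>%       # 15 most distinctive words on each side
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL, y = "log odds ratio (David/Julia)")
```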
@@ -270,16 +272,14 @@ Now that we have this second, smaller set of only recent tweets, let's again use
Notice that the `word` column contains tokenized emoji.
-
To start with, let’s look at the number of times each of our tweets was retweeted. Let's find the total number of retweets for each person.
```{r rt_totals, dependson = "tidy_tweets2"}
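library(dplyr)

# The body of this chunk lies outside the hunk; this is a sketch of what it
# plausibly computes, assuming each tweet carries an `id` and a `retweets`
# count (both column names are assumptions, not confirmed by the diff).
totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = first(retweets)) %>%   # one row per tweet, not per word
  group_by(person) %>%
  summarise(total_rts = sum(rts))        # total retweets for each person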
@@ -328,7 +328,7 @@ word_by_rts %>%
y = "Median # of retweets for tweets containing each word")
```
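Most of the `word_by_rts %>%` pipeline whose `labs()` call closes above lies outside the hunk. A sketch of how such a summary is typically built, with the same caveat that the `retweets` column name is an assumption:

```r
library(dplyr)

# Median retweet count for each word a person used, counting each tweet once
# and recording how many of their tweets used the word.
word_by_rts <- tidy_tweets %>%
  group_by(id, word, person) %>%
  summarise(rts = first(retweets)) %>%
  group_by(person, word) %>%
  summarise(retweets = median(rts), uses = n()) %>%
  ungroup()
```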
-We see lots of word about R packages, including tidytext, a package about which you are reading right now!
+We see lots of words about R packages, including tidytext, a package about which you are reading right now! The "0" for David comes from tweets where he mentions version numbers of packages, like ["broom 0.4.0"](https://twitter.com/drob/status/671430703234576384) or similar.
We can follow a similar procedure to see which words led to more favorites. Are they different from the words that led to more retweets?