Commit

Rebrand Twitter v2 datasource as compatible with the X Research API
stijn-uva committed Nov 4, 2024
1 parent dcc0a21 commit 1871019
Showing 3 changed files with 95 additions and 103 deletions.
81 changes: 38 additions & 43 deletions datasources/twitterv2/DESCRIPTION.md
@@ -1,93 +1,88 @@
Twitter data is gathered through the official [Twitter v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT
allows access to both the Standard and the Academic track. The Standard track is free for anyone to use, but only
allows to retrieve tweets up to seven days old. The Academic track allows a full-archive search of up to ten million
tweets per month (as of March 2022). For the Academic track, you need a valid Bearer token. You can request one
[here](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you).
X/Twitter data is gathered through the official [X v2 API](https://developer.twitter.com/en/docs/twitter-api). 4CAT can interface with X's Research API (sometimes
branded as the 'DSA API', referencing the EU's Digital Services Act). To retrieve posts via this API with 4CAT, you need
a valid Bearer token. Read more about this mode of access [here](https://developer.x.com/en/use-cases/do-research/academic-research).

Tweets are captured in batches at a speed of approximately 100,000 tweets per hour. 4CAT will warn you if your dataset
Posts are captured in batches at a speed of approximately 100,000 posts per hour. 4CAT will warn you if your dataset
is expected to take more than 30 minutes to collect. It is often a good idea to start small (with very specific
queries or narrow date ranges) and then only create a larger dataset if you are confident that it will be manageable and
useful for your analysis.

If you hit your Twitter API quota while creating a dataset, the dataset will be finished with the tweets that have been
If you hit your X API quota while creating a dataset, the dataset will be finished with the posts that have been
collected so far and a warning will be logged.

### Query syntax

Check the [API documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query)
Check the [API documentation](https://developer.x.com/en/docs/x-api/tweets/search/integrate/build-a-query)
for available query syntax and operators. This information is crucial to what data you collect. Important operators for
instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted tweets and retweets. Query syntax
is roughly the same as for Twitter's search interface, so you can try out most queries by entering them in the Twitter
app or website's search field and looking at the results. You can also test queries with
Twitter's [Query Builder](https://developer.twitter.com/apitools/query?query=).
instance include `-is:nullcast` and `-is:retweet`, with which you can ignore promoted posts and reposts. Query syntax
is roughly the same as for X's search interface, so you can try out most queries by entering them in the X app or
website's search field and looking at the results. You can also test queries with
X's [Query Builder](https://developer.twitter.com/apitools/query?query=).
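
As a minimal sketch of how these operators combine, a query string that skips promoted posts and reposts could be assembled as follows. The helper `build_query` and its parameters are hypothetical and not part of 4CAT, which takes the query as you type it.

```python
def build_query(terms, exclude_promoted=True, exclude_reposts=True):
    """Join search terms with the negated operators mentioned above.

    Hypothetical helper for illustration only; not a 4CAT function.
    """
    parts = [terms.strip()]
    if exclude_promoted:
        parts.append("-is:nullcast")  # ignore promoted posts
    if exclude_reposts:
        parts.append("-is:retweet")  # ignore reposts
    return " ".join(parts)


query = build_query("#FreeAlexSaab lang:es")
print(query)
```

The resulting string can be pasted into the search field or the Query Builder to check what it returns before starting a collection.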

### Date ranges

By default, Twitter returns tweets posted within the past 30 days. If you want to go back further, you need to
explicitly set a date range. Note that Twitter does not like date ranges that end in the future, or start before
Twitter existed. If you want to capture tweets "until now", it is often best to use yesterday as an end date.
By default, X returns posts published within the past 30 days. If you want to go back further, you need to
explicitly set a date range. Note that X does not like date ranges that end in the future, or start before
Twitter existed. If you want to capture posts "until now", it is often best to use yesterday as an end date. Also note
that API access may come with certain limitations on how far a query may extend into history.
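
The advice above can be sketched as a small check that clamps a requested range before it is submitted. This is an illustration under stated assumptions, not 4CAT's own validation: `clamp_date_range` is a hypothetical helper, and the lower bound uses 21 March 2006 (the date of the first public tweet) as a stand-in for "before Twitter existed".

```python
from datetime import date, timedelta

# Assumption: the date of the first public tweet serves as the earliest valid start.
EARLIEST_START = date(2006, 3, 21)


def clamp_date_range(start, end, today=None):
    """Clamp a range so it neither predates Twitter nor ends later than yesterday."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    start = max(start, EARLIEST_START)
    end = min(end, yesterday)
    if start > end:
        raise ValueError("date range is empty after clamping")
    return start, end


start, end = clamp_date_range(date(2000, 1, 1), date(2030, 1, 1), today=date(2024, 11, 4))
print(start.isoformat(), end.isoformat())
```

Any limit that your API access places on how far back a query may reach would still need to be applied on top of this.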

### Geo parameters

Twitter offers a number of ways
to [query by location/geo data](https://developer.twitter.com/en/docs/tutorials/filtering-tweets-by-location)
such as `has:geo`, `place:Amsterdam`, or `place:Amsterdam`. This feature is only available for the Academic level;
you will receive a 400 error if using queries filtering by geographic information.
X offers a number of ways
to [query by location/geo data](https://developer.x.com/en/docs/tutorials/filtering-tweets-by-location)
such as `has:geo` or `place:Amsterdam`.

### Retweets

A retweet from Twitter API v2 contains at maximum 140 characters from the original tweet. 4CAT therefore
gathers both the retweet and the original tweet and reformats the retweet text so it resembles a user's experience.
A repost from X API v2 contains at maximum 140 characters from the original post. 4CAT therefore
gathers both the repost and the original post and reformats the repost text so it resembles a user's experience.

This also affects mentions, hashtags, and other data as only those contained in the first 140 characters are provided
by Twitter API v2 with the retweet. Additional hashtags, mentions, etc. are taken from the original tweet and added
to the retweet for 4CAT analysis methods. *4CAT stores the data from Twitter API v2 as similar as possible to the format
by X API v2 with the repost. Additional hashtags, mentions, etc. are taken from the original post and added
to the repost for 4CAT analysis methods. *4CAT stores the data from X API v2 as similar as possible to the format
in which it was received, which you can obtain by downloading the ndjson file.*

*Example 1*

[This retweet](https://twitter.com/tonino1630/status/1554618034299568128) returns the following data:
[This repost](https://x.com/tonino1630/status/1554618034299568128) returns the following data:

- *author:* `tonino1630`
- *
text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
- *text:* `RT @ChuckyFrao: ¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar p…`
- *mentions:* `ChuckyFrao`
- *hashtags:*

<br>
While the original tweet will return (as a reference tweet) this data:
While the original post will return (as a reference post) this data:

- *author:* `ChuckyFrao`
- *
text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0`
- *text:* `¡HUELE A LIBERTAD! La Casa Blanca publicó una orden ejecutiva sobre las acciones del Gobierno de Joe Biden para negociar presos estadounidenses en otros países. #FreeAlexSaab @POTUS @usembassyve @StateSPEHA @StateDept @SecBlinken #BringAlexHome #IntegridadTerritorial https://t.co/ClSQ3Rfax0`
- *mentions:* `POTUS, usembassyve, StateSPEHA, StateDept, SecBlinken`
- *hashtags:* `FreeAlexSaab, BringAlexHome, IntegridadTerritorial`

<br>
As you can see, only the author of the original tweet is listed as a mention in the retweet.
As you can see, only the author of the original post is listed as a mention in the repost.

*Example 2*

[This retweet](https://twitter.com/Macsmart31/status/1554618041459445760) returns the following:
[This repost](https://x.com/Macsmart31/status/1554618041459445760) returns the following:

- *author:* `Macsmart31`
- *
text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…`
- *text:* `RT @mickyd123us: @tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the de…`
- *mentions:* `mickyd123us, tribelaw, HonorDecency`

<br>
Compared with the original tweet referenced below:
Compared with the original post referenced below:

- *author:* `mickyd123us`
- *
text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr`
- *text:* `@tribelaw @HonorDecency Thank goodness Biden replaced his detail - we know that Pence refused to "Take A Ride" with the detail he had in the basement. Who knows where they would have taken him. https://t.co/s47Kb5RrCr`
- *mentions:* `tribelaw, HonorDecency`

<br>
Because the mentioned users are in the first 140 characters of the original tweet, they are also listed as mentions in the retweet.

The key difference here is that example one the retweet contains none of the hashtags or mentions from the original
tweet (they are beyond the first 140 characters) while the second retweet example does return mentions from the original
tweet. *Due to this discrepancy, for retweets all mentions and hashtags of the original tweet are considered as mentions
and hashtags of the retweet.* A user on Twitter will see all mentions and hashtags when viewing a retweet and the
retweet would be a part of any network around those mentions and hashtags.
Because the mentioned users are in the first 140 characters of the original post, they are also listed as mentions in
the repost.

The key difference here is that in example one the repost contains none of the hashtags or mentions from the original
post (they are beyond the first 140 characters) while the second repost example does return mentions from the original
post. *Due to this discrepancy, for reposts all mentions and hashtags of the original post are considered as mentions
and hashtags of the repost.* A user on X will see all mentions and hashtags when viewing a repost and the
repost would be a part of any network around those mentions and hashtags.
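
The merging rule described above can be sketched as follows. This is a simplified illustration, not 4CAT's actual implementation: `merge_entities` is a hypothetical helper, and the flat dictionaries with `mentions` and `hashtags` lists are an assumed shape rather than the structure of the API response.

```python
def merge_entities(repost, original):
    """Return a copy of the repost with the original post's mentions and hashtags added.

    Items already present on the repost keep their position; new ones are appended.
    """
    merged = dict(repost)
    for key in ("mentions", "hashtags"):
        combined = list(repost.get(key, []))
        for item in original.get(key, []):
            if item not in combined:
                combined.append(item)
        merged[key] = combined
    return merged


# Data taken from Example 1 above.
repost = {"author": "tonino1630", "mentions": ["ChuckyFrao"], "hashtags": []}
original = {
    "author": "ChuckyFrao",
    "mentions": ["POTUS", "usembassyve", "StateSPEHA", "StateDept", "SecBlinken"],
    "hashtags": ["FreeAlexSaab", "BringAlexHome", "IntegridadTerritorial"],
}

merged = merge_entities(repost, original)
print(merged["mentions"])
print(merged["hashtags"])
```

With the data from Example 1, the merged repost lists the original author plus all five mentioned accounts and all three hashtags, which matches what a user would see when viewing the repost.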
2 changes: 1 addition & 1 deletion datasources/twitterv2/__init__.py
@@ -9,4 +9,4 @@

# Internal identifier for this data source
DATASOURCE = "twitterv2"
NAME = "Twitter API (v2) Search"
NAME = "X/Twitter API (v2) Search"