topic_model_workshop

These scripts are intended as basic introductions to two topic modeling methods: structural topic models (STM; Roberts et al., 2014) and BERTopic (Grootendorst, 2022).

STM is a "semiautomated approach" to analyzing text data that "discover[s] topics" within documents (e.g., open-ended responses, narratives). The result is "a model where each open-ended response is a mixture of topics" that also "allow[s] for the inclusion of covariates of interest into the prior distributions for document-topic proportions and topic-word distributions" (e.g., "the author’s gender, political affiliation, and treatment assignment") (pp. 1064-1067).

Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

BERTopic is another method that "extract[s] coherent topic representation" by "approach[ing] topic modeling as a clustering task". Specifically, it "generates document embedding[s] with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure." (p. 1)

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794. https://doi.org/10.48550/arXiv.2203.05794
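
To make the last step of that pipeline concrete, here is a minimal sketch of the class-based TF-IDF weighting described in Grootendorst (2022), where each cluster is treated as a single "class document" and a term's weight is its frequency in the class times log(1 + A / its frequency across all classes), with A the average number of words per class. This is an illustration only, not the BERTopic library's implementation: the function names (`class_tfidf`, `top_terms`), the whitespace tokenizer, and the toy clusters are all assumptions made for this example.

```python
import math
from collections import Counter


def class_tfidf(docs_by_cluster):
    """Return {cluster: {term: weight}} with weight = tf_in_class * log(1 + A / tf_overall)."""
    # Treat every cluster as one concatenated "class document" (naive whitespace tokens).
    term_counts = {
        cluster: Counter(word for doc in docs for word in doc.lower().split())
        for cluster, docs in docs_by_cluster.items()
    }
    # Frequency of each term across all classes.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(term_counts)
    return {
        cluster: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for cluster, counts in term_counts.items()
    }


def top_terms(weights, n=3):
    """Pick the n highest-weighted terms per cluster (ties broken alphabetically)."""
    return {
        cluster: [term for term, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for cluster, w in weights.items()
    }


# Hypothetical toy clusters, standing in for the output of the clustering step.
toy_clusters = {
    0: ["exam stress exam", "stress about the exam"],
    1: ["rent money stress", "money for the rent"],
}

print(top_terms(class_tfidf(toy_clusters), n=2))
# → {0: ['exam', 'stress'], 1: ['money', 'rent']}
```

Note that "stress" appears in both clusters, so it is down-weighted relative to terms such as "rent" and "money" that occur in only one.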

For instructional purposes, the scripts use the Dreaddit dataset (Turcan & McKeown, 2019).

Turcan, E., & McKeown, K. (2019). Dreaddit: A reddit dataset for stress analysis in social media. arXiv preprint arXiv:1911.00133. https://doi.org/10.48550/arXiv.1911.00133

Usage

Scripts for each method (STM, BERTopic) are housed in their respective folders (/stm/, /bertopic/).

STM

For STM, open stm.RProj and then the script (stm.qmd, a Quarto file). The script points to the example Dreaddit dataset in the /input/ folder; change the path if you are using your own data. See dreaddit-train.csv in the /input/ folder for an example of how the input data file should be set up. If your data uses different formatting or variable names, adjust the script (stm.qmd) to match.

BERTopic

For BERTopic, the script (bertopic.ipynb) is currently configured to run in Google Colab (see the [hopefully functional] button above). To run it in Google Colab, first upload the example Dreaddit data file (dreaddit-train.csv in the /input/ folder), or whichever other data file you'd like to work with, to your Google Drive. The script includes instructions for pointing to that uploaded file in your Google Drive. Again, adjust as necessary so that it points to the correct variables and file locations.
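
As a rough illustration of what "pointing to the uploaded data file" can look like, the sketch below reads one text column from a CSV and, in Colab, mounts Google Drive first. It is not taken from bertopic.ipynb: the helper names (`load_documents`, `load_from_drive`), the column name `text`, and the assumption that the file sits at the top level of "My Drive" are all hypothetical, so follow the notebook's own instructions where they differ.

```python
import csv


def load_documents(csv_path, text_column="text"):
    """Read one text column from a CSV file and return the non-empty values as a list."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if text_column not in (reader.fieldnames or []):
            # Fail early if the file uses different variable names.
            raise KeyError(f"{text_column!r} not found; columns are {reader.fieldnames}")
        return [row[text_column] for row in reader if (row[text_column] or "").strip()]


def load_from_drive(relative_path, text_column="text"):
    """Colab only: mount Google Drive, then read documents from a file inside it."""
    from google.colab import drive  # this module exists only inside Google Colab

    drive.mount("/content/drive")
    # Assumes the file was uploaded under "My Drive"; adjust the path otherwise.
    return load_documents(f"/content/drive/MyDrive/{relative_path}", text_column)
```

In a Colab cell, something like `docs = load_from_drive("dreaddit-train.csv")` would then give a list of strings that can be passed to `BERTopic().fit_transform(docs)`. If your own data stores the text under another column name, pass it as `text_column`.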

Authors and Acknowledgment

Authors:

Ryan Yeung

Adapted from Make a README template and Best-README-Template.
