Merge pull request #104 from x-tabdeveloping/readme_draft_updates

x-tabdeveloping · web-flow · commit f1208ab25e2f · 2025-06-17T09:48:23.000+02:00
Incorporated feedback from peer review
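
A note on the quoting change in the install commands below: the commit does not state its motivation, but a likely reason is that square brackets are glob characters in most shells. zsh in its default configuration aborts with `no matches found` when an unquoted pattern such as `turftopic[pyro-ppl]` matches no file. The sketch below uses only Python's standard library to illustrate that, under glob rules, the bracketed part is read as a character class rather than as literal text. Python's `fnmatch` only approximates shell globbing, and the helper name `is_glob_pattern` is hypothetical; it is not part of Turftopic or pip.

```python
import fnmatch
import shlex

requirement = "turftopic[pyro-ppl]"


def is_glob_pattern(argument: str) -> bool:
    """Hypothetical helper: True if the argument contains glob metacharacters."""
    return any(char in argument for char in "*?[")


looks_like_glob = is_glob_pattern(requirement)

# Read as a pattern, "[pyro-ppl]" stands for ONE character out of a set,
# so the pattern does not even match its own literal spelling...
matches_itself = fnmatch.fnmatchcase(requirement, requirement)

# ...but it does match "turftopic" followed by a single character from the set.
matches_other_name = fnmatch.fnmatchcase("turftopicp", requirement)

# Quoting the argument stops the shell from interpreting the brackets.
safe_command = f"pip install {shlex.quote(requirement)}"

print(looks_like_glob, matches_itself, matches_other_name)
print(safe_command)
```

`shlex.quote` emits single quotes, whereas the README uses double quotes; either form keeps the shell from expanding the brackets.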
diff --git a/README.md b/README.md
@@ -36,13 +36,13 @@ pip install turftopic
 If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.
 
 ```bash
-pip install turftopic[pyro-ppl]
+pip install "turftopic[pyro-ppl]"
 ```
 
 If you want to use clustering models like BERTopic or Top2Vec, install:
 
 ```bash
-pip install turftopic[umap-learn]
+pip install "turftopic[umap-learn]"
 ```
 
 ### Fitting a Model
@@ -52,6 +52,8 @@ scikit-learn workflows.
 
 Here's an example of how you can use KeyNMF, one of our models, on the 20Newsgroups dataset from scikit-learn.
 
+> If you are using a Mac, you might have to install the required SSL certificates on your system in order to be able to download the dataset.
+
 ```python
 from sklearn.datasets import fetch_20newsgroups
 
@@ -68,7 +70,8 @@ Turftopic also comes with interpretation tools that make it easy to display and
 ```python
 from turftopic import KeyNMF
 
-model = KeyNMF(20).fit(corpus)
+model = KeyNMF(20)
+document_topic_matrix = model.fit_transform(corpus)
 ```
 
 ### Interpreting Models
@@ -131,6 +134,8 @@ model.print_topic_distribution(
 
 Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!
 
+> You will need to `pip install "turftopic[openai]"` for this to work.
+
 ```python
 from turftopic import KeyNMF
 from turftopic.namers import OpenAITopicNamer
@@ -154,6 +159,8 @@ model.print_topics()
 
 You can use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
 
+> You will need to `pip install "turftopic[spacy]"` for this to work.
+
 ```python
 from turftopic import BERTopic
 from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
@@ -175,10 +182,34 @@ model.print_topics()
 
 ### Visualization
 
-Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
+Turftopic comes with a number of visualization and pretty-printing utilities for specific models and specific contexts, such as hierarchical or dynamic topic modelling.
+You will find an overview of these in the [Interpreting and Visualizing Models](https://x-tabdeveloping.github.io/turftopic/model_interpretation/) section of our documentation.
+
+```bash
+pip install "turftopic[datamapplot, openai]"
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.show()
+```
+
+<center>
+  <img src="https://github.com/x-tabdeveloping/turftopic/blob/main/docs/images/cluster_datamapplot.png?raw=true" width="70%" style="margin-left: auto;margin-right: auto;">
+</center>
+
+In addition, Turftopic is natively supported in [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library that is compatible with all models from Turftopic.
 
 ```bash
-pip install topic-wizard
+pip install "turftopic[topic-wizard]"
 ```
 
 By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.
@@ -189,10 +220,10 @@ import topicwizard
 topicwizard.visualize(corpus, model=model)
 ```
 
-<figure>
+<center>
   <img src="https://x-tabdeveloping.github.io/topicwizard/_images/screenshot_topics.png" width="70%" style="margin-left: auto;margin-right: auto;">
   <figcaption>Screenshot of the topicwizard Web Application</figcaption>
-</figure>
+</center>
 
 Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/topicwizard/figures.html) in topicwizard for individual HTML figures.
 
diff --git a/paper.bib b/paper.bib
@@ -170,13 +170,15 @@ @inproceedings{sentence_transformers
     abstract = "BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations ({\textasciitilde}65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods."
 }
   
-@software{topicwizard,
-  author = {Kardos, Márton},
-  month = nov,
-  title = {{topicwizard: Pretty and opinionated topic model visualization in Python}},
-  url = {https://github.com/x-tabdeveloping/topic-wizard},
-  version = {0.5.0},
-  year = {2023}
+@misc{topicwizard,
+      title={topicwizard -- a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation}, 
+      author={Márton Kardos and Kenneth C. Enevoldsen and Kristoffer Laigaard Nielbo},
+      year={2025},
+      eprint={2505.13034},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2505.13034}, 
+      doi={10.48550/arXiv.2505.13034}
 }
 
 @article{discourse_analysis,
diff --git a/paper.md b/paper.md
@@ -34,7 +34,8 @@ bibliography: paper.bib
 
 # Summary
 
-Turftopic is a topic modelling library including a number of recent topic models that go beyond bag-of-words models and can understand text in context, utilizing representations from transformers.
+Topic models are machine learning techniques that discover themes in a set of documents.
+Turftopic is a topic modelling library including a number of recent developments in topic modelling that go beyond bag-of-words models and can understand text in context, utilizing representations from transformers.
 Turftopic focuses on ease of use, providing a unified interface for a number of different modern topic models, and boasting both model-specific and model-agnostic interpretation and visualization utilities.
 While the user is afforded great flexibility in model choice and customization, the library comes with reasonable defaults, so as not to needlessly overwhelm first-time users.
 In addition, Turftopic allows the user to: a) model topics as they change over time, b) learn topics on-line from a stream of texts, c) find hierarchical structure in topics, and d) learn topics in multilingual texts and corpora.
@@ -50,10 +51,11 @@ Some attempts have been made at creating unified packages for modern topic model
 These packages, however, have a focus on neural models and topic model evaluation, have abstract and highly specialized interfaces, and do not include some popular topic models.
 Additionally, while model interpretation is a fundamental aspect of topic modelling, the interpretation utilities provided in these libraries are fairly limited, especially in comparison with model-specific packages, like BERTopic.
 
-Turftopic unifies state-of-the-art contextual topic models under a superset of the `scikit-learn` [@scikit-learn] API, which users are likely already familiar with, and can be readily included in `scikit-learn` workflows and pipelines.
+Turftopic unifies state-of-the-art contextual topic models under a superset of the `scikit-learn` [@scikit-learn] API, which many users may be familiar with, and can be readily included in `scikit-learn` workflows and pipelines.
 We focused on making Turftopic first and foremost an easy-to-use library that does not necessitate expert knowledge or excessive amounts of code to get started with, but gives great flexibility to power users.
-Furthermore, we included an extensive suite of pretty-printing and visualization utilities that aid users in interpreting their results.
-The library also includes three topic models, which to our knowledge only have implementations in Turftopic, these are: KeyNMF [@keynmf], Semantic Signal Separation (S^3^) [@s3], and GMM, a Gaussian Mixture model of document representations with a soft-c-tf-idf term weighting scheme.
+Furthermore, we included an extensive suite of pretty-printing and model-specific visualization utilities that aid users in interpreting their results.
+In addition, we provide native compatibility with `topicwizard` [@topicwizard], a model-agnostic topic model visualization library.
+The library also includes three topic models that, to our knowledge, only have implementations in Turftopic: KeyNMF [@keynmf], Semantic Signal Separation (S^3^) [@s3], and GMM, a Gaussian Mixture model of document representations with a soft-c-tf-idf term weighting scheme.
 
 # Functionality