Commit 4733be8

Merge pull request #74 from x-tabdeveloping/datamapplot
Clustering model updates
2 parents 6dfbcfb + 496fe73 commit 4733be8

7 files changed: +577 -18 lines changed

README.md

Lines changed: 26 additions & 0 deletions
@@ -20,6 +20,32 @@
 
 > This package is still a work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.
 
+### New in version 0.10.0
+
+You can interactively explore clusters using `datamapplot` directly in Turftopic!
+You will first have to install `datamapplot` for this to work.
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid")
+model.fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
+
+<figure>
+<img src="docs/images/cluster_datamapplot.png" width="70%" style="margin-left: auto;margin-right: auto;">
+<figcaption>Interactive figure to explore cluster structure in a clustering topic model.</figcaption>
+</figure>
+
 ### New in version 0.9.0
 
 #### Dynamic S³ 🧭

docs/basics.md

Lines changed: 34 additions & 1 deletion
@@ -282,7 +282,40 @@ model.print_topics()
 
 ### Visualization
 
-Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation is fully compatible with Turftopic models.
+#### Datamapplot *(clustering models only)*
+
+You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
+You will first have to install `datamapplot` for this to work:
+
+```bash
+pip install turftopic[datamapplot]
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+!!! info
+    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
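
Reviewer note on the `fig.show()` remark: a script that may run both inside and outside a notebook could branch on the environment. The sketch below is one possible way to do that. `running_in_notebook` and `display_figure` are hypothetical helpers, not part of Turftopic, and the check assumes a standard Jupyter kernel, whose IPython shell class is named `ZMQInteractiveShell`; other front ends may behave differently.

```python
def running_in_notebook() -> bool:
    """Best-effort check for a Jupyter kernel (hypothetical helper, not a Turftopic API)."""
    try:
        # IPython is only importable where it is installed; plain scripts may lack it.
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()  # None outside of any IPython session
    # Assumption: Jupyter kernels use a shell class named ZMQInteractiveShell.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


def display_figure(fig):
    """Return `fig` for inline display in a notebook, otherwise call `fig.show()`."""
    if running_in_notebook():
        return fig
    fig.show()
    return None
```

In a notebook cell, `display_figure(fig)` as the last expression would render the figure inline; in a script it falls back to `fig.show()`.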
+
+<figure>
+<iframe src="../images/cluster_datamapplot.html" title="Cluster visualization" style="height:800px;width:800px;padding:0px;border:none;"></iframe>
+<figcaption> Interactive figure to explore cluster structure in a clustering topic model. </figcaption>
+</figure>
+
+
+#### topicwizard
+
+Turftopic integrates with [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic visualization.
 
 ```bash
 pip install topic-wizard

docs/clustering.md

Lines changed: 44 additions & 6 deletions
@@ -18,29 +18,36 @@ that the other libraries boast.
 from sklearn.manifold import TSNE
 from turftopic import ClusteringTopicModel
 
-model = ClusteringTopicModel(clustering=TSNE())
+model = ClusteringTopicModel(dimensionality_reduction=TSNE())
 ```
 
 It is common practice to reduce the dimensionality of the embeddings before clustering them.
 This avoids the curse of dimensionality, an issue that affects many clustering models.
-Dimensionality reduction by default is done with scikit-learn's **TSNE** implementation in Turftopic,
+Dimensionality reduction by default is done with [**TSNE**](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne) in Turftopic,
 but users are free to specify the model that will be used for dimensionality reduction.
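
Reviewer note on the curse of dimensionality mentioned above: as the number of dimensions grows, the distances from a point to its nearest and farthest neighbours become nearly equal, so distance-based clustering loses contrast. The sketch below measures this effect on uniformly random points with the standard library only. The function name, sample size, and dimensions are illustrative choices and are not taken from Turftopic.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Return (farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# The contrast shrinks as the dimensionality grows.
for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.2f}")
```

A low relative contrast means every point looks roughly equally far from every other point, which is the situation dimensionality reduction is meant to avoid.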
 
+!!! tip "Use openTSNE for better performance!"
+    By default, a scikit-learn implementation is used, but if you have the [openTSNE](https://github.com/pavlin-policar/openTSNE) package installed on your system, Turftopic will automatically use it.
+    You can potentially speed up your clustering topic models by multiple orders of magnitude.
+    ```bash
+    pip install turftopic[opentsne]
+    ```
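
Reviewer note on the automatic switch described in the tip: it can be pictured as an optional-dependency fallback. The sketch below shows the general pattern only and is an assumption, not Turftopic's actual code. `first_available` is a hypothetical helper that returns the first importable top-level module from a preference list, using the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec
from typing import Optional, Sequence


def first_available(candidates: Sequence[str]) -> Optional[str]:
    """Return the first top-level module name that can be imported, or None."""
    for name in candidates:
        # For a top-level name, find_spec returns None when the module is missing.
        if find_spec(name) is not None:
            return name
    return None


# Prefer openTSNE when it is installed, otherwise fall back to scikit-learn.
backend = first_available(["openTSNE", "sklearn"])
print(f"t-SNE backend: {backend}")
```

Checking with `find_spec` avoids importing a heavy package just to learn whether it exists.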
+
 ??? note "What reduction model should I choose?"
     Our knowledge about the impact of the choice of dimensionality reduction method is limited, as it has not yet been explored in the literature.
-    Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
+    Top2Vec and BERTopic both use [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html), which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
 
 ### Clustering
 
 ```python
-from sklearn.cluster import OPTICS
+from sklearn.cluster import HDBSCAN
 from turftopic import ClusteringTopicModel
 
-model = ClusteringTopicModel(clustering=OPTICS())
+model = ClusteringTopicModel(clustering=HDBSCAN())
 ```
 
 After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
-As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses **OPTICS** as its default.
+Turftopic uses [**HDBSCAN**](https://scikit-learn.org/stable/modules/clustering.html#hdbscan) as its default.
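
Reviewer note on the new default: density-based models such as HDBSCAN infer the number of clusters from the data instead of taking it as a parameter. The sketch below illustrates that idea with a plain DBSCAN written against the standard library. It is a simplified stand-in chosen for brevity, not HDBSCAN itself and not Turftopic's implementation, and the `eps` and `min_samples` values are illustrative.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[Optional[int]]:
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Neighbourhoods include the point itself, as in the usual DBSCAN definition.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: start a new cluster and expand it.
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                frontier.extend(neighbours[j])  # j is a core point too, keep expanding
        cluster += 1
    return labels


# Two tight groups and one isolated point; no cluster count is passed in.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (20.0, 20.0)]
labels = dbscan(points, eps=0.5, min_samples=3)
n_clusters = len({label for label in labels if label != NOISE})
print(labels, n_clusters)
```

The isolated point is labelled as noise rather than forced into a cluster, which is the same outlier behaviour that density-based clustering models in general provide.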
 
 ??? note "What clustering model should I choose?"
     Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.).
@@ -174,6 +181,37 @@ To reset topics to the original clustering, use the `reset_topics()` method:
 model.reset_topics()
 ```
 
+### Visualization
+
+You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
+You will first have to install `datamapplot` for this to work:
+
+```bash
+pip install turftopic[datamapplot]
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+!!! info
+    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
+
+<figure>
+<iframe src="../images/cluster_datamapplot.html" title="Cluster visualization" style="height:800px;width:800px;padding:0px;border:none;"></iframe>
+<figcaption> Interactive figure to explore cluster structure in a clustering topic model. </figcaption>
+</figure>
+
+
 ### Manual Topic Merging
 
 You can also manually merge topics using the `join_topics()` method.
