README.md: 26 additions & 0 deletions
@@ -20,6 +20,32 @@

 > This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.

+### New in version 0.10.0
+
+You can interactively explore clusters using `datamapplot` directly in Turftopic!
+You will first have to install `datamapplot` for this to work.
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid")
+model.fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
docs/basics.md: 34 additions & 1 deletion
@@ -282,7 +282,40 @@ model.print_topics()

 ### Visualization

-Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation is fully compatible with Turftopic models.
+#### Datamapplot *(clustering models only)*
+
+You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
+You will first have to install `datamapplot` for this to work:
+
+```bash
+pip install turftopic[datamapplot]
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+!!! info
+    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
docs/clustering.md: 44 additions & 6 deletions
@@ -18,29 +18,36 @@ that the other libraries boast.
 from sklearn.manifold import TSNE
 from turftopic import ClusteringTopicModel

-model = ClusteringTopicModel(clustering=TSNE())
+model = ClusteringTopicModel(dimensionality_reduction=TSNE())
 ```

 It is common practice to reduce the dimensionality of the embeddings before clustering them.
 This is to avoid the curse of dimensionality, an issue, which many clustering models are affected by.
-Dimensionality reduction by default is done with scikit-learn's **TSNE** implementation in Turftopic,
+Dimensionality reduction by default is done with [**TSNE**](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne) in Turftopic,
 but users are free to specify the model that will be used for dimensionality reduction.

+!!! tip "Use openTSNE for better performance!"
+    By default, a scikit-learn implementation is used, but if you have the [openTSNE](https://github.com/pavlin-policar/openTSNE) package installed on your system, Turftopic will automatically use it.
+    You can potentially speed up your clustering topic models by multiple orders of magnitude.
+    ```bash
+    pip install turftopic[opentsne]
+    ```
+
 ??? note "What reduction model should I choose?"
     Our knowledge about the impacts of choice of dimensionality reduction is limited, and has not yet been explored in the literature.
-    Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
+    Top2Vec and BERTopic both use [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html), which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).

 ### Clustering

 ```python
-from sklearn.cluster import OPTICS
+from sklearn.cluster import HDBSCAN
 from turftopic import ClusteringTopicModel

-model = ClusteringTopicModel(clustering=OPTICS())
+model = ClusteringTopicModel(clustering=HDBSCAN())
 ```

 After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
-As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses **OPTICS** as its default.
+Turftopic uses [**HDBSCAN**](https://scikit-learn.org/stable/modules/clustering.html#hdbscan) as its default.

 ??? note "What clustering model should I choose?"
     Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.).
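
Editor's note on the openTSNE tip in the hunk above: the docs say Turftopic uses openTSNE automatically when it is installed. The sketch below is one way to check, with the standard library only, which of the two backends is importable in your environment. The helper `pick_tsne_backend` is hypothetical and is not Turftopic's own selection code; it only assumes that openTSNE is importable as `openTSNE` and scikit-learn as `sklearn`.

```python
from importlib.util import find_spec
from typing import Optional, Sequence


def pick_tsne_backend(
    candidates: Sequence[str] = ("openTSNE", "sklearn"),
) -> Optional[str]:
    """Return the first importable top-level module in `candidates`, else None.

    Hypothetical helper for illustration; Turftopic's real logic may differ.
    """
    for name in candidates:
        # find_spec returns None when a top-level module cannot be found.
        if find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    backend = pick_tsne_backend()
    print(f"Preferred TSNE backend available here: {backend}")
```

If this reports `sklearn`, installing the `opentsne` extra shown in the tip is what the docs suggest for faster fitting.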
@@ -174,6 +181,37 @@ To reset topics to the original clustering, use the `reset_topics()` method:
 model.reset_topics()
 ```

+### Visualization
+
+You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
+You will first have to install `datamapplot` for this to work:
+
+```bash
+pip install turftopic[datamapplot]
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.save("clusters_visualization.html")
+fig
+```
+!!! info
+    If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
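
Editor's note on the `fig.show()` advice above: a script that may run both inside and outside Jupyter can make the choice at runtime. The sketch below uses a common heuristic (checking the IPython shell class name), which is not an official API and may misclassify some environments. Both `running_in_notebook` and `display_figure` are hypothetical helper names, and the only assumption about `fig` is that it has the `show()` method the docs mention.

```python
def running_in_notebook() -> bool:
    """Best-effort check for a Jupyter kernel (heuristic, not an official API)."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook kernel.
        return False
    shell = get_ipython()
    return shell is not None and type(shell).__name__ == "ZMQInteractiveShell"


def display_figure(fig):
    """Call fig.show() outside notebooks; always return fig.

    Returning fig lets a notebook cell render it as its last expression.
    """
    if not running_in_notebook():
        fig.show()
    return fig
```

With the snippet from the diff, this would be used as `display_figure(model.plot_clusters_datamapplot())`.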