docs/clustering.md
9 additions & 0 deletions
@@ -122,6 +122,7 @@ By and large there are two types of methods that can be used for importance esti
 | - | - | - | - |
 |`soft-c-tf-idf`*(default)*| Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignment in models like Gaussian Mixtures, less sensitive to stop words than vanilla c-tf-idf. |
 |`fighting-words`**(NEW)**| Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. See [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
+|`npmi`**(NEW)**| Lexical | Estimate term importance from mutual information between cluster labels and term occurrence. | Theoretically motivated, fast, and usually produces clean topics. |
 |`c-tf-idf`| Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
 |`centroid`| Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean topics, easily interpretable. |
 |`linear`**(NEW, EXPERIMENTAL)**| Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
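For readers unfamiliar with the measure named in the `npmi` row, the following is a minimal sketch of normalized pointwise mutual information between cluster membership and term occurrence. It is not turftopic's implementation: `npmi_importance` is a hypothetical helper, probabilities are estimated as plain document-level frequencies, and the treatment of zero and certain co-occurrence is an assumed convention.

```python
import numpy as np


def npmi_importance(doc_term, labels):
    """Hypothetical helper (not part of turftopic): NPMI of each term with each cluster.

    doc_term: (n_docs, n_terms) count matrix; labels: (n_docs,) hard cluster labels.
    Returns the sorted unique labels and an (n_clusters, n_terms) score matrix in [-1, 1].
    """
    occurs = (np.asarray(doc_term) > 0).astype(float)  # does the term occur in the doc?
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    membership = (labels[:, None] == clusters[None, :]).astype(float)

    n_docs = occurs.shape[0]
    p_joint = membership.T @ occurs / n_docs  # P(doc in cluster AND term occurs)
    p_term = occurs.mean(axis=0)              # P(term occurs)
    p_cluster = membership.mean(axis=0)       # P(doc in cluster)

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint) - np.log(p_cluster[:, None]) - np.log(p_term[None, :])
        npmi = pmi / -np.log(p_joint)

    # Assumed conventions for the degenerate cases:
    npmi[p_joint == 0] = -1.0  # term never co-occurs with the cluster
    npmi[p_joint == 1] = 0.0   # term and cluster both cover every document
    return clusters, npmi


# Toy corpus: term 0 is exclusive to cluster 0, term 1 to cluster 1, term 2 is everywhere.
doc_term = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]])
labels = np.array([0, 0, 1, 1])
clusters, scores = npmi_importance(doc_term, labels)
print(scores)
```

Terms that occur only within one cluster score close to 1, terms spread evenly across clusters score close to 0, and terms that never appear in a cluster score -1.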
@@ -195,6 +196,14 @@ By and large there are two types of methods that can be used for importance esti
 model = ClusteringTopicModel(feature_importance="linear")
 ```
 
+=== "NPMI"
+
+```python
+from turftopic import ClusteringTopicModel
+
+model = ClusteringTopicModel(feature_importance="npmi")
+```
+
 
 
 You can also choose to recalculate term importances with a different method after fitting the model: