Commit 6a02107

Merge pull request #76 from x-tabdeveloping/chinese
Vectorization utilities
2 parents 4733be8 + c166deb commit 6a02107

26 files changed, +3032 -382 lines changed

README.md

Lines changed: 24 additions & 26 deletions
@@ -16,48 +16,46 @@
 - Streamlined scikit-learn compatible API 🛠️
 - Easy topic interpretation 🔍
 - Automated topic naming with LLMs
+- Topic modeling with keyphrases :key:
+- Lemmatization and Stemming
 - Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️
 
 > This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.
 
-### New in version 0.10.0
+## New in version 0.11.0: Vectorizers Module
 
-You can interactively explore clusters using `datamapplot` directly in Turftopic!
-You will first have to install `datamapplot` for this to work.
+You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.
 
 ```python
-from turftopic import ClusteringTopicModel
-from turftopic.namers import OpenAITopicNamer
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
 
-model = ClusteringTopicModel(feature_importance="centroid")
+model = KeyNMF(
+    n_components=10,
+    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
+)
 model.fit(corpus)
-
-namer = OpenAITopicNamer("gpt-4o-mini")
-model.rename_topics(namer)
-
-fig = model.plot_clusters_datamapplot()
-fig.save("clusters_visualization.html")
-fig
+model.print_topics()
 ```
-> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
-
-<figure>
-<img src="docs/images/cluster_datamapplot.png" width="70%" style="margin-left: auto;margin-right: auto;">
-<figcaption>Interactive figure to explore cluster structure in a clustering topic model.</figcaption>
-</figure>
 
-### New in version 0.9.0
+| Topic ID | Highest Ranking |
+| - | - |
+| | ... |
+| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
+| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
+| | ... |
 
-#### Dynamic S³ 🧭
+Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**.
 
-You can now use Semantic Signal Separation in a dynamic fashion.
-This allows you to investigate how semantic axes fluctuate over time, and how their content changes.
 ```python
-from turftopic import SemanticSignalSeparation
+from turftopic.vectorizers.chinese import default_chinese_vectorizer
+from turftopic.vectorizers.spacy import TokenCountVectorizer
 
-model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
+chinese_vectorizer = default_chinese_vectorizer()
+arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True)
+danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True)
+...
 
-model.plot_topics_over_time()
 ```
 
 
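
Note on the vectorizers added above: as the class names suggest, they appear to be count vectorizers whose unit of counting is a phrase, lemma, or stem rather than a raw token. The following is a minimal, stdlib-only sketch of that general idea and is not Turftopic's implementation. `naive_stem`, `stem_analyzer`, and `count_vectorize` are hypothetical names invented for illustration, and the suffix stripper is a toy stand-in for a real stemmer.

```python
import re
from collections import Counter


def naive_stem(token):
    """Toy suffix stripper for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ers", "er", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_analyzer(text):
    """Lowercase, split into alphabetic tokens, then stem each token."""
    return [naive_stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]


def count_vectorize(docs, analyzer):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(analyzer(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["Topic models model topics", "Modeling with vectorizers"]
vocabulary, rows = count_vectorize(docs, stem_analyzer)
print(vocabulary)
print(rows)
```

Because the analyzer is a plain function, swapping in a phrase extractor or a lemmatizer changes the vocabulary without touching the counting step, which is the design the `vectorizer=` argument in the diff seems to rely on.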
