
Commit ad4ecb0

Added TokenCountVectorizer to docs, moved some things into tabs
1 parent e5e6500 commit ad4ecb0

File tree

docs/model_definition_and_training.md
docs/vectorizers.md
mkdocs.yml

3 files changed: +151 -66 lines changed


docs/model_definition_and_training.md

Lines changed: 108 additions & 64 deletions
@@ -19,59 +19,29 @@ You might want to have a look at the [Models](models.md) page in order to make a

Here are some examples of models you can load and use in the package:

-<table>
-<tr>
-<td> Model </td> <td> Example Definition </td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/KeyNMF/">KeyNMF</a>
-
-</td>
-<td>
-
-```python
-from turftopic import KeyNMF
-
-model = KeyNMF(n_components=10, top_n=15)
-```
-
-</td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/clustering/">ClusteringTopicModel</a>
-
-</td>
-<td>
-
-```python
-from turftopic import ClusteringTopicModel
-
-model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
-```
-
-</td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/s3/">SemanticSignalSeparation</a>
-
-</td>
-<td>
-
-```python
-from turftopic import SemanticSignalSeparation
-
-model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
-```
-
-</td>
-</tr>
-</table>
+=== "KeyNMF"
+
+    ```python
+    from turftopic import KeyNMF
+
+    model = KeyNMF(n_components=10, top_n=15)
+    ```
+
+=== "ClusteringTopicModel"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
+    ```
+
+=== "SemanticSignalSeparation"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
+    ```

### 2. [Vectorizer](../vectorizers.md)

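Aside (not part of the diff): every model defined in the tabs above is trained the same way. A minimal sketch, assuming `corpus` is a list of raw document strings as in the other examples in this commit:

```python
from turftopic import KeyNMF

# Any of the model definitions from the tabs above can be swapped in here.
model = KeyNMF(n_components=10, top_n=15)

# `corpus` is assumed to be a list of raw document strings (not shown here).
model.fit(corpus)
model.print_topics()
```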

@@ -88,19 +58,73 @@ default_vectorizer = CountVectorizer(min_df=10, stop_words="english")
```

You can add a custom vectorizer to a topic model upon initializing it,
-thereby getting different behaviours. You can for instance use noun-phrases in your model instead of words by using NounPhraseCountVectorizer:
+thereby getting different behaviours. You can for instance use noun-phrases in your model instead of words by using `NounPhraseCountVectorizer` or estimate parameters for lemmas by using `LemmaCountVectorizer`

-```bash
-pip install turftopic[spacy]
-python -m spacy download "en_core_web_sm"
-```
-
-```python
-from turftopic import KeyNMF
-from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
-
-model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer())
-```
+=== "Noun Phrase Extraction"
+
+    ```bash
+    pip install turftopic[spacy]
+    python -m spacy download "en_core_web_sm"
+    ```
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
+
+    model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer("en_core_web_sm"))
+    model.fit(corpus)
+    model.print_topics()
+    ```
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | | ... |
+    | 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
+    | 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
+    | | ... |
+
+=== "Lemma Extraction"
+
+    ```bash
+    pip install turftopic[spacy]
+    python -m spacy download "en_core_web_sm"
+    ```
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import LemmaCountVectorizer
+
+    model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
+    model.fit(corpus)
+    model.print_topics()
+    ```
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | atheist, theist, belief, christians, agnostic, christian, mythology, asimov, abortion, read |
+    | 1 | morality, moral, immoral, objective, society, animal, natural, societal, murder, morally |
+    | | ... |
+
+=== "Multilingual Tokenization (Arabic example)"
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import TokenCountVectorizer
+
+    # CountVectorizer for Arabic
+    vectorizer = TokenCountVectorizer("ar", min_df=10)
+
+    model = KeyNMF(
+        n_components=10,
+        vectorizer=vectorizer,
+        encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
+    )
+    model.fit(corpus)
+
+    ```

### 3. [Encoder](../encoders.md)

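Aside (not part of the diff): the same `vectorizer=` hook also accepts a plain scikit-learn `CountVectorizer` — a minimal sketch reusing the default settings quoted in the hunk header above, with `corpus` again assumed to be a list of document strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

# Same settings as the package default shown in the hunk context above.
vectorizer = CountVectorizer(min_df=10, stop_words="english")

model = KeyNMF(n_components=10, vectorizer=vectorizer)
model.fit(corpus)  # `corpus`: a list of raw document strings (assumed)
model.print_topics()
```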

@@ -125,15 +149,35 @@ A Namer is an optional part of your topic modeling pipeline, that can automatica
Namers are technically **not part of your topic model**, and should be used *after training*.
See a detailed guide [here](../namers.md).

-```python
-from turftopic import KeyNMF
-from turftopic.namers import LLMTopicNamer
-
-model = KeyNMF(10).fit(corpus)
-namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
-
-model.rename_topics(namer)
-```
+=== "LLM from HuggingFace"
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.namers import LLMTopicNamer
+
+    model = KeyNMF(10).fit(corpus)
+    namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
+
+    model.rename_topics(namer)
+    ```
+
+=== "ChatGPT"
+    ```bash
+    pip install openai
+    export OPENAI_API_KEY="sk-<your key goes here>"
+    ```
+    ```python
+    from turftopic.namers import OpenAITopicNamer
+
+    namer = OpenAITopicNamer("gpt-4o-mini")
+    model.rename_topics(namer)
+    model.print_topics()
+    ```
+
+    | Topic ID | Topic Name | Highest Ranking |
+    | - | - | - |
+    | 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
+    | 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
+    | | ... |

## Training and Inference

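Aside (not part of the diff): both namer tabs above share the same flow — fit first, then name, then print. A minimal end-to-end sketch with the HuggingFace namer, `corpus` again assumed to be a list of document strings:

```python
from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

# Namers are applied *after* training.
model = KeyNMF(10).fit(corpus)  # `corpus`: a list of raw document strings (assumed)

# Generate human-readable topic names and attach them to the fitted model.
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)
model.print_topics()  # topics now carry the generated names
```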

docs/vectorizers.md

Lines changed: 41 additions & 2 deletions
@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som

### Extracting lemmata with `LemmaCountVectorizer`

-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a SpaCy pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

```bash
@@ -173,12 +173,49 @@ model.print_topics()
| 4 | atheist, theist, belief, asimov, philosoph, mytholog, strong, faq, agnostic, weak |
| | | ... |

-## Chinese Vectorizer
+## Non-English Vectorization
+
+You may find that, especially with non-Indo-European languages, `CountVectorizer` does not perform that well.
+In these cases we recommend that you use a vectorizer with its own language-specific tokenization rules and stop-word list:
+
+### Vectorizing Any Language with `TokenCountVectorizer`
+
+The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+We provide a vectorizer that you can use with the language of your choice.
+
+```bash
+pip install turftopic[spacy]
+```
+
+!!! note
+    Note that you do not have to install any SpaCy pipelines for this to work.
+    No pipelines or models will be loaded with `TokenCountVectorizer` only a language-specific tokenizer.
+
+```python
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import TokenCountVectorizer
+
+# CountVectorizer for Arabic
+vectorizer = TokenCountVectorizer("ar", min_df=10)
+
+model = KeyNMF(
+    n_components=10,
+    vectorizer=vectorizer,
+    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
+)
+model.fit(corpus)
+
+```
+
+### Extracting Chinese Tokens with `ChineseCountVectorizer`

The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages.
You thus need to use special tokenization rules for Chinese.
Turftopic provides tools for Chinese tokenization via the [Jieba](https://github.com/fxsjy/jieba) package.

+!!! note
+    We recommend that you use Jieba over SpaCy for topic modeling with Chinese.
+
You will need to install the package in order to be able to use our Chinese vectorizer.

```bash
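Aside (not part of the diff): the same pattern should work for any language SpaCy has tokenization rules for — a hypothetical Danish variation of the Arabic example above; the "da" language code and the multilingual encoder name are illustrative choices, not taken from this commit:

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer

# Danish tokenization and stop words; no SpaCy pipeline download needed.
vectorizer = TokenCountVectorizer("da", min_df=10)

model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    # A multilingual sentence encoder chosen for illustration only.
    encoder="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
model.fit(corpus)  # `corpus`: a list of Danish documents (assumed)
```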
@@ -213,6 +250,8 @@ model.print_topics()

:::turftopic.vectorizers.spacy.LemmaCountVectorizer

+:::turftopic.vectorizers.spacy.TokenCountVectorizer
+
:::turftopic.vectorizers.snowball.StemmingCountVectorizer

:::turftopic.vectorizers.chinese.ChineseCountVectorizer

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ markdown_extensions:
 - admonition
 - pymdownx.details
 - pymdownx.superfences
+- pymdownx.tabbed:
+    alternate_style: true
 - toc:
     toc_depth: 2
 - pymdownx.arithmatex:
