
Commit ad4ecb0

Added TokenCountVectorizer to docs, moved some things into tabs
1 parent e5e6500 commit ad4ecb0

File tree

docs/model_definition_and_training.md
docs/vectorizers.md
mkdocs.yml

3 files changed: +151 -66 lines changed


docs/model_definition_and_training.md

Lines changed: 108 additions & 64 deletions
@@ -19,59 +19,29 @@ You might want to have a look at the [Models](models.md) page in order to make a

Here are some examples of models you can load and use in the package:

-<table>
-<tr>
-<td> Model </td> <td> Example Definition </td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/KeyNMF/">KeyNMF</a>
-
-</td>
-<td>
-
-```python
-from turftopic import KeyNMF
-
-model = KeyNMF(n_components=10, top_n=15)
-```
-
-</td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/clustering/">ClusteringTopicModel</a>
-
-</td>
-<td>
-
-```python
-from turftopic import ClusteringTopicModel
-
-model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
-```
-
-</td>
-</tr>
-<tr>
-<td>
-
-<a href="https://x-tabdeveloping.github.io/turftopic/s3/">SemanticSignalSeparation</a>
-
-</td>
-<td>
-
-```python
-from turftopic import SemanticSignalSeparation
-
-model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
-```
-
-</td>
-</tr>
-</table>
+=== "KeyNMF"
+
+    ```python
+    from turftopic import KeyNMF
+
+    model = KeyNMF(n_components=10, top_n=15)
+    ```
+
+=== "ClusteringTopicModel"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
+    ```
+
+=== "SemanticSignalSeparation"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
+    ```

### 2. [Vectorizer](../vectorizers.md)

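Aside (not part of the diff): every model defined in the tabs above is trained the same way. A minimal sketch, assuming `corpus` is a list of raw document strings as in the other examples in this commit:

```python
from turftopic import KeyNMF

# Any of the model definitions from the tabs above can be swapped in here.
model = KeyNMF(n_components=10, top_n=15)

# `corpus` is assumed to be a list of raw document strings (not shown here).
model.fit(corpus)
model.print_topics()
```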

@@ -88,19 +58,73 @@ default_vectorizer = CountVectorizer(min_df=10, stop_words="english")
```

You can add a custom vectorizer to a topic model upon initializing it,
-thereby getting different behaviours. You can for instance use noun-phrases in your model instead of words by using NounPhraseCountVectorizer:
+thereby getting different behaviours. You can for instance use noun-phrases in your model instead of words by using `NounPhraseCountVectorizer` or estimate parameters for lemmas by using `LemmaCountVectorizer`

-```bash
-pip install turftopic[spacy]
-python -m spacy download "en_core_web_sm"
-```
-
-```python
-from turftopic import KeyNMF
-from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
-
-model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer())
-```
+=== "Noun Phrase Extraction"
+
+    ```bash
+    pip install turftopic[spacy]
+    python -m spacy download "en_core_web_sm"
+    ```
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
+
+    model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer("en_core_web_sm"))
+    model.fit(corpus)
+    model.print_topics()
+    ```
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | | ... |
+    | 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
+    | 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
+    | | ... |
+
+=== "Lemma Extraction"
+
+    ```bash
+    pip install turftopic[spacy]
+    python -m spacy download "en_core_web_sm"
+    ```
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import LemmaCountVectorizer
+
+    model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
+    model.fit(corpus)
+    model.print_topics()
+    ```
+
+    | Topic ID | Highest Ranking |
+    | - | - |
+    | 0 | atheist, theist, belief, christians, agnostic, christian, mythology, asimov, abortion, read |
+    | 1 | morality, moral, immoral, objective, society, animal, natural, societal, murder, morally |
+    | | ... |
+
+=== "Multilingual Tokenization (Arabic example)"
+
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.vectorizers.spacy import TokenCountVectorizer
+
+    # CountVectorizer for Arabic
+    vectorizer = TokenCountVectorizer("ar", min_df=10)
+
+    model = KeyNMF(
+        n_components=10,
+        vectorizer=vectorizer,
+        encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
+    )
+    model.fit(corpus)
+
+    ```

### 3. [Encoder](../encoders.md)

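Aside (not part of the diff): the same `vectorizer=` hook also accepts a plain scikit-learn `CountVectorizer` — a minimal sketch reusing the default settings quoted in the hunk header above, with `corpus` again assumed to be a list of document strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

# Same settings as the package default shown in the hunk context above.
vectorizer = CountVectorizer(min_df=10, stop_words="english")

model = KeyNMF(n_components=10, vectorizer=vectorizer)
model.fit(corpus)  # `corpus`: a list of raw document strings (assumed)
model.print_topics()
```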

@@ -125,15 +149,35 @@ A Namer is an optional part of your topic modeling pipeline, that can automatica
Namers are technically **not part of your topic model**, and should be used *after training*.
See a detailed guide [here](../namers.md).

-```python
-from turftopic import KeyNMF
-from turftopic.namers import LLMTopicNamer
-
-model = KeyNMF(10).fit(corpus)
-namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
-
-model.rename_topics(namer)
-```
+=== "LLM from HuggingFace"
+    ```python
+    from turftopic import KeyNMF
+    from turftopic.namers import LLMTopicNamer
+
+    model = KeyNMF(10).fit(corpus)
+    namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
+
+    model.rename_topics(namer)
+    ```
+
+=== "ChatGPT"
+    ```bash
+    pip install openai
+    export OPENAI_API_KEY="sk-<your key goes here>"
+    ```
+    ```python
+    from turftopic.namers import OpenAITopicNamer
+
+    namer = OpenAITopicNamer("gpt-4o-mini")
+    model.rename_topics(namer)
+    model.print_topics()
+    ```
+
+    | Topic ID | Topic Name | Highest Ranking |
+    | - | - | - |
+    | 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
+    | 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
+    | | ... |

## Training and Inference

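Aside (not part of the diff): both namer tabs above share the same flow — fit first, then name, then print. A minimal end-to-end sketch with the HuggingFace namer, `corpus` again assumed to be a list of document strings:

```python
from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

# Namers are applied *after* training.
model = KeyNMF(10).fit(corpus)  # `corpus`: a list of raw document strings (assumed)

# Generate human-readable topic names and attach them to the fitted model.
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model.rename_topics(namer)
model.print_topics()  # topics now carry the generated names
```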

docs/vectorizers.md

Lines changed: 41 additions & 2 deletions
@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som

### Extracting lemmata with `LemmaCountVectorizer`

-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a SpaCy pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

```bash
@@ -173,12 +173,49 @@ model.print_topics()
| 4 | atheist, theist, belief, asimov, philosoph, mytholog, strong, faq, agnostic, weak |
| | | ... |

-## Chinese Vectorizer
+## Non-English Vectorization
+
+You may find that, especially with non-Indo-European languages, `CountVectorizer` does not perform that well.
+In these cases we recommend that you use a vectorizer with its own language-specific tokenization rules and stop-word list:
+
+### Vectorizing Any Language with `TokenCountVectorizer`
+
+The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+We provide a vectorizer that you can use with the language of your choice.
+
+```bash
+pip install turftopic[spacy]
+```
+
+!!! note
+    Note that you do not have to install any SpaCy pipelines for this to work.
+    No pipelines or models will be loaded with `TokenCountVectorizer` only a language-specific tokenizer.
+
+```python
+from turftopic import KeyNMF
+from turftopic.vectorizers.spacy import TokenCountVectorizer
+
+# CountVectorizer for Arabic
+vectorizer = TokenCountVectorizer("ar", min_df=10)
+
+model = KeyNMF(
+    n_components=10,
+    vectorizer=vectorizer,
+    encoder="Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet"
+)
+model.fit(corpus)
+
+```
+
+### Extracting Chinese Tokens with `ChineseCountVectorizer`

The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages.
You thus need to use special tokenization rules for Chinese.
Turftopic provides tools for Chinese tokenization via the [Jieba](https://github.com/fxsjy/jieba) package.

+!!! note
+    We recommend that you use Jieba over SpaCy for topic modeling with Chinese.
+
You will need to install the package in order to be able to use our Chinese vectorizer.

```bash
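Aside (not part of the diff): the same pattern should work for any language SpaCy has tokenization rules for — a hypothetical Danish variation of the Arabic example above; the "da" language code and the multilingual encoder name are illustrative choices, not taken from this commit:

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import TokenCountVectorizer

# Danish tokenization and stop words; no SpaCy pipeline download needed.
vectorizer = TokenCountVectorizer("da", min_df=10)

model = KeyNMF(
    n_components=10,
    vectorizer=vectorizer,
    # A multilingual sentence encoder chosen for illustration only.
    encoder="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
model.fit(corpus)  # `corpus`: a list of Danish documents (assumed)
```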
@@ -213,6 +250,8 @@ model.print_topics()

:::turftopic.vectorizers.spacy.LemmaCountVectorizer

+:::turftopic.vectorizers.spacy.TokenCountVectorizer
+
:::turftopic.vectorizers.snowball.StemmingCountVectorizer

:::turftopic.vectorizers.chinese.ChineseCountVectorizer

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -60,6 +60,8 @@ markdown_extensions:
 - admonition
 - pymdownx.details
 - pymdownx.superfences
+- pymdownx.tabbed:
+    alternate_style: true
 - toc:
     toc_depth: 2
 - pymdownx.arithmatex:
