|
16 | 16 | - Streamlined scikit-learn compatible API 🛠️ |
17 | 17 | - Easy topic interpretation 🔍 |
18 | 18 | - Automated topic naming with LLMs |
| 19 | + - Topic modeling with keyphrases :key: |
| 20 | + - Lemmatization and stemming |
19 | 21 | - Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️ |
20 | 22 |
|
21 | 23 | > This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue. |
22 | 24 |
|
23 | | -### New in version 0.10.0 |
| 25 | +## New in version 0.11.0: Vectorizers Module |
24 | 26 |
|
25 | | -You can interactively explore clusters using `datamapplot` directly in Turftopic! |
26 | | -You will first have to install `datamapplot` for this to work. |
| 27 | +You can now use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**. |
27 | 28 |
|
28 | 29 | ```python |
29 | | -from turftopic import ClusteringTopicModel |
30 | | -from turftopic.namers import OpenAITopicNamer |
| 30 | +from turftopic import KeyNMF |
| 31 | +from turftopic.vectorizers.spacy import NounPhraseCountVectorizer |
31 | 32 |
|
32 | | -model = ClusteringTopicModel(feature_importance="centroid") |
| 33 | +model = KeyNMF( |
| 34 | + n_components=10, |
| 35 | + vectorizer=NounPhraseCountVectorizer("en_core_web_sm"), |
| 36 | +) |
33 | 37 | model.fit(corpus) |
34 | | - |
35 | | -namer = OpenAITopicNamer("gpt-4o-mini") |
36 | | -model.rename_topics(namer) |
37 | | - |
38 | | -fig = model.plot_clusters_datamapplot() |
39 | | -fig.save("clusters_visualization.html") |
40 | | -fig |
| 38 | +model.print_topics() |
41 | 39 | ``` |
42 | | -> If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure. |
43 | | -
|
44 | | -<figure> |
45 | | - <img src="docs/images/cluster_datamapplot.png" width="70%" style="margin-left: auto;margin-right: auto;"> |
46 | | - <figcaption>Interactive figure to explore cluster structure in a clustering topic model.</figcaption> |
47 | | -</figure> |
48 | 40 |
|
49 | | -### New in version 0.9.0 |
| 41 | +| Topic ID | Highest Ranking | |
| 42 | +| - | - | |
| 43 | +| | ... | |
| 44 | +| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism | |
| 45 | +| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index | |
| 46 | +| | ... | |
50 | 47 |
|
51 | | -#### Dynamic S³ 🧭 |
| 48 | +Turftopic now also comes with a **Chinese vectorizer** for easier use, as well as a generalist **multilingual vectorizer**. |
52 | 49 |
|
53 | | -You can now use Semantic Signal Separation in a dynamic fashion. |
54 | | -This allows you to investigate how semantic axes fluctuate over time, and how their content changes. |
55 | 50 | ```python |
56 | | -from turftopic import SemanticSignalSeparation |
| 51 | +from turftopic.vectorizers.chinese import default_chinese_vectorizer |
| 52 | +from turftopic.vectorizers.spacy import TokenCountVectorizer |
57 | 53 |
|
58 | | -model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10) |
| 54 | +chinese_vectorizer = default_chinese_vectorizer() |
| 55 | +arabic_vectorizer = TokenCountVectorizer("ar", remove_stopwords=True) |
| 56 | +danish_vectorizer = TokenCountVectorizer("da", remove_stopwords=True) |
| 57 | +... |
59 | 58 |
|
60 | | -model.plot_topics_over_time() |
61 | 59 | ``` |
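To make the idea of topic modeling over **phrases** more concrete, here is a minimal, standard-library-only sketch of a vectorizer that counts phrases instead of single tokens. It is not Turftopic's implementation: `ToyPhraseVectorizer` is a hypothetical class, it treats every pair of adjacent words as a "phrase" rather than using a language model to find noun phrases, and it returns plain lists instead of a sparse matrix. It only mirrors the `fit_transform` / `get_feature_names_out` shape that scikit-learn style vectorizers expose.

```python
from collections import Counter


class ToyPhraseVectorizer:
    """Hypothetical stand-in for a phrase vectorizer (illustration only).

    Treats every pair of adjacent lowercase words as a "phrase".
    """

    def _phrases(self, text):
        # Split on whitespace and join neighbouring tokens into bigrams.
        tokens = text.lower().split()
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

    def fit_transform(self, docs):
        # Count phrases per document, then build a sorted shared vocabulary.
        counts = [Counter(self._phrases(doc)) for doc in docs]
        self.phrases_ = sorted(set().union(*counts))
        # One row per document, one column per phrase in the vocabulary.
        return [[c[p] for p in self.phrases_] for c in counts]

    def get_feature_names_out(self):
        return list(self.phrases_)


docs = ["strong atheism and strong theism", "strong atheism index"]
vectorizer = ToyPhraseVectorizer()
matrix = vectorizer.fit_transform(docs)

for name, column in zip(vectorizer.get_feature_names_out(), zip(*matrix)):
    print(f"{name!r}: {list(column)}")
```

Each column of `matrix` corresponds to one phrase, so a topic model fitted on such a matrix would describe topics with phrases like "strong atheism" rather than isolated words.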
62 | 60 |
|
63 | 61 |
|
|