Commit 6dfbcfb

Merge pull request #72 from x-tabdeveloping/dynamic_s3
Dynamic $S^3$
2 parents 1da38f1 + 15f08be commit 6dfbcfb

9 files changed

+323
-26
lines changed

README.md

Lines changed: 30 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -20,30 +20,20 @@
2020

2121
> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening relevant issues.
2222
23-
### New in version 0.8.0
23+
### New in version 0.9.0
2424

25-
#### Automated Topic Naming
26-
27-
Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!
25+
#### Dynamic S³ 🧭
2826

27+
You can now use Semantic Signal Separation in a dynamic fashion.
28+
This allows you to investigate how semantic axes fluctuate over time, and how their content changes.
2929
```python
30-
from turftopic import KeyNMF
31-
from turftopic.namers import OpenAITopicNamer
30+
from turftopic import SemanticSignalSeparation
3231

33-
model = KeyNMF(10).fit(corpus)
32+
model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
3433

35-
namer = OpenAITopicNamer("gpt-4o-mini")
36-
model.rename_topics(namer)
37-
model.print_topics()
34+
model.plot_topics_over_time()
3835
```
3936

40-
| Topic ID | Topic Name | Highest Ranking |
41-
| - | - | - |
42-
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
43-
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
44-
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
45-
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
46-
| | ... |
4737

4838
## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
4939
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -143,6 +133,29 @@ model.print_topic_distribution(
143133

144134
</center>
145135

136+
#### Automated Topic Naming
137+
138+
Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!
139+
140+
```python
141+
from turftopic import KeyNMF
142+
from turftopic.namers import OpenAITopicNamer
143+
144+
model = KeyNMF(10).fit(corpus)
145+
146+
namer = OpenAITopicNamer("gpt-4o-mini")
147+
model.rename_topics(namer)
148+
model.print_topics()
149+
```
150+
151+
| Topic ID | Topic Name | Highest Ranking |
152+
| - | - | - |
153+
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
154+
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
155+
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
156+
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
157+
| | ... |
158+
146159
### Visualization
147160

148161
Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.

docs/KeyNMF.md

Lines changed: 3 additions & 3 deletions
@@ -221,12 +221,12 @@ pip install plotly
221221
```
222222

223223
```python
224-
model.plot_topics_over_time(top_k=5)
224+
model.plot_topics_over_time()
225225
```
226226

227227
<figure>
228-
<img src="../images/dynamic_keynmf.png" width="50%" style="margin-left: auto;margin-right: auto;">
229-
<figcaption>Topics over time on a Figure</figcaption>
228+
<iframe src="../images/dynamic_keynmf.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
229+
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
230230
</figure>
231231

232232
### Online Topic Modeling

docs/dynamic.md

Lines changed: 5 additions & 4 deletions
@@ -11,14 +11,15 @@ In Turftopic you can currently use three different topic models for modeling top
1111
1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
1212
2. [GMM](GMM.md), similarly to clustering models, term importances are reestimated per time slice
1313
3. [KeyNMF](KeyNMF.md), an overall decomposition is done, then using coordinate descent, topic-term-matrices are recalculated based on document-topic importances in the given time slice.
14+
4. [SemanticSignalSeparation](s3.md), a global model is fitted and then local models are inferred using linear regression from embeddings and document-topic signals in a given time-slice.
1415

1516
## Usage
1617

1718
Dynamic topic models in Turftopic have a unified interface.
1819
To fit a dynamic topic model you will need a corpus that has been annotated with timestamps.
1920
The timestamps need to be Python `datetime` objects, but pandas `Timestamp` objects are also supported.
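
If your timestamps are stored as strings, they need to be converted before fitting. A minimal sketch, assuming the dates are ISO 8601 strings; the variable names `raw_dates` and `timestamps` are illustrative and do not come from Turftopic:

```python
from datetime import datetime

# Hypothetical input: dates stored as ISO 8601 strings.
raw_dates = ["2018-02-12", "2018-03-01", "2019-11-30"]

# Convert each string into a Python `datetime` object,
# which is the type the dynamic interface expects.
timestamps = [datetime.fromisoformat(d) for d in raw_dates]

print(timestamps[0])
```

The resulting list can then be passed as the `timestamps` argument, one entry per document in the corpus.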
2021

21-
Models that have dynamic modeling capabilities (`KeyNMF`, `GMM` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method, that fits the model on the corpus over time.
22+
Models that have dynamic modeling capabilities (`KeyNMF`, `GMM`, `SemanticSignalSeparation` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method that fits the model on the corpus over time.
2223

2324
```python
2425
from datetime import datetime
@@ -69,12 +70,12 @@ pip install plotly
6970
```
7071

7172
```python
72-
model.plot_topics_over_time(top_k=5)
73+
model.plot_topics_over_time()
7374
```
7475

7576
<figure>
76-
<img src="../images/dynamic_keynmf.png" width="80%" style="margin-left: auto;margin-right: auto;">
77-
<figcaption>Topics over time on a Figure</figcaption>
77+
<iframe src="../images/dynamic_keynmf.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
78+
<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
7879
</figure>
7980

8081
## API reference

docs/images/dynamic_keynmf.html

Lines changed: 14 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/dynamic_s3.html

Lines changed: 14 additions & 0 deletions
Large diffs are not rendered by default.

docs/images/dynamic_s3.png

134 KB

docs/s3.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,39 @@ Based on our evaluations, however, we recommend that you use axial or combined t
5656
Axial topics tend to result in the most coherent topics, while angular topics result in the most distinct ones.
5757
The combined approach is a reasonable compromise between the two methods, and is thus the default.
5858

59+
### Dynamic Topic Modeling *(Optional)*
60+
61+
$S^3$ can also be used as a dynamic topic model.
62+
Temporally changing components are found using the following steps:
63+
64+
1. Fit a global $S^3$ model over the whole corpus.
65+
2. Estimate an unmixing matrix for each time slice by fitting a linear regression from the embeddings in that slice to the corresponding rows of the document-topic matrix estimated by the global model.
66+
3. Estimate term importances for each time slice in the same way as for the global model.
67+
68+
```python
69+
from datetime import datetime
70+
from turftopic import SemanticSignalSeparation
71+
72+
ts: list[datetime] = [datetime(year=2018, month=2, day=12), ...]
73+
corpus: list[str] = ["First document", ...]
74+
75+
model = SemanticSignalSeparation(10).fit_dynamic(corpus, timestamps=ts, bins=10)
76+
model.plot_topics_over_time()
77+
```
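
The per-slice regression in step 2 can be illustrated with plain NumPy. This is a rough sketch on synthetic data, not Turftopic's implementation: the random embeddings, the `global_unmixing` matrix, the evenly sized time slices, and the use of ordinary least squares without an intercept are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 200 documents, 16-dimensional embeddings, 3 topics, 4 slices.
n_docs, n_dims, n_topics, n_slices = 200, 16, 3, 4
embeddings = rng.normal(size=(n_docs, n_dims))

# Hypothetical global unmixing matrix. In practice the fitted global
# model supplies the document-topic matrix directly.
global_unmixing = rng.normal(size=(n_dims, n_topics))
doc_topic = embeddings @ global_unmixing
doc_topic += 0.01 * rng.normal(size=doc_topic.shape)  # small noise

# Assign documents to time slices (50 documents per slice).
time_labels = np.arange(n_docs) % n_slices

slice_unmixing = []
for t in range(n_slices):
    in_slice = time_labels == t
    # Least squares fit: embeddings[in_slice] @ W_t ~= doc_topic[in_slice]
    W_t, *_ = np.linalg.lstsq(
        embeddings[in_slice], doc_topic[in_slice], rcond=None
    )
    slice_unmixing.append(W_t)

# One (n_dims, n_topics) unmixing matrix per time slice.
slice_unmixing = np.stack(slice_unmixing)
print(slice_unmixing.shape)
```

Because the synthetic document-topic matrix here is generated from a single linear map, every slice recovers roughly the same matrix. On real data the slices would differ, and those differences are what the dynamic model reports.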
78+
79+
!!! info
80+
Topics over time in $S^3$ are treated slightly differently to most other models.
81+
This is because topics are not proportional in $S^3$, and can tip below zero.
82+
In the time slices where a topic is below zero, its **negative definition** is displayed.
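
One way to read the note above is that the sign of a topic in a time slice decides which end of the term ranking describes it. The sketch below illustrates that reading only; the `describe` helper, the vocabulary, and the importance values are hypothetical and are not taken from Turftopic.

```python
import numpy as np

# Hypothetical vocabulary and term importances for one topic in one slice.
vocab = np.array(["war", "peace", "treaty", "army", "truce"])
importance = np.array([0.9, -0.8, -0.3, 0.6, -0.5])


def describe(importance, vocab, topic_strength, top_k=2):
    """Pick describing terms based on the sign of the topic's strength."""
    order = np.argsort(importance)  # ascending by importance
    if topic_strength >= 0:
        picked = order[::-1][:top_k]  # highest-ranking terms
    else:
        picked = order[:top_k]  # lowest-ranking terms (negative definition)
    return vocab[picked].tolist()


positive_terms = describe(importance, vocab, topic_strength=0.4)
negative_terms = describe(importance, vocab, topic_strength=-0.4)
print(positive_terms, negative_terms)
```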
83+
84+
85+
86+
<figure>
87+
<iframe src="../images/dynamic_s3.html" title="Topics over time" style="height:800px;width:1000px;padding:0px;border:none;"></iframe>
88+
<figcaption> Topics over time in a dynamic Semantic Signal Separation model. </figcaption>
89+
</figure>
90+
91+
5992
## Model Refitting
6093

6194
Unlike most other models in Turftopic, $S^3$ can be refit using different parameters and random seeds without needing to initialize the model from scratch.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ line-length=79
66

77
[tool.poetry]
88
name = "turftopic"
9-
version = "0.8.1"
9+
version = "0.9.0"
1010
description = "Topic modeling with contextual representations from sentence transformers."
1111
authors = ["Márton Kardos <[email protected]>"]
1212
license = "MIT"
