Commit 1612c15 (v1.0.0)
1 parent 7a75601

10 files changed: +470 −152 lines

README.md (+60 −22)
@@ -8,7 +8,7 @@
 [![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)


-**[FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model (NeurIPS 2024)](https://arxiv.org/pdf/2405.17978.pdf)**
+**[[NeurIPS 2024] FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model](https://arxiv.org/pdf/2405.17978.pdf)**
 [[Video]](https://recorder-v3.slideslive.com/?share=95127&s=a3c72f9a-4147-4cf0-a7d0-d95e45320df8)
 [[TowardsDataScience Blog]](https://medium.com/@xiaobaowu/easy-fast-and-effective-topic-modeling-for-beginners-with-fastopic-2836781765f0)
 [[Huggingface Blog]](https://huggingface.co/blog/bobxwu/fastopic)
@@ -19,10 +19,10 @@ It leverages optimal transport between the document, topic, and word embeddings

 If you want to use FASTopic, please cite our [paper](https://arxiv.org/pdf/2405.17978.pdf) as

-@article{wu2024fastopic,
-    title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
-    author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
-    journal={arXiv preprint arXiv:2405.17978},
+@inproceedings{wu2024fastopic,
+    title={FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model},
+    author={Wu, Xiaobao and Nguyen, Thong Thanh and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
+    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
     year={2024}
 }

@@ -39,6 +39,7 @@ https://github.com/user-attachments/assets/42fc1f2a-2dc9-49c0-baf2-97b6fd6aea70
 - [Quick Start](#quick-start)
 - [Usage](#usage)
     - [Try FASTopic on your dataset](#try-fastopic-on-your-dataset)
+    - [Save and Load](#save-and-load)
     - [Topic info](#topic-info)
     - [Topic hierarchy](#topic-hierarchy)
     - [Topic weights](#topic-weights)
@@ -79,19 +80,19 @@ Discover topics from 20newsgroups with the topic number as `50`.

 ```python
 from fastopic import FASTopic
+from topmost import Preprocess
 from sklearn.datasets import fetch_20newsgroups
-from topmost.preprocessing import Preprocessing

-docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
+docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

-preprocessing = Preprocessing(vocab_size=10000, stopwords='English')
+preprocess = Preprocess(vocab_size=10000)

-model = FASTopic(50, preprocessing)
-topic_top_words, doc_topic_dist = model.fit_transform(docs)
+model = FASTopic(50, preprocess)
+top_words, doc_topic_dist = model.fit_transform(docs)

 ```

-`topic_top_words` is a list containing the top words of discovered topics.
+`top_words` is a list containing the top words of discovered topics.
 `doc_topic_dist` is the topic distribution of each document (doc-topic distributions),
 a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).

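To make these return values concrete, here is a small inspection sketch (illustrative only; it assumes `top_words` indexes topics in order and `doc_topic_dist` is the $N \times K$ numpy array described above):

```python
import numpy as np

# Top words of the first discovered topic.
print(top_words[0])

# Dominant topic per document: argmax over the K topic columns.
dominant = np.argmax(doc_topic_dist, axis=1)
print(dominant[:10])  # topic index of the first ten documents
```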
@@ -102,7 +103,7 @@ a numpy array with shape $N \times K$ (number of documents $N$ and number of top

 ```python
 from fastopic import FASTopic
-from topmost.preprocessing import Preprocessing
+from topmost.preprocess import Preprocess

 # Prepare your dataset.
 docs = [
@@ -111,15 +112,29 @@ docs = [
 ]

 # Preprocess the dataset. This step tokenizes docs, removes stopwords, and sets max vocabulary size, etc.
-# Pass your tokenizer as:
-# preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
-preprocessing = Preprocessing(stopwords='English')
+# preprocess = Preprocess(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
+preprocess = Preprocess()

-model = FASTopic(50, preprocessing)
-topic_top_words, doc_topic_dist = model.fit_transform(docs)
+model = FASTopic(50, preprocess)
+top_words, doc_topic_dist = model.fit_transform(docs)
 ```


+### Save and Load
+
+```python
+
+path = "./tmp/fastopic.zip"
+model.save(path)
+
+loaded_model = FASTopic.from_pretrained(path)
+beta = loaded_model.get_beta()
+
+doc_topic_dist = loaded_model.transform(docs)
+# Keep training
+loaded_model.fit_transform(docs, epochs=1)
+```
+
 ### Topic info

 We can get the top words and their probabilities of a topic.
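A note on the Save and Load block above: in topic models, `beta` conventionally denotes the topic-word distribution matrix, so a cheap post-load sanity check is to confirm its shape (a sketch; the exact return type of `get_beta()` is not shown in this diff):

```python
# Sanity-check a reloaded model. Assumes get_beta() returns an array of
# shape (num_topics, vocab_size), the usual meaning of beta in topic
# models; verify against the installed FASTopic version.
beta = loaded_model.get_beta()
print(beta.shape)  # expected: (50, vocab_size) for a 50-topic model
```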
@@ -225,12 +240,12 @@ We summarize the frequently used APIs of FASTopic here. It's easier for you to l

 1. **Meet the `out of memory` error. My GPU memory is not enough due to large datasets. What should I do?**

-    You can try to set `save_memory=True` and `batch_size` in FASTopic.
-    `batch_size` should not be too small, otherwise it may damage performance.
+    You can try to set `low_memory=True` and `low_memory_batch_size` in FASTopic.
+    `low_memory_batch_size` should not be too small, otherwise it may damage performance.


     ```python
-    model = FASTopic(50, save_memory=True, batch_size=2000)
+    model = FASTopic(50, low_memory=True, low_memory_batch_size=2000)
     ```

     Or you can run FASTopic on the CPU as
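The hunk cuts off before the CPU snippet itself. As an illustrative sketch only, assuming the constructor accepts a `device` argument (not shown in this diff):

```python
# Assumed CPU fallback; `device` is a guess, not confirmed by this diff.
# Check FASTopic's constructor signature before relying on it.
model = FASTopic(50, device='cpu')
```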
@@ -278,10 +293,34 @@ We summarize the frequently used APIs of FASTopic here. It's easier for you to l
             return embeddings

     your_model = YourDocEmbedModel()
-    FASTopic(50, doc_embed_model=your_model)
+    model = FASTopic(50, doc_embed_model=your_model)
+    ```
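The head of this FAQ item lies outside the hunk; the visible contract is just an object whose `encode(docs)` returns document embeddings. A minimal sketch of such a wrapper, assuming an $N \times D$ numpy array is expected back (the SentenceTransformer choice is illustrative):

```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class YourDocEmbedModel:
    """Illustrative wrapper; assumes FASTopic only calls .encode(docs)."""

    def __init__(self, name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(name)

    def encode(self, docs: List[str], **kwargs) -> np.ndarray:
        # One embedding vector per document, as a numpy array.
        return self.model.encode(docs, convert_to_numpy=True)
```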
+
+5. **Can I use my own preprocess module?**
+
+    Yes! You can wrap your module and pass it to FASTopic:
+
+    ```python
+    class YourPreprocess:
+        def __init__(self):
+            ...
+
+        def preprocess(self, docs: List[str]):
+            ...
+            train_bow = ...
+            vocab = ...
+
+            return {
+                "train_bow": train_bow,  # sparse matrix
+                "vocab": vocab  # List[str]
+            }
+
+    your_preprocess = YourPreprocess()
+    model = FASTopic(50, preprocess=your_preprocess)
     ```
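The `...` placeholders above are the commit's own template. A concrete sketch of one way to satisfy that contract, built on scikit-learn (the class and its parameters are illustrative, not part of FASTopic):

```python
from typing import List

from sklearn.feature_extraction.text import CountVectorizer


class CountVectorizerPreprocess:
    """Illustrative preprocess module matching the contract shown above."""

    def __init__(self, vocab_size: int = 10000):
        self.vectorizer = CountVectorizer(max_features=vocab_size, stop_words="english")

    def preprocess(self, docs: List[str]):
        train_bow = self.vectorizer.fit_transform(docs)  # scipy sparse matrix
        vocab = self.vectorizer.get_feature_names_out().tolist()  # List[str]
        return {"train_bow": train_bow, "vocab": vocab}


# Usage: model = FASTopic(50, preprocess=CountVectorizerPreprocess())
```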


+
 ## Contact
 - We welcome your contributions to this project. Please feel free to submit pull requests.
 - If you encounter any issues, please either directly contact **Xiaobao Wu (xiaobao002@e.ntu.edu.sg)** or leave an issue in the GitHub repo.
@@ -290,4 +329,3 @@ We summarize the frequently used APIs of FASTopic here. It's easier for you to l
 ## Related Resources
 - [**TopMost**](https://github.com/bobxwu/topmost): a topic modeling toolkit, including preprocessing, model training, and evaluations.
 - [**A Survey on Neural Topic Models: Methods, Applications, and Challenges**](https://github.com/BobXWu/Paper-Neural-Topic-Models)
-
