[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)
**[[NeurIPS 2024] FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model](https://arxiv.org/pdf/2405.17978.pdf)**
[[Video]](https://recorder-v3.slideslive.com/?share=95127&s=a3c72f9a-4147-4cf0-a7d0-d95e45320df8)
[[TowardsDataScience Blog]](https://medium.com/@xiaobaowu/easy-fast-and-effective-topic-modeling-for-beginners-with-fastopic-2836781765f0)
[[Huggingface Blog]](https://huggingface.co/blog/bobxwu/fastopic)
It leverages optimal transport between the document, topic, and word embeddings.
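To make the optimal-transport idea concrete, below is a minimal, self-contained sketch of entropic optimal transport (Sinkhorn iterations) between toy document and topic embeddings. This illustrates the principle only; it is not FASTopic's implementation, and all names and sizes here are made up.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=100):
    """Entropic OT between uniform marginals over rows (docs) and columns (topics)."""
    n, k = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(k, 1.0 / k)
    K = np.exp(-cost / epsilon)             # Gibbs kernel
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iters):
        u = a / (K @ v)                     # scale rows toward the document marginal
        v = b / (K.T @ u)                   # scale columns toward the topic marginal
    return u[:, None] * K * v[None, :]      # transport plan, shape (n, k)

rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(8, 16))          # toy document embeddings
topic_emb = rng.normal(size=(3, 16))        # toy topic embeddings

# Cost: one minus cosine similarity, so semantically close pairs are cheap to match.
dn = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
tn = topic_emb / np.linalg.norm(topic_emb, axis=1, keepdims=True)
plan = sinkhorn(1.0 - dn @ tn.T)

# Row-normalize the transport plan into per-document topic distributions.
doc_topic = plan / plan.sum(axis=1, keepdims=True)
print(doc_topic.round(3))
```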
If you want to use FASTopic, please cite our [paper](https://arxiv.org/pdf/2405.17978.pdf) as
```
@inproceedings{wu2024fastopic,
    title={FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model},
    author={Wu, Xiaobao and Nguyen, Thong Thanh and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024}
}
```
https://github.com/user-attachments/assets/42fc1f2a-2dc9-49c0-baf2-97b6fd6aea70
- [Quick Start](#quick-start)
- [Usage](#usage)
    - [Try FASTopic on your dataset](#try-fastopic-on-your-dataset)
    - [Save and Load](#save-and-load)
    - [Topic info](#topic-info)
    - [Topic hierarchy](#topic-hierarchy)
    - [Topic weights](#topic-weights)
Discover topics from the 20newsgroups dataset with the number of topics set to `50`.
```python
from fastopic import FASTopic
from topmost import Preprocess
from sklearn.datasets import fetch_20newsgroups

# Load 20newsgroups without headers, footers, and quotes.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Tokenize and build a vocabulary of at most 10,000 words.
preprocess = Preprocess(vocab_size=10000)

# Train FASTopic with 50 topics.
model = FASTopic(50, preprocess)
top_words, doc_topic_dist = model.fit_transform(docs)
```
`top_words` is a list containing the top words of discovered topics.
`doc_topic_dist` is the topic distributions of documents (doc-topic distributions),
a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).
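For a quick sanity check of these outputs (a hypothetical follow-up snippet, reusing the variables from the code above):

```python
import numpy as np

print(top_words[0])                  # top words of the first discovered topic
print(np.argmax(doc_topic_dist[0]))  # index of the most likely topic of the first document
```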
### Try FASTopic on your dataset
```python
from fastopic import FASTopic
from topmost.preprocess import Preprocess

# Prepare your dataset.
docs = [
    # your documents here
]

# Preprocess the dataset. This step tokenizes docs, removes stopwords, and sets the max vocabulary size, etc.
# Pass your own settings as:
# preprocess = Preprocess(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocess = Preprocess()

model = FASTopic(50, preprocess)
top_words, doc_topic_dist = model.fit_transform(docs)
```
### Save and Load

```python
path = "./tmp/fastopic.zip"
model.save(path)

loaded_model = FASTopic.from_pretrained(path)
beta = loaded_model.get_beta()

# Infer doc-topic distributions with the loaded model.
doc_topic_dist = loaded_model.transform(docs)

# Keep training the loaded model.
loaded_model.fit_transform(docs, epochs=1)
```
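Here `get_beta()` returns the learned topic-word distributions, conventionally a $K \times V$ matrix ($K$ topics over a vocabulary of size $V$), from which each topic's top words can be read off.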
### Topic info

We can get the top words and their probabilities of a topic.
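A minimal sketch of reading a topic's info (this assumes FASTopic's `get_topic` helper; check the API summary for the exact signature):

```python
# Hypothetical call: inspect topic 3 as (word, probability) pairs.
print(model.get_topic(3))
```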
## FAQ
1. **Meet the `out of memory` error. My GPU memory is not enough due to large datasets. What should I do?**

    You can try to set `low_memory=True` and `low_memory_batch_size` in FASTopic.
    `low_memory_batch_size` should not be too small, otherwise it may damage performance.

    ```python
    model = FASTopic(50, low_memory=True, low_memory_batch_size=2000)
    ```
    Or you can run FASTopic on the CPU, as sketched below (assuming FASTopic accepts a `device` argument; check the API summary):
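    ```python
    model = FASTopic(50, device='cpu')  # assumption: `device` selects CPU/GPU; verify the exact parameter name
    ```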
4. **Can I use my own document embedding model?**

    Yes! You can wrap your model and pass it to FASTopic:

    ```python
    class YourDocEmbedModel:
        def encode(self, docs: List[str]):
            ...
            return embeddings

    your_model = YourDocEmbedModel()
    model = FASTopic(50, doc_embed_model=your_model)
    ```
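    For example, a hypothetical wrapper around `sentence-transformers` (an illustrative choice; any model that returns one embedding vector per document works):

    ```python
    from typing import List
    from sentence_transformers import SentenceTransformer

    class MyDocEmbedModel:
        def __init__(self):
            self.model = SentenceTransformer("all-MiniLM-L6-v2")

        def encode(self, docs: List[str], **kwargs):
            # Returns a (num_docs, embedding_dim) numpy array.
            return self.model.encode(docs, **kwargs)

    model = FASTopic(50, doc_embed_model=MyDocEmbedModel())
    ```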
5. **Can I use my own preprocess module?**

    Yes! You can wrap your module and pass it to FASTopic:

    ```python
    class YourPreprocess:
        def __init__(self):
            ...

        def preprocess(self, docs: List[str]):
            ...
            train_bow = ...
            vocab = ...

            return {
                "train_bow": train_bow,  # sparse matrix
                "vocab": vocab           # List[str]
            }

    your_preprocess = YourPreprocess()
    model = FASTopic(50, preprocess=your_preprocess)
    ```
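    For instance, a minimal sketch of such a module built on scikit-learn's `CountVectorizer` (an illustrative choice, not FASTopic's bundled preprocessing):

    ```python
    from typing import List
    from sklearn.feature_extraction.text import CountVectorizer

    class CountVectorizerPreprocess:
        def __init__(self, vocab_size=10000):
            self.vectorizer = CountVectorizer(max_features=vocab_size, stop_words="english")

        def preprocess(self, docs: List[str]):
            train_bow = self.vectorizer.fit_transform(docs)           # sparse bag-of-words matrix
            vocab = self.vectorizer.get_feature_names_out().tolist()  # List[str]
            return {"train_bow": train_bow, "vocab": vocab}

    model = FASTopic(50, preprocess=CountVectorizerPreprocess())
    ```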
## Contact

- We welcome your contributions to this project. Please feel free to submit pull requests.
- If you encounter any issues, please either directly contact **Xiaobao Wu (xiaobao002@e.ntu.edu.sg)** or leave an issue in the GitHub repo.
## Related Resources

- [**TopMost**](https://github.com/bobxwu/topmost): a topic modeling toolkit, including preprocessing, model training, and evaluations.
- [**A Survey on Neural Topic Models: Methods, Applications, and Challenges**](https://github.com/BobXWu/Paper-Neural-Topic-Models)