KeyNMF is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
while taking inspiration from classical matrix-decomposition approaches for extracting topics.

<figure>
  <img src="../images/keynmf.png" width="90%" style="margin-left: auto;margin-right: auto;">
  <figcaption>Schematic overview of KeyNMF</figcaption>
</figure>

Here's an example of the easiest way to fit and interpret a KeyNMF model.

```python
from turftopic import KeyNMF

model = KeyNMF(10, top_n=6)
model.fit(corpus)

model.print_topics()
```

## Keyword Extraction

The first step of the process is gaining enhanced representations of documents by using contextual embeddings.
Both the documents and the vocabulary get encoded with the same sentence encoder.
Keywords are assigned to each document based on the cosine similarity of the document embedding to the embedded words in the document.
Only the top $N$ words with positive cosine similarity to the document are kept.
These keywords are then arranged into a document-term importance matrix where each column represents a keyword that was encountered in at least one document,
and each row is a document. The entries in the matrix are the cosine similarities of the given keyword to the document in semantic space.

- For each document $d$:
    1. Let $x_d$ be the document's embedding produced with the encoder model.
    2. For each word $w$ in the document $d$:
        1. Let $v_w$ be the word's embedding produced with the encoder model.
        2. Calculate the cosine similarity between the word and the document:

            $$
            \text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||}
            $$

    3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$:

        $$
        K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where } |K^*| = N \text{ and } K^* \subseteq d
        $$

- Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords.

    $$
    M_{dw} =
    \begin{cases}
    \text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\
    0, & \text{otherwise}.
    \end{cases}
    $$

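To make the construction of $M$ concrete, here is a minimal NumPy sketch of the same computation. The encoder name, the toy documents, and the naive whitespace tokenization are illustrative assumptions for the example, not KeyNMF's internals.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder for the example

docs = [
    "Cars have revolutionized transportation.",
    "Vaccines help contain pandemics.",
]
N = 3  # keywords to keep per document

# Naive whitespace vocabulary, for illustration only
vocab = sorted({w.strip(".,").lower() for doc in docs for w in doc.split()})
word_index = {w: i for i, w in enumerate(vocab)}

doc_emb = encoder.encode(docs)    # x_d for each document
word_emb = encoder.encode(vocab)  # v_w for each vocabulary word

# Cosine similarity between every document and every vocabulary word
doc_norm = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
word_norm = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
sim = doc_norm @ word_norm.T

# Keep the top-N positive similarities per document -> keyword matrix M
M = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    in_doc = {word_index[w.strip(".,").lower()] for w in doc.split()}
    top = sorted(in_doc, key=lambda i: sim[d, i], reverse=True)[:N]
    for i in top:
        if sim[d, i] > 0:
            M[d, i] = sim[d, i]
```

In practice you can rely on `extract_keywords()`, shown below, which performs this step for you.
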
You can do this step manually if you want to precompute the keyword matrix.
Keywords are represented as dictionaries mapping words to keyword importances.

```python
model.extract_keywords(["Cars are perhaps the most important invention of the last couple of centuries. They have revolutionized transportation in many ways."])
```

```python
[{'transportation': 0.44713873,
  'invention': 0.560524,
  'cars': 0.5046208,
  'revolutionized': 0.3339205,
  'important': 0.21803442}]
```

A precomputed keyword matrix can also be used to fit a model:

```python
keyword_matrix = model.extract_keywords(corpus)
model.fit(keywords=keyword_matrix)
```

## Topic Discovery

Topics in this keyword matrix are then discovered using Non-negative Matrix Factorization.
Essentially, the model tries to discover underlying dimensions/factors along which most of the variance in term importance
can be explained.

- Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss:

    $$
    L(W,H) = ||M - WH||^2
    $$

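If you want to see what this decomposition step looks like outside of KeyNMF, here is a minimal sketch using scikit-learn's coordinate-descent NMF on a stand-in keyword matrix. The matrix and its dimensions are made up for illustration; when you call `fit`, KeyNMF performs the factorization for you.

```python
# Illustrative sketch: factorizing a (documents x keywords) matrix with NMF
import numpy as np
from sklearn.decomposition import NMF

M = np.random.rand(100, 500)  # stand-in for a precomputed keyword matrix

nmf = NMF(n_components=10, solver="cd", init="nndsvd", max_iter=500)
W = nmf.fit_transform(M)  # document-topic matrix
H = nmf.components_       # topic-term matrix

# Frobenius reconstruction error, i.e. the square root of L(W, H)
error = np.linalg.norm(M - W @ H)
```
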
You can fit KeyNMF on the raw corpus, with precomputed embeddings, or with precomputed keywords.

```python
# Fitting just on the corpus
model.fit(corpus)

# Fitting with precomputed embeddings
from sentence_transformers import SentenceTransformer

trf = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = trf.encode(corpus)

model = KeyNMF(10, encoder=trf)
model.fit(corpus, embeddings=embeddings)

# Fitting with precomputed keyword matrix
keyword_matrix = model.extract_keywords(corpus)
model.fit(keywords=keyword_matrix)
```

## Dynamic Topic Modeling

KeyNMF is also capable of modeling topics over time.
This happens by fitting a KeyNMF model first on the entire corpus, then
fitting individual topic-term matrices using coordinate descent based on the document-topic and document-term matrices in the given time slices.

1. Compute the keyword matrix $M$ for the whole corpus.
2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$.
3. For each time slice $t$:
    1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for documents in time slice $t$.
    2. Obtain the topic-term matrix for the time slice by minimizing square loss using coordinate descent while fixing $W_t$ (see the sketch after this list):

        $$
        H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2
        $$

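As a rough illustration of this per-slice step, the same objective can be minimized column by column with non-negative least squares while $W_t$ is held fixed. This sketch uses a different solver than the coordinate descent described above, and made-up array shapes, purely to show the structure of the problem; `fit_transform_dynamic()` below does the real work.

```python
# Sketch: estimating H_t with W_t held fixed, one keyword column at a time
import numpy as np
from scipy.optimize import nnls

n_docs_t, n_topics, n_terms = 50, 10, 300   # made-up shapes
W_t = np.random.rand(n_docs_t, n_topics)    # document-topic proportions in slice t
M_t = np.random.rand(n_docs_t, n_terms)     # keyword matrix for documents in slice t

H_t = np.zeros((n_topics, n_terms))
for j in range(n_terms):
    # argmin_h ||M_t[:, j] - W_t @ h||^2  subject to  h >= 0
    H_t[:, j], _ = nnls(W_t, M_t[:, j])
```
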
Here's an example of using KeyNMF in a dynamic modeling setting:

```python
from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = []
timestamps: list[datetime] = []

model = KeyNMF(5, top_n=5, random_state=42)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
```

You can use the `print_topics_over_time()` method to produce a table of the topics over the generated time slices.

> This example uses CNN news data.

```python
model.print_topics_over_time()
```

<center>

| Time Slice | 0_olympics_tokyo_athletes_beijing | 1_covid_vaccine_pandemic_coronavirus | 2_olympic_athletes_ioc_athlete | 3_djokovic_novak_tennis_federer | 4_ronaldo_cristiano_messi_manchester |
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

</center>

You can also display the topics over time on an interactive HTML figure.
The most important words for each topic are revealed by hovering over it.

> You will need to install Plotly for this to work.

```bash
pip install plotly
```

```python
model.plot_topics_over_time(top_k=5)
```

<figure>
  <img src="../images/dynamic_keynmf.png" width="80%" style="margin-left: auto;margin-right: auto;">
  <figcaption>Topics over time on a Figure</figcaption>
</figure>

## Online Topic Modeling

KeyNMF can also be fitted in an online manner.
This is done by fitting NMF with batches of data instead of the whole dataset at once.

| 191 | + |
| 192 | +#### Use Cases: |
| 193 | + |
1. You can use online fitting when you have **very large corpora** at hand, and it would be impractical to fit a model on them at once.
2. You have **new data flowing in constantly**, and need a model that can morph the topics based on the incoming data. You can also do this in a dynamic fashion.
3. You need to **fine-tune** an already fitted topic model on novel data.

#### Batch Fitting

We will use the batching function from the itertools recipes to produce batches.

> In newer versions of Python (>=3.12) you can just `from itertools import batched`

```python
import itertools

def batched(iterable, n: int):
    "Batch data into tuples of length n. The last batch may be shorter."
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(itertools.islice(it, n)):
        yield batch
```

You can fit a KeyNMF model to a very large corpus in batches like so:

```python
from turftopic import KeyNMF

model = KeyNMF(10, top_n=5)

corpus = ["some string", "etc", ...]
for batch in batched(corpus, 200):
    batch = list(batch)
    model.partial_fit(batch)
```

#### Precomputing the Keyword Matrix

If you want the best results, it might make sense to go over the corpus in multiple epochs:

```python
for epoch in range(5):
    for batch in batched(corpus, 200):
        model.partial_fit(batch)
```

This is mildly inefficient, however, as the texts need to be encoded and keywords extracted on every epoch.
In such scenarios you might want to precompute the keywords with the `extract_keywords()` method, and perhaps save them to disk.

Keywords are represented as dictionaries mapping words to keyword importances.

```python
model.extract_keywords(["Cars are perhaps the most important invention of the last couple of centuries. They have revolutionized transportation in many ways."])
```

```python
[{'transportation': 0.44713873,
  'invention': 0.560524,
  'cars': 0.5046208,
  'revolutionized': 0.3339205,
  'important': 0.21803442}]
```

You can extract keywords in batches and save them to disk in a file format of your choice.
In this example I will use NDJSON because of its simplicity.

```python
import json
from pathlib import Path
from typing import Iterable

# Here we are saving keywords to a JSONL/NDJSON file
with Path("keywords.jsonl").open("w") as keyword_file:
    # Doing this in batches is much more efficient than encoding
    # individual texts one by one.
    for batch in batched(corpus, 200):
        batch_keywords = model.extract_keywords(batch)
        # We serialize each keyword dictionary as one JSON line
        for keywords in batch_keywords:
            keyword_file.write(json.dumps(keywords) + "\n")

def stream_keywords() -> Iterable[dict[str, float]]:
    """This function streams keywords from the file."""
    with Path("keywords.jsonl").open() as keyword_file:
        for line in keyword_file:
            yield json.loads(line.strip())

for epoch in range(5):
    keyword_stream = stream_keywords()
    for keyword_batch in batched(keyword_stream, 200):
        model.partial_fit(keywords=keyword_batch)
```

#### Dynamic Online Topic Modeling

KeyNMF can also be fitted online in a dynamic manner.
This is useful when you have large corpora of text over time, or when you want to fit the model on incoming data and analyze how the topics change over time.

When using dynamic online topic modeling you have to predefine the time bins that you will use, as the model can't infer these from the data.

```python
from datetime import datetime

# We will bin by years over the period 2020-2030
bins = [datetime(year=y, month=1, day=1) for y in range(2020, 2030 + 2, 1)]
```

You can then online fit a dynamic topic model with `partial_fit_dynamic()`.

```python
model = KeyNMF(5, top_n=10)

corpus: list[str] = [...]
timestamps: list[datetime] = [...]

for batch in batched(zip(corpus, timestamps), 200):
    text_batch, ts_batch = zip(*batch)
    model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

## Considerations

### Strengths