Topic Distribution in Documents using BTM. #5

adjoshi81 · 2019-02-25T07:06:09Z

Hello jwijffels,

Thank you very much for creating the R implementation of BTM. I am using it to find topics in short texts (mainly tweets). I would like to know whether we can identify the topic distribution within each short text. Is this functionality available in the existing version of BTM?

The original research paper by Yan et al., "A Biterm Topic Model for Short Texts", mentions in its Introduction section:
"However, we show that the topic distribution of each document can be naturally derived based on the learned model".

Also, is there a way to identify the number of topics through this package? This is not an issue but a possible feature request.

Thanks again for your inputs.

jwijffels · 2019-02-25T08:06:30Z

To get the topic distribution within each short text, have you already tried the predict.BTM function (see ?predict.BTM)?
For finding the optimal number of topics: currently the only measure implemented is the likelihood, which indicates how well each biterm is fitted by the model. See the help of ?logLik.BTM. You can compare this value across different numbers of topics.
Other measures of topic quality are still open in issue #3.
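
A minimal sketch of both suggestions, assuming the `udpipe` package is installed and using its bundled `brussels_reviews_anno` data as a stand-in for tweets. The parameter values (`k`, `alpha`, `beta`, `iter`) and the candidate values in `ks` are illustrative assumptions, not recommendations:

```r
library(udpipe)
library(BTM)

# Stand-in data (assumption): Dutch reviews shipped with udpipe, nouns only.
# BTM expects a data.frame with a document id column followed by a token column.
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl" & xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

# Fit a small model; iter is kept low only to make the sketch run quickly.
set.seed(123)
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10)

# Topic distribution per document: one row per doc_id, one column per topic.
scores <- predict(model, newdata = x, type = "sum_b")
head(scores)

# Log-likelihood of the biterms under the fitted model.
fit <- logLik(model)
fit$ll

# Refit for several candidate numbers of topics and compare the log-likelihood.
ks <- c(3, 5, 10)
ll <- sapply(ks, function(k) {
  set.seed(123)
  logLik(BTM(x, k = k, alpha = 1, beta = 0.01, iter = 10))$ll
})
data.frame(k = ks, ll = ll)
```

With real data, `iter` would normally be much higher, and the log-likelihood comparison is only one signal for choosing the number of topics.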

adjoshi81 · 2019-02-25T12:29:56Z

Thanks, the predict.BTM function did give the topic distribution across individual texts.

mevalerio · 2023-03-08T19:46:22Z

Hi @jwijffels, I am using BTM for a paper; thank you for your hard work on it. I am thinking of using an entropy-based measure to evaluate models as K changes. However, I would like to assess it against “something” that picks up a word-based likelihood of belonging. I do not understand how logLik.BTM can help. Is it the case that the closer ll (the sum of the logs of sum(phi[term1, ] * phi[term2, ] * theta)) is to zero, the better the model? I know I am abusing terminology; apologies in advance.
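
For reference, the quantity described in the question can be written out as below. This is a sketch based only on the parenthetical in the question; the symbols $B$ (the set of biterms), $K$ (the number of topics), $\theta_k$ (the weight of topic $k$) and $\phi_{w,k}$ (the probability of word $w$ under topic $k$) are notation chosen here for illustration.

$$
\mathrm{ll} \;=\; \sum_{(w_i,\,w_j)\,\in\,B} \log\!\left( \sum_{k=1}^{K} \theta_k \,\phi_{w_i,k}\,\phi_{w_j,k} \right)
$$

Assuming $\theta$ and each topic's column of $\phi$ are probability distributions, every inner sum is at most 1, so each log term and hence $\mathrm{ll}$ is at most zero. A value closer to zero then means the model assigns higher probability to the observed biterms.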

manuelbickel · 2023-03-11T14:25:46Z

Hi, one option for measuring topic quality is coherence metrics. Simply put, these metrics take the top x terms of a topic and check their statistical relation in the corpus (the metrics differ in how they do this) to assess the quality of that set of terms. I have implemented some metrics in the text2vec package on the basis of a paper by Röder et al., but I have no experience with whether they make sense for, or work with, biterm models. They probably perform worse for biterm models due to the sparseness. I would really like to implement the metrics for udpipe to support the nice work by jwijffels, but I simply lack the time to do so at the moment.

adjoshi81 closed this as completed Feb 25, 2019
