add measures of topic quality #3
@manuelbickel I see you have been working on some of these measures for the text2vec package. Are you interested in some extra work for this biterm topic model package?
Hi, thank you for your interest in my recent work. I will have to finalize some work for my PhD thesis in the next two months, but afterwards I could try to provide support. I think it should not be too difficult to use the metrics implemented in text2vec so far for the biterm model. The input required for coherence metrics is "just" the n top topic terms and a reference corpus to build a reference term co-occurrence matrix (TCM) from. To my knowledge (which is limited since I am not a computer scientist), coherence metrics have so far been applied in the context of "normal" text, in contrast to the shorter texts BTM is aiming at, so we might have to check how well the metrics work in this context. It should probably be fine if a suitable reference corpus of a similar nature to the texts is selected, I guess.
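For reference, a minimal sketch of how BTM top terms could be fed into text2vec's `coherence()`; `refdocs` (the reference corpus as a character vector) and `model` (a fitted BTM model) are placeholders, and it assumes the top terms all occur in the reference vocabulary:

```r
library(BTM)
library(text2vec)

## build a term co-occurrence matrix (TCM) from the reference corpus
it  <- itoken(refdocs, tolower, word_tokenizer)
v   <- create_vocabulary(it)
tcm <- create_tcm(it, vocab_vectorizer(v), skip_grams_window = 5)

## top terms per BTM topic as a terms-by-topics character matrix
tw <- sapply(terms(model, top_n = 10), function(x) x$token)

coherence(tw, tcm)
```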
That would be great!
Just a reminder for the time when we get to work on this in detail... For the cluster distance metrics, the Jensen-Shannon divergence is needed, which has already been implemented in the LDAvis package. We can use this...
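A small self-contained sketch of the Jensen-Shannon divergence between two topic-term distributions `p` and `q` (LDAvis uses this inside its `jsPCA` scaling; applying it to `model$phi` columns is an assumption about the BTM model object):

```r
## Jensen-Shannon divergence between two discrete distributions p and q
jsd <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))  # Kullback-Leibler
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

## e.g. distance between the term distributions of topics 1 and 2:
## jsd(model$phi[, 1], model$phi[, 2])
```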
Hi jwijffels, thank you very much for BTM. I am trying to find the optimal number of topics for a corpus of tweets. Could you please let me know what you think of my approach? I'd appreciate it very much. Thank you.
No comment. |
Thanks. I wish I had the ability to comment and help; if I did, I would.
First of all, sorry that I still have not proceeded with coherence metrics for BTM; since my current main job has nothing to do with programming, it is difficult to find time. Still on my list.

Now to the question of @hg-wells: Generating many models with varying numbers of topics and finding the "best" one makes sense from my perspective. I have published a paper using this approach based on the excellent text2vec package by dselivanov (so not BTM, but standard LDA; see the paper and vignette at https://github.com/manuelbickel/textility). The code is unpolished and the package is not installation-ready; do not hesitate to contact me with questions.

I used loglik and coherence metrics with a comparably small but thematically very specific reference corpus. The metrics all gave different answers regarding the "best" number of topics, since they measure different qualities. The best approach, to my current knowledge, would be to create models with different numbers of topics and different hyperparameters, then use different reference corpora and different metrics to evaluate the results. Then pick out the "best" models these metrics propose and inspect them manually using expert knowledge. A lot of work, I know... I think purely automatic detection works in some contexts, but not in all. Texts represent meaning, and it depends on what kind of meaning you are searching for; there is no single correct answer. Imagine all these topic modelling algorithms as text coders like in the social sciences: you need to find out which one you trust the most, and this can vary depending on the task. I hope that helps a bit, at least.
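As a first step in that direction, a minimal sketch of fitting BTM over a grid of topic counts and comparing log-likelihoods; it assumes `x` is a doc_id/token data.frame as in the BTM examples, that `logLik()` is available for BTM objects, and that the `$ll` field holds the total log-likelihood (an assumption):

```r
library(BTM)
ks <- c(5, 10, 15, 20, 25)                       # candidate numbers of topics
models <- lapply(ks, function(k) BTM(x, k = k, beta = 0.01, iter = 1000))
## collect the total log-likelihood per model ($ll assumed to hold it)
data.frame(k = ks, loglik = sapply(models, function(m) logLik(m)$ll))
```

The "best" k by log-likelihood would then still be cross-checked against coherence metrics and manual inspection, as described above.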
Hi,
Thank you so much for your answer, I really appreciate it. I find your paper very interesting. As I said, I am interested in biterm models at the moment since I am working on tweets, but your code on LDA is of enormous help; thank you so much for sharing it. I will look into it and try to work out coherence and apply it to BTM. I also appreciate the description of your approach and will attempt to replicate the approach you used in your paper.
Again, thank you so much for all your help, I really appreciate it.
Hi,

```r
# Exclusivity
exclusivity <- function(model, M = 30, frexw = 0.7) {
  # ...
}

# Run BTM and compute exclusivity
install.packages("BTM")
```
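For reference, a minimal FREX-style exclusivity sketch in the spirit of the stm package's approach, assuming `model$phi` is the terms-by-topics matrix of P(term | topic) that `BTM()` returns (an assumption about the model object; the name `exclusivity_btm` is illustrative):

```r
## FREX-style exclusivity per topic, adapted from the stm package's approach.
## Assumes model$phi is a W x K matrix of P(term | topic);
## frexw weights exclusivity against frequency.
exclusivity_btm <- function(model, M = 30, frexw = 0.7) {
  phi <- model$phi
  mat <- phi / rowSums(phi)                     # each term normalized over topics
  ex  <- apply(mat, 2, rank) / nrow(mat)        # exclusivity ECDF per topic
  fr  <- apply(phi, 2, rank) / nrow(phi)        # frequency ECDF per topic
  frex <- 1 / (frexw / ex + (1 - frexw) / fr)   # harmonic mean (FREX score)
  index <- apply(phi, 2, order, decreasing = TRUE)[1:M, ]  # top-M terms per topic
  sapply(seq_len(ncol(phi)), function(i) sum(frex[index[, i], i]))
}
```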
@hg-wells Thanks for the contribution and for having taken the time for this. Before proceeding, can you create a pull request where you put the R code in the package, document it, provide an example, show the expected behaviour of the function, and possibly add a test using the tinytest package? I've just set up Travis CI so that we can see from there if there are any issues. Thanks!
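A minimal tinytest sketch of what such a test could look like (the file path, iteration count, and the `exclusivity()` signature are illustrative assumptions):

```r
## inst/tinytest/test_exclusivity.R (illustrative)
library(tinytest)
library(BTM)

data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl" & xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, iter = 10)

scores <- exclusivity(model, M = 30)
expect_true(is.numeric(scores))
expect_equal(length(scores), 5)   # one exclusivity score per topic
```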
Hi, thank you for your reply and interest. I will surely do it; it may be a few days, but I am happy to move on with it!
Hi @jwijffels and @hg-wells, for the semantic coherence of biterm topic models, the function is as follows:
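A minimal sketch of such a coherence score in the style of Mimno et al. (2011); the exact function is in the supplementary materials linked below, and this version assumes that `terms()` returns the top tokens per topic and that `dtm` contains all of those tokens as columns:

```r
## Semantic coherence per topic (Mimno et al. 2011):
## sum over top-term pairs of log((D(w_m, w_l) + 1) / D(w_l)),
## with D() counting documents in a binary document-term matrix.
coherence_btm <- function(model, dtm, M = 10) {
  dtm <- dtm > 0                            # presence/absence per document
  top <- terms(model, top_n = M)            # list of data.frames, one per topic
  sapply(top, function(tt) {
    v <- tt$token
    score <- 0
    for (m in 2:length(v)) {
      for (l in 1:(m - 1)) {
        d_l  <- sum(dtm[, v[l]])                  # D(w_l)
        d_ml <- sum(dtm[, v[m]] & dtm[, v[l]])    # D(w_m, w_l)
        score <- score + log((d_ml + 1) / d_l)
      }
    }
    score
  })
}
```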
The necessary Document-Term-Matrix (DTM) can be calculated with:
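For instance with the udpipe package (the pipeline BTM's own examples use), something along these lines, where `x` is the doc_id/lemma data.frame shown next:

```r
library(udpipe)
dtf <- document_term_frequencies(x)   # term frequencies per document
dtm <- document_term_matrix(dtf)      # sparse document-term matrix
```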
where x is:
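Presumably along the lines of the README example cited below:

```r
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))   # keep only nouns
x <- x[, c("doc_id", "lemma")]
```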
(the example from https://github.com/bnosac/BTM). We provided this function in our supplementary materials (http://dx.doi.org/10.23668/psycharchives.4372; unfortunately it only works with BTM prior to version 0.3.2), where we made use of @hg-wells's exclusivity function as well. Thanks for that!
We made our code available under the LGPL 3.0 licence, which should be compatible with the Apache License 2.0.
Thank you! Sorry for the delay in replying.
@abitter The code can't be included in the package as LGPL 3.0, which is a less liberal license than the Apache License 2.0 under which BTM ships. Only if your code license is changed to Apache can it be included in the package.
@jwijffels I see – I'll check if that's possible.
@jwijffels So I checked with @grenwi, and we now also provide the code above (#3 (comment)) under the Apache License 2.0.
Should
@ginalamp I noticed that the
Thank you for the clarification, and apologies if the question is obvious.
Note: perplexity does not yet exist for BTM models; we can implement it as defined in the BTM paper: https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
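A minimal sketch of what that could look like, following the paper's per-biterm likelihood P(b) = sum_k P(z = k) P(w_i | z = k) P(w_j | z = k); it assumes `model$theta` holds the topic proportions P(z), `model$phi` the term-by-topic probabilities with the vocabulary as row names, and `biterms` a two-column data.frame of word pairs (all assumptions; the function name is illustrative):

```r
perplexity_btm <- function(model, biterms) {
  ## log P(b) for every biterm b = (w1, w2)
  ll <- mapply(function(w1, w2) {
    log(sum(model$theta * model$phi[w1, ] * model$phi[w2, ]))
  }, biterms[[1]], biterms[[2]])
  exp(-mean(ll))   # perplexity = exp(-average biterm log-likelihood)
}
```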