-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Account for empty corpus in HDPModel. Fixes #2563 #2639
base: develop
Are you sure you want to change the base?
Conversation
@@ -339,6 +339,8 @@ def __init__(self, corpus, id2word, max_chunks=None, max_time=None, | |||
corresponding lda model from :meth:`~gensim.models.hdpmodel.HdpModel.suggested_lda_model` | |||
|
|||
""" | |||
if len(corpus) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is corpus
guaranteed to support len()
in HdpModel? What if it's just a generator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point - would this suffice? @piskvorky
if len(corpus) == 0 or len(list(corpus)) == 0:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most definitely not. The len
exception is still there, plus now you're serializing the entire corpus into RAM! If corpus
is a generator, its content will be irretrievably gone afterward.
Does HdpModel support generators? Or are we guaranteed to have a repeatable iterable with len
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok got it, well the comments in the file just say
corpus : iterable of list of (int, float)
Corpus in BoW format.
so I wouldn't call this a guarantee but it does sound like list is the intended type. Should there maybe be type checking to guarantee that list is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a list. It's an iterable of something.
That something is a list of tuples, but that still doesn't change the fact that you're dealing with an iterable, and you won't necessarily be able to call len()
on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it maybe be easiest to use the existing utils.is_corpus method in the constructor? since it covers the empty corpus case and works for generators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd say calling len(corpus)
is probably a bug, because it risks consuming the iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible the HDP model does not support iterables. It is possible it supports list
only :( (unlike other models in Gensim)
That's why I ask about the contract of HdpModel – whether it explicitly needs its input data in RAM (~list), or if it supports streamed inputs (~any iterable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky I'll look over the code and also look around for usage of the model with a non-list iterable. if it turns out that only list is supported, would it be helpful to add support for all iterables? or not worth the effort since the model may be deprecated soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look over the code and also look around for usage of the model with a non-list iterable.
That will be great, thanks!
or not worth the effort since the model may be deprecated soon
I'd say "not worth the effort" right now. It is probably possible to clean up and optimize HDP, but we lack a clear motivating use-case. So, before any technical optimizations, we'd need a good strong tutorial / how-to document, showing off why HDP is worth it, how it's better than other models, when to use it.
The performance considerations are secondary to my mind.
There are a couple of ways to go about preventing the infinite loop mentioned in the Issue from occurring. I thought maybe to user logger.error, or to skip the while loop if the corpus is empty. ButI saw in the inference method that a RunTimeError is thrown when lda_alpha and lda_beta are not initialized. So I thought this seemed consistent.