-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix FastText model reading of unsupported modes from Facebook's FastText #3222
Fix FastText model reading of unsupported modes from Facebook's FastText #3222
Conversation
Hey @Tewatia5355 @karan121bhukar , |
Hey there, I'm still figuring out why it didn't worked!! Will tell you if I get some leads :) |
I think this is because the file build-wheels . A change was made 13 days ago where they added username. The error we are getting is also related to username: tagging @mpenkov : Should we create an issue for this. I observed that in the previous pull requests this issue wasn't seen. Infact the newer PR seems to be having more checks than previous one's. |
Thank you for letting me know. I'll disable wheel building for PRs. |
Thank you, does this mean we have to create another pull request to pass the checks? |
No, there's no need for that. |
@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True): | |||
with utils.open(model_file, 'rb') as fin: | |||
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model) | |||
|
|||
# Here we are checking for unsupported FB FT Modes | |||
if m.loss != 1 and m.loss != 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if m.loss != 1 and m.loss != 2: | |
if m.loss not in (1, 2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to #3223, negative sampling is loss=0
, not loss=2
. Can you please double check? What's the correct value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @piskvorky,
I checked it. In the parameters docstring in the models/_fasttext_bin, that I have copied here:
"""Holds data loaded from the Facebook binary.
Parameters
.
.
neg : int
If non-zero, indicates that the model uses negative sampling.
loss : int
If equal to 1, indicates that the model uses hierarchical sampling.
model : int
If equal to 2, indicates that the model uses skip-grams.
"""
This suggests that m.neg==0
means no negative sampling, and m.neg == n
(where n is a positive integer) means negative sampling is being used.
So I think if we are checking for negative sampling, the condition should be if m.neg == 0
or if m.neg != 0
based on what we want to do. From what I have read from the #3223 I gather that negative sampling and hierarchical softmax are supported in loss. so we shouldn't really need to check for negative sampling i.e, if m.neg != 0
(not really required)
I have made a list of what's supported and what's not based on #3223 . Please let me know if they are right, ill make changes as per that:
Supported:
a. Loss
1. Negative Sampling
2. Hierarchical Softmax
b. Model
1. CBOW
2. Skip-gram
Unsupported:
a. Loss:
- OneVsAll_Loss
b. Model:
- Supervised
Let me know if there are any inconsistencies in my understanding. I too want to help you complete this asap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't really understand that list. If the check is == 0
or != 0
, why would we check for == 1
and == 2
?
I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't really understand that list. If the check is
== 0
or!= 0
, why would we check for== 1
and== 2
?
Hi yes you are right. We shouldn't be checking ==1
or ==2
in this case, also if we are checking for negative sampling we must not, use m.loss
at all we must check for m.neg
. (Just to clarify the above ==1 and ==2 was not done by me, this PR was created by @Tewatia5355, Im just trying to figure out what are the losses and models not supported in fasttext. )
I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.
Hi, yes sure I can investigate that and ill create a PR accordingly. Also if you can provide the link for fasttext docs that would be great. (just to be sure that I'm referring the right docs). Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, please do. This PR seems incorrect so I'm removing the "Next release" tag – we need to fix it first, make the code correct.
Fasttext code is at https://github.com/facebookresearch/fastText.
Thanks!
# Here we are checking for unsupported FB FT Modes | ||
if m.loss != 1 and m.loss != 2: | ||
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)") | ||
elif m.model != 1 and m.model != 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif m.model != 1 and m.model != 2: | |
elif m.model not in (1, 2): |
@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True): | |||
with utils.open(model_file, 'rb') as fin: | |||
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model) | |||
|
|||
# Here we are checking for unsupported FB FT Modes | |||
if m.loss != 1 and m.loss != 2: | |||
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)") | |
raise ValueError("The fasttext `loss` parameter must be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)") |
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)") | ||
elif m.model != 1 and m.model != 2: | ||
raise ValueError( | ||
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)") | |
"The fasttext `model` parameter must be either 1 (for Continuous Bag of Words model) or 2 (for Skip-gram model)") |
@@ -0,0 +1,8416 @@ | |||
/* Generated by Cython 0.29.23 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this here? Please remove from the PR.
@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True): | |||
with utils.open(model_file, 'rb') as fin: | |||
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model) | |||
|
|||
# Here we are checking for unsupported FB FT Modes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Here we are checking for unsupported FB FT Modes | |
# Check for unsupported Facebook FastText modes. |
@SagarDollin can you address the above please? We're planning a new release soon, it'd be great to include your fix. |
Hi. I have approved the changes you requested. I'll also give some time for the comments you mentioned. However Im currently in between some other work and I'll get back to this within 2-3 days. By addressing this did you mean only approving the requested changes or to address all the comments? Thanks, |
"All the comments" – although I only see one: The rest are cosmetic language changes, to be simply accepted. |
By all the other comments I meant these comments on PR #3220 : |
Hm, is that necessary? #3220 is a separate PR. Is it blocking this PR in some way? |
No not at all. I'll take a look at this one first. Thanks.. |
@SagarDollin can you finish this please? We'd like to release ASAP. Thanks! |
The code in this PR seems incorrect (see #3222 (comment)) and its author went MIA, so closing here. But fixing the issue is still desirable. |
Fixes #3179: Added error handling.