Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FastText model reading of unsupported modes from Facebook's FastText #3222

Conversation

Tewatia5355
Copy link

@Tewatia5355 Tewatia5355 commented Aug 30, 2021

Fixes #3179: Added error handling.

@SagarDollin
Copy link

SagarDollin commented Aug 31, 2021

Hey @Tewatia5355 @karan121bhukar ,
Similar to your PR , even my PR #3220 passes all the checks except for the build wheels. I have cloned the project on a windows system. Any idea how to fix this?

@Tewatia5355
Copy link
Author

Hey @Tewatia5355 @karan121bhukar ,
Similar to your PR , even my PR passes all the checks except for the build wheels. I have cloned the project on a windows system. Any idea how to fix this?

Hey there, I'm still figuring out why it didn't worked!! Will tell you if I get some leads :)

@SagarDollin
Copy link

Hey there, I'm still figuring out why it didn't worked!! Will tell you if I get some leads :)

I think this is because the file build-wheels . A change was made 13 days ago where they added username. The error we are getting is also related to username:
image

tagging @mpenkov : Should we create an issue for this. I observed that in the previous pull requests this issue wasn't seen. Infact the newer PR seems to be having more checks than previous one's.

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 31, 2021

Thank you for letting me know. I'll disable wheel building for PRs.

@SagarDollin
Copy link

Thank you for letting me know. I'll disable wheel building for PRs.

Thank you, does this mean we have to create another pull request to pass the checks?

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 1, 2021

No, there's no need for that.

@piskvorky piskvorky changed the title Fixes #3179 Fix FastText model reading of unsupported modes from Facebook's FastText Feb 18, 2022
@piskvorky piskvorky added this to the Next release milestone Feb 18, 2022
@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True):
with utils.open(model_file, 'rb') as fin:
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

# Here we are checking for unsupported FB FT Modes
if m.loss != 1 and m.loss != 2:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if m.loss != 1 and m.loss != 2:
if m.loss not in (1, 2):

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #3223, negative sampling is loss=0, not loss=2. Can you please double check? What's the correct value?

Copy link

@SagarDollin SagarDollin Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @piskvorky,
I checked it. In the parameters docstring in the models/_fasttext_bin, that I have copied here:

"""Holds data loaded from the Facebook binary.
Parameters

.
.
neg : int
    If non-zero, indicates that the model uses negative sampling.
loss : int
    If equal to 1, indicates that the model uses hierarchical sampling.
model : int
    If equal to 2, indicates that the model uses skip-grams.
"""

This suggests that m.neg==0 means no negative sampling, and m.neg == n (where n is a positive integer) means negative sampling is being used.
So I think if we are checking for negative sampling, the condition should be if m.neg == 0 or if m.neg != 0 based on what we want to do. From what I have read from the #3223 I gather that negative sampling and hierarchical softmax are supported in loss. so we shouldn't really need to check for negative sampling i.e, if m.neg != 0 (not really required)

I have made a list of what's supported and what's not based on #3223 . Please let me know if they are right, ill make changes as per that:

Supported:
a. Loss

1. Negative Sampling
2. Hierarchical Softmax

b. Model

1. CBOW
2. Skip-gram

Unsupported:
a. Loss:

  1. OneVsAll_Loss

b. Model:

  1. Supervised
    Let me know if there are any inconsistencies in my understanding. I too want to help you complete this asap.

Copy link
Owner

@piskvorky piskvorky Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't really understand that list. If the check is == 0 or != 0, why would we check for == 1 and == 2?

I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.

Copy link

@SagarDollin SagarDollin Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't really understand that list. If the check is == 0 or != 0, why would we check for == 1 and == 2?

Hi yes you are right. We shouldn't be checking ==1 or ==2 in this case, also if we are checking for negative sampling we must not, use m.loss at all we must check for m.neg. (Just to clarify the above ==1 and ==2 was not done by me, this PR was created by @Tewatia5355, Im just trying to figure out what are the losses and models not supported in fasttext. )

I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.

Hi, yes sure I can investigate that and ill create a PR accordingly. Also if you can provide the link for fasttext docs that would be great. (just to be sure that I'm referring the right docs). Thanks!

Copy link
Owner

@piskvorky piskvorky Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, please do. This PR seems incorrect so I'm removing the "Next release" tag – we need to fix it first, make the code correct.

Fasttext code is at https://github.com/facebookresearch/fastText.

Thanks!

# Here we are checking for unsupported FB FT Modes
if m.loss != 1 and m.loss != 2:
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
elif m.model != 1 and m.model != 2:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif m.model != 1 and m.model != 2:
elif m.model not in (1, 2):

@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True):
with utils.open(model_file, 'rb') as fin:
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

# Here we are checking for unsupported FB FT Modes
if m.loss != 1 and m.loss != 2:
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
raise ValueError("The fasttext `loss` parameter must be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")

raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
elif m.model != 1 and m.model != 2:
raise ValueError(
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)")
"The fasttext `model` parameter must be either 1 (for Continuous Bag of Words model) or 2 (for Skip-gram model)")

@@ -0,0 +1,8416 @@
/* Generated by Cython 0.29.23 */
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here? Please remove from the PR.

@@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True):
with utils.open(model_file, 'rb') as fin:
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

# Here we are checking for unsupported FB FT Modes
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Here we are checking for unsupported FB FT Modes
# Check for unsupported Facebook FastText modes.

@piskvorky
Copy link
Owner

@SagarDollin can you address the above please? We're planning a new release soon, it'd be great to include your fix.

@piskvorky piskvorky self-assigned this Feb 25, 2022
@SagarDollin
Copy link

@SagarDollin can you address the above please? We're planning a new release soon, it'd be great to include your fix.

Hi. I have approved the changes you requested. I'll also give some time for the comments you mentioned. However Im currently in between some other work and I'll get back to this within 2-3 days.

By addressing this did you mean only approving the requested changes or to address all the comments?

Thanks,
Sagar B Dollin

@piskvorky
Copy link
Owner

By addressing this did you mean only approving the requested changes or to address all the comments?

"All the comments" – although I only see one: loss=0 vs loss=2. Is there anything else unresolved?

The rest are cosmetic language changes, to be simply accepted.

@SagarDollin
Copy link

By addressing this did you mean only approving the requested changes or to address all the comments?

"All the comments" – although I only see one: loss=0 vs loss=2. Is there anything else unresolved?

The rest are cosmetic language changes, to be simply accepted.

By all the other comments I meant these comments on PR #3220 :
#3220 (review)

@piskvorky
Copy link
Owner

piskvorky commented Feb 25, 2022

Hm, is that necessary? #3220 is a separate PR. Is it blocking this PR in some way?

@SagarDollin
Copy link

Hm, is that necessary? #3220 is a separate PR. Is it blocking this PR in some way?

No not at all. I'll take a look at this one first.

Thanks..

@piskvorky
Copy link
Owner

@SagarDollin can you finish this please? We'd like to release ASAP. Thanks!

@piskvorky
Copy link
Owner

The code in this PR seems incorrect (see #3222 (comment)) and its author went MIA, so closing here. But fixing the issue is still desirable.

@piskvorky piskvorky closed this Apr 25, 2022
@piskvorky piskvorky removed this from the Next release milestone Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gensim's FastText model reads in unsupported modes from Facebook's FastText
4 participants