Question: is it possible to remove words from a fastText model (.bin file)? #3618

@jchwenger

Hi there,

I'm testing the pre-trained fastText models from Meta, and it's nice to have the option of subword lookup for OOV words. However, for French at least, the data cleaning leaves something to be desired: the vocabulary contains many duplicates and badly tokenised words.

I was thinking of testing this nevertheless, but filtering out all the words that are not in a separate dictionary file (I have pretty comprehensive word lists). I could do that after computing the similarity, but it would be neater to remove the words and their vectors from the model itself: that would avoid wasted computation, and I wouldn't need to implement checks for topn (to make sure I actually get n neighbours back for each request).
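To make the post-filtering fallback concrete, here is a minimal sketch of the topn check described above. It is not a gensim API: `filtered_neighbours`, `query`, `allowed` and `max_fetch` are hypothetical names introduced for illustration. The only assumption is that `query(n)` returns the n best `(word, similarity)` pairs in descending order, which is the shape gensim's `most_similar(..., topn=n)` returns.

```python
from typing import Callable, Iterable, List, Tuple

Neighbour = Tuple[str, float]


def filtered_neighbours(
    query: Callable[[int], List[Neighbour]],
    allowed: Iterable[str],
    topn: int = 10,
    max_fetch: int = 100_000,
) -> List[Neighbour]:
    """Return up to `topn` neighbours whose word is in `allowed`.

    `query(n)` is a hypothetical callable returning the n best
    (word, similarity) pairs, best first. With gensim it could be e.g.
    `lambda n: wv.most_similar("chat", topn=n)` (an assumption about
    how the model is wrapped, not something gensim provides as-is).
    """
    allowed_set = set(allowed)
    fetch = topn
    while True:
        candidates = query(fetch)
        # Keep only dictionary words, preserving the similarity ranking.
        kept = [(word, score) for word, score in candidates if word in allowed_set]
        enough = len(kept) >= topn
        exhausted = len(candidates) < fetch  # the model has nothing more to give
        capped = fetch >= max_fetch          # safety limit on the over-fetch
        if enough or exhausted or capped:
            return kept[:topn]
        # Not enough clean neighbours yet: ask for twice as many and retry.
        fetch = min(fetch * 2, max_fetch)
```

The drawback is visible in the loop: each retry recomputes the full ranking over the unfiltered vocabulary, which is exactly the wasted computation that removing the words from the model would avoid.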

Is this something that can be done using gensim, by any chance? (I would have asked on the fastText repo first, but it is read-only.)

Thanks for this!
Best,
Jeremie
