Description
Hi there,
I'm testing the pre-trained fastText models from Meta, and it's nice to have the option of subword search for OOV words. However, for French at least, the data cleaning leaves something to be desired: the vocabulary contains many duplicates and badly tokenised words.
I was thinking of testing the models nevertheless, but filtering out all the words that are not in a separate dictionary file (I have pretty comprehensive word lists). I could do that after computing the similarity, but it would be neater to remove the words and their vectors from the model instead, so that there is less wasted computation and I wouldn't need to implement checks on `topn` (to make sure I actually obtain `n` neighbours for each request).
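To make the "filter after computing the similarity" option concrete, here is a minimal sketch of the kind of `topn` check I mean. The helper name `filtered_neighbours` and its `start` and `limit` parameters are my own invention, not part of any library; `most_similar` stands for any callable that returns `(word, score)` pairs ranked by similarity and accepts a `topn` keyword (gensim's `most_similar` is one such callable).

```python
def filtered_neighbours(most_similar, query, allowed, n, start=None, limit=100_000):
    """Return up to n (word, score) pairs whose word is in `allowed`.

    Hypothetical helper: asks for more and more candidates until n of them
    survive the dictionary filter, the model runs out of neighbours,
    or `limit` is reached.
    """
    topn = start or 2 * n
    while True:
        candidates = most_similar(query, topn=topn)
        kept = [(word, score) for word, score in candidates if word in allowed]
        exhausted = len(candidates) < topn  # the model has nothing more to offer
        if len(kept) >= n or exhausted or topn >= limit:
            return kept[:n]
        topn = min(topn * 2, limit)  # widen the search and try again
```

Every retry recomputes the whole ranking, which is exactly the wasted computation I would like to avoid.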
Is this something that can be done using `gensim`, by any chance? (I would have asked on the `fastText` repo first, but it is read-only.)
Thanks for this!
Best,
Jeremie