using fasttext embeddings #34
Hi @Dragon615
My workaround for using fasttext embeddings: I computed fasttext embeddings for all tokens in the train/dev/test set, stored these embeddings in a text file with the same format as the GloVe embeddings, and then passed this embedding lookup file to this implementation to train the BiLSTM-CRF network.
For experiments this workaround is fine, as we know the test set in advance. If you would like to use it in a real application, it would no longer work, as embeddings for new (unknown) tokens would not be generated. In that case, the implementation of the BiLSTM network must be extended so that it can compute fasttext embeddings on the fly.
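For illustration only, a minimal sketch of what "computing fasttext embeddings on the fly" could look like, using the official fasttext Python bindings; the model path cc.en.300.bin is just a placeholder and none of this is part of this repository:

    # Sketch only: compute fastText vectors on the fly for arbitrary tokens.
    # Assumes the official `fasttext` pip package and a pre-trained binary
    # model; "cc.en.300.bin" is a placeholder path.
    import fasttext

    model = fasttext.load_model("cc.en.300.bin")  # .bin model keeps subword info

    def embed(token):
        # Works for unseen tokens too: fastText composes the vector from
        # character n-grams instead of a fixed vocabulary lookup.
        return model.get_word_vector(token)

    vector = embed("outofvocabularyword")
    print(len(vector))  # embedding dimension, e.g. 300

An extended version of the network would call such a lookup whenever a token is missing from the precomputed embedding file.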
Thanks, @nreimers, for your prompt response. I'm interested in experimenting with the fasttext pretrained embeddings offered by Facebook for different languages. The .vec file contains lines like:
story 0.032732 -0.18461 -0.050295 ....
When I specify the path of the fasttext pretrained embeddings file (i.e. the .vec file) and run the script (Train_POS.py), I get an error at the following line:
tokens = Embedding(input_dim=self.embeddings.shape[0], output_dim=self.embeddings.shape[1], weights=[self.embeddings], trainable=False, name='word_embeddings')(tokens_input)
What I understand from your explanation is that, to use fasttext embeddings, I need to build a new word embedding file using the train, dev, and test sets, as the implementation in its current state cannot deal with "unk" words. Is that correct? If so, would you please clarify this point further: "I stored these embeddings in a text file with the same format as the GloVe embeddings"? I'm confused because I think the generated ".vec" file is already in the GloVe format. Again, thank you so much for your response.
Hi,
this error can happen when the dimension of the embeddings varies, e.g., some entries have 300 dimensions (floats) while others have more or fewer. That in turn can happen when the file format is not clean, e.g. tokens that contain whitespace cannot be loaded by the current implementation. You should check, when loading the embedding file, that each line has exactly one token followed by e.g. 300 floats; the loading code uses the Python split() function.
You understood it correctly: the code cannot deal with UNK words, i.e. it cannot generate fasttext embeddings for unknown tokens. Hence, you must ensure that all important word embeddings are present in the embedding file you are loading.
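To make that check concrete, a small standalone script along these lines could verify the file before training (the file name and dimension below are assumptions for illustration, not values used by this repository):

    # Sketch: sanity-check a GloVe-style embeddings text file.
    # Every data line should be "<token> <float_1> ... <float_N>".
    # Note: .vec files written by fastText start with a header line
    # ("<num_words> <dim>"), which would also be flagged here and may
    # need to be removed or skipped.
    expected_dim = 300                    # assumed embedding size
    path = "fasttext_embeddings.vec"      # assumed file name

    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.rstrip().split()  # same whitespace split the loader relies on
            if len(parts) != expected_dim + 1:
                # Either the token contains whitespace or the vector has the
                # wrong number of floats; both break the loading code.
                print("Suspicious line %d: %d fields" % (line_no, len(parts)))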
Here are the commands to generate a suitable embedding file (I tested them only on an old version of fasttext; I hope they still work with the most recent version):
wordslist.txt contains one token per line, covering all the tokens you have in the train/dev/test set; fasttext_embeddings.vec is then in the right format and contains the embeddings for all your tokens. As noted before, tokens that contain whitespace cannot be processed by this implementation so far.
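As a rough sketch of the same idea using the fasttext Python bindings instead of the fasttext command line tool ("model.bin" is only a placeholder for a pre-trained binary model), producing fasttext_embeddings.vec from wordslist.txt could look like this:

    # Sketch only: build a GloVe-style embedding file for a fixed token list.
    # Assumes the official `fasttext` pip package and a pre-trained .bin
    # model; "model.bin" is a placeholder, not a file from this repository.
    import fasttext

    model = fasttext.load_model("model.bin")

    with open("wordslist.txt", encoding="utf-8") as f_in, \
         open("fasttext_embeddings.vec", "w", encoding="utf-8") as f_out:
        for line in f_in:
            token = line.strip()
            if not token or " " in token:   # skip empty lines and whitespace tokens
                continue
            vec = model.get_word_vector(token)
            f_out.write(token + " " + " ".join("%.6f" % v for v in vec) + "\n")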
@nreimers, thank you so much for your response.
Hi,
Thanks for offering this great implementation.
I was wondering if the scripts support using fasttext embeddings. If not, what should we do to make fasttext embeddings work with the current implementation?
thanks in advance!!