
using fasttext embeddings #34

Open
Dragon615 opened this issue Nov 27, 2018 · 5 comments

@Dragon615

Hi,
Thanks for offering this great implementation.
I was wondering if the scripts support using fasttext embeddings. If not, what should we do to make fasttext embeddings work with the current implementation?
thanks in advance!!

@nreimers
Member

Hi @Dragon615
This implementation works with fasttext embeddings only to a limited extent. For scientific experiments it is sufficient; for usage in real applications, however, it has some drawbacks when used with fasttext embeddings.

My workaround for using fasttext embeddings:
I extracted all words in the train, dev, and test sets and generated the embeddings for these words with fasttext. I stored these embeddings in a text file in the same format as the GloVe embeddings:
the 0.13 0.53 0.64 ....
cat 0.15 0.75 0.23 ...

I then passed this embedding lookup file to this implementation to train the BiLSTM-CRF network.

For experiments, this workaround is fine, as we know the test set in advance. In a real application it would no longer work, because embeddings for new (unknown) tokens would not be generated. In that case, the implementation of the BiLSTM network would have to be extended so that it can compute fasttext embeddings on the fly.
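
For illustration, here is a minimal sketch of the token extraction step (not code from this repository; it assumes CoNLL-style files with the token in the first column and blank lines between sentences, and the file names are placeholders):

# Collect every token from the train/dev/test files and write one token
# per line, so the list can later be fed to fasttext.
vocab = set()
for path in ["train.txt", "dev.txt", "test.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                          # skip blank sentence separators
                vocab.add(line.split()[0])    # first column holds the token

with open("wordslist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(vocab)) + "\n")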

Note:
In my paper I analyzed fasttext embeddings for various sequence tagging tasks, and the performance was rather bad on all of them. For noisy data (like Twitter), I think fasttext embeddings can bring a benefit. But for reasonably clean English data, I would prefer other embeddings (my personal recommendation: the embeddings by Komninos et al.).

@Dragon615
Author

Thanks, @nreimers for your prompt response.

I'm interested in experimenting with and using the fasttext pretrained embeddings offered by Facebook for different languages.
So I started with the English fasttext pretrained embeddings. The format of the ".vec" file they offer seems to match the format you mentioned. The only difference I noticed is an extra whitespace at the end of each line of the fasttext pretrained embeddings file, which I removed. The format is as follows:

story 0.032732 -0.18461 -0.050295 ....
various -0.11483 0.02119 -0.17601 ....
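
For reference, a minimal cleaning sketch along these lines (file names are placeholders; skipping the first line assumes the Facebook .vec layout, whose header line holds only the word count and the dimension):

# Strip trailing whitespace from each line and drop the header line, so that
# every remaining line is one token followed by space-separated floats.
with open("wiki.en.vec", encoding="utf-8") as src, \
     open("fasttext_clean.vec", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i == 0 and len(line.split()) == 2:
            continue  # header line: word count and dimension only
        dst.write(line.rstrip() + "\n")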

When I specify the path of the fasttext pretrained embeddings file (i.e. the .vec file) and run the script (Train_POS.py), I get the following error:

tokens = Embedding(input_dim=self.embeddings.shape[0], output_dim=self.embeddings.shape[1], weights=[self.embeddings], trainable=False, name='word_embeddings')(tokens_input)
IndexError: tuple index out of range.

What I understand from your explanation is that, to use fasttext embeddings, I need to generate the word embeddings from the train, dev, and test sets, since the implementation in its current state cannot deal with "unk" words. Is that correct? If so, would you please clarify this point: "I stored these embeddings in a text file in the same format as the GloVe embeddings"? I'm confused because I think the provided ".vec" file is already in the GloVe format.

Again thank you so much for your response.

@nreimers
Member

Hi,
the error can indicate that the embeddings file was not correctly loaded, i.e. that self.embeddings is not a matrix.

This can happen when the dimension of the embeddings varies, e.g. some vectors have 300 floats while others have more or fewer.

This in turn can happen when the file format is not clean, e.g. tokens that contain whitespace cannot be loaded by the current implementation.

You should check, when the embedding file is loaded, that each line contains exactly one token followed by e.g. 300 floats. The code for this uses the Python split() function.
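
A quick way to spot such lines (a minimal sketch, not the repository's own loading code; it assumes one token followed by space-separated floats per line):

# Report lines whose values cannot be parsed as floats or whose number of
# values differs from that of the first line.
def check_embedding_file(path):
    expected_dim = None
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            try:
                [float(v) for v in values]
            except ValueError:
                print("Line %d: non-numeric value for token %r" % (line_no, token))
                continue
            if expected_dim is None:
                expected_dim = len(values)
            elif len(values) != expected_dim:
                print("Line %d: %d values instead of %d (token %r)"
                      % (line_no, len(values), expected_dim, token))

check_embedding_file("fasttext_embeddings.vec")  # placeholder file name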

You understood it correctly. The code cannot deal with UNK words, i.e. it cannot generate fasttext embeddings for unknown tokens. Hence, you must ensure that all important word embeddings are present in the embedding file you are loading.

@nreimers
Member

nreimers commented Nov 29, 2018

Here is the command to generate a suitable embedding file (I tested it only with an old version of fasttext; I hope it still works with the most recent version):

./fasttext print-word-vectors model.bin < wordslist.txt > fasttext_embeddings.vec

wordslist.txt contains one token per line, covering all the tokens in your train/dev/test sets.

fasttext_embeddings.vec is then in the right format and contains the embeddings for all your tokens.

As noted before, tokens that contain whitespace cannot currently be processed by this implementation.
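
To double-check the result before training, here is a small loading sketch (again only an illustration, not the repository's own loader):

import numpy as np

# Read the generated file into a 2-D matrix, which is the shape the Keras
# Embedding layer above expects for its weights argument.
word2idx, vectors = {}, []
with open("fasttext_embeddings.vec", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word2idx[parts[0]] = len(vectors)
        vectors.append([float(v) for v in parts[1:]])

embeddings = np.array(vectors)
print(embeddings.shape)  # should be (vocab_size, embedding_dim); if this fails
                         # or is not 2-D, some lines have inconsistent lengths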

@Dragon615
Author

@nreimers, Thank you so much for your response.
