
using fasttext embeddings #34

Open
Dragon615 opened this issue Nov 27, 2018 · 5 comments

@Dragon615

Hi,
Thanks for offering this great implementation.
I was wondering if the scripts support using fasttext embeddings. If not, what should we do to make fasttext embeddings work with the current implementation?
thanks in advance!!

@nreimers
Member

Hi @Dragon615
This implementation works with fasttext embeddings only to a limited extent. For scientific experiments it is sufficient; for usage in real applications, however, it has some drawbacks when used with fasttext embeddings.

My workaround for using fasttext embeddings:
I extracted all words in the train, dev, and test sets and generated the embeddings for these words with fasttext. I stored these embeddings in a text file in the same format as the GloVe embeddings:
the 0.13 0.53 0.64 ....
cat 0.15 0.75 0.23 ...

I then passed this embedding lookup file to this implementation to train the BiLSTM-CRF network.

For experiments, this workaround is fine, as we know the test set in advance. In a real application it would no longer work, because embeddings for new (unknown) tokens would not be generated. In that case, the implementation of the BiLSTM network would have to be extended so that it can compute fasttext embeddings on the fly.
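
For illustration, here is a minimal sketch of the token extraction step (not code from this repository; it assumes CoNLL-style files with the token in the first column and blank lines between sentences, and the file names are placeholders):

# Collect every token from the train/dev/test files and write one token
# per line, so the list can later be fed to fasttext.
vocab = set()
for path in ["train.txt", "dev.txt", "test.txt"]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                          # skip blank sentence separators
                vocab.add(line.split()[0])    # first column holds the token

with open("wordslist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(vocab)) + "\n")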

Note:
In my paper I analyzed fasttext embeddings for various sequence tagging tasks, and the performance was rather bad on all of them. For noisy data (like Twitter), I think fasttext embeddings can bring a benefit. But for reasonably clean English data, I would prefer other embeddings (my personal recommendation: the embeddings by Komninos et al.).

@Dragon615
Author

Thanks, @nreimers for your prompt response.

I'm interested in experimenting with and using the fasttext pretrained embeddings offered by Facebook for different languages.
So I started with the English fasttext pretrained embeddings. The format of the ".vec" file they offer seems to match the format you mentioned. The only difference I noticed is an extra whitespace at the end of each line of the fasttext pretrained embeddings file, which I removed. The format is as follows:

story 0.032732 -0.18461 -0.050295 ....
various -0.11483 0.02119 -0.17601 ....
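
For reference, a minimal cleaning sketch along these lines (file names are placeholders; skipping the first line assumes the Facebook .vec layout, whose header line holds only the word count and the dimension):

# Strip trailing whitespace from each line and drop the header line, so that
# every remaining line is one token followed by space-separated floats.
with open("wiki.en.vec", encoding="utf-8") as src, \
     open("fasttext_clean.vec", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i == 0 and len(line.split()) == 2:
            continue  # header line: word count and dimension only
        dst.write(line.rstrip() + "\n")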

When I specify the path of the fasttext pretrained embeddings file (i.e. the .vec file) and run the script (Train_POS.py), I get the following error:

tokens = Embedding(input_dim=self.embeddings.shape[0], output_dim=self.embeddings.shape[1], weights=[self.embeddings], trainable=False, name='word_embeddings')(tokens_input)
IndexError: tuple index out of range.

What I understand from your explanation is that, to use fasttext embeddings, I need to generate the word embeddings from the train, dev, and test sets, since the implementation in its current state cannot deal with "unk" words. Is that correct? If so, would you please clarify this point: "I stored these embeddings in a text file in the same format as the GloVe embeddings"? I'm confused because I think the provided ".vec" file is already in the GloVe format.

Again thank you so much for your response.

@nreimers
Member

Hi,
the error can indicate that the embeddings file was not correctly loaded, i.e. that self.embeddings is not a matrix.

This can happen when the dimension of the embeddings varies, e.g. some vectors have 300 floats while others have more or fewer.

This in turn can happen when the file format is not clean, e.g. tokens that contain whitespace cannot be loaded by the current implementation.

You should check, when the embedding file is loaded, that each line contains exactly one token followed by e.g. 300 floats. The code for this uses the Python split() function.
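
A quick way to spot such lines (a minimal sketch, not the repository's own loading code; it assumes one token followed by space-separated floats per line):

# Report lines whose values cannot be parsed as floats or whose number of
# values differs from that of the first line.
def check_embedding_file(path):
    expected_dim = None
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            try:
                [float(v) for v in values]
            except ValueError:
                print("Line %d: non-numeric value for token %r" % (line_no, token))
                continue
            if expected_dim is None:
                expected_dim = len(values)
            elif len(values) != expected_dim:
                print("Line %d: %d values instead of %d (token %r)"
                      % (line_no, len(values), expected_dim, token))

check_embedding_file("fasttext_embeddings.vec")  # placeholder file name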

You understood it correctly. The code cannot deal with UNK words, i.e. it cannot generate fasttext embeddings for unknown tokens. Hence, you must ensure that all important word embeddings are present in the embedding file you are loading.

@nreimers
Member

nreimers commented Nov 29, 2018

Here is the command to generate a suitable embedding file (I tested it only with an old version of fasttext; I hope it still works with the most recent version):

./fasttext print-word-vectors model.bin < wordslist.txt > fasttext_embeddings.vec

wordslist.txt contains one token per line, covering all the tokens in your train/dev/test sets.

fasttext_embeddings.vec is then in the right format and contains the embeddings for all your tokens.

As noted before, tokens that contain whitespace cannot currently be processed by this implementation.
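
To double-check the result before training, here is a small loading sketch (again only an illustration, not the repository's own loader):

import numpy as np

# Read the generated file into a 2-D matrix, which is the shape the Keras
# Embedding layer above expects for its weights argument.
word2idx, vectors = {}, []
with open("fasttext_embeddings.vec", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word2idx[parts[0]] = len(vectors)
        vectors.append([float(v) for v in parts[1:]])

embeddings = np.array(vectors)
print(embeddings.shape)  # should be (vocab_size, embedding_dim); if this fails
                         # or is not 2-D, some lines have inconsistent lengths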

@Dragon615
Author

@nreimers, Thank you so much for your response.
