Train multilingual pipeline with LLaMA embeddings #12790

That's a good question! Large language models like LLaMA and GPT-NeoX are generally used as generative models, i.e., models that accept a prompt as input and generate a completion for it. But architecturally, they are similar to other Transformer models such as BERT and can theoretically be used to produce dense representations/embeddings for downstream tasks such as tagging, parsing, entity recognition, etc.
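As a rough sketch of what "producing dense representations" means here: a decoder-only model gives you one hidden-state vector per token, and a common way to get a single embedding for a span or document is to mean-pool those vectors over the non-padding tokens. The snippet below illustrates just the pooling step with random arrays standing in for real LLaMA hidden states (no actual model is loaded; `mean_pool` is a hypothetical helper, not a spaCy API):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Mean-pool per-token hidden states into one vector per sequence,
    ignoring padding positions marked 0 in the attention mask."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden)
    counts = mask.sum(axis=1)                                     # (batch, 1)
    return summed / counts

# Stand-ins for real model outputs: batch of 2, seq len 4, hidden size 8
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0],   # last token is padding
                 [1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 8): one fixed-size embedding per input
```

With real weights, `hidden` would come from the model's last (or an intermediate) layer, and the resulting vectors could feed a downstream tagger, parser, or classifier head.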

Currently, we do not support their direct usage in spaCy pipelines outside of spacy-llm, which - as you correctly concluded - is a prompting component. However, we do have a couple of new libraries in development that we hope to release in the near future. These will serve as a good …
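For comparison, the spacy-llm route mentioned above is configured as a prompting component rather than an embedding layer. A sketch of such a config follows; the exact task and model registry names (`spacy.NER.v2`, `spacy.GPT-3-5.v1`) are illustrative and depend on the installed spacy-llm version, so check the spacy-llm docs for the names your version registers:

```ini
[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PERSON", "ORG", "LOC"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
```

The point of contrast: this component sends prompts to the model and parses completions, rather than wiring the model's hidden states into the pipeline as a tok2vec-style embedding layer.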

Answer selected by shadeMe
Labels
- feat/tok2vec (Feature: Token-to-vector layer and pretraining)
- feat/llm (Feature: LLMs, incl. spacy-llm)