Where does EN_CORE_WEB_SM Store (Multi)HashEmbed Weights? #11723
-
spaCy's en_core_web_sm uses MultiHashEmbed as part of its default Tok2Vec architecture. I understand from the documentation, various articles, and great responses by @honnibal how MultiHashEmbed works at prediction time, but I don't fully understand how this layer was (pre-)trained and included in pipelines like en_core_web_sm, if it was (pre-)trained at all. Now, the questions are:
In case (Multi)HashEmbed is pre-trained: where are its weights stored in the pipeline, and how can I access them? Thanks, and I appreciate your clarification.
-
MultiHashEmbed is an architecture, and pretrained English pipelines like en_core_web_sm contain a trained instance of that architecture.
You generally don't want to re-use tok2vecs when training. It's possible but doesn't usually offer an advantage.
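That said, if you did want to re-use one, here's a minimal sketch of how sourcing a trained tok2vec into a new pipeline could look, using the source argument of Language.add_pipe:

```python
import spacy

# Load the pipeline that already contains a trained tok2vec.
source_nlp = spacy.load("en_core_web_sm")

# Copy the trained component (weights included) into a fresh pipeline.
nlp = spacy.blank("en")
nlp.add_pipe("tok2vec", source=source_nlp)
```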
The tok2vec component is serialized in the tok2vec directory of the pipeline. Details like the number of rows are specified in the model config (nlp.config). The embedding tables are accessible by walking deep into the tok2vec model structure, but they are not exposed for easy access. There is normally no reason for you to access them directly, since the model handles the embedding process. Is there something you want to access them for?
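If you do need to look at them, here's a rough sketch. It assumes the Thinc HashEmbed layers are named "hashembed" internally and store their table in a parameter called "E"; those internal names may differ across versions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
tok2vec = nlp.get_pipe("tok2vec")

# The embedding settings (attrs, rows, width) recorded in the config:
print(nlp.config["components"]["tok2vec"]["model"]["embed"])

# Walk the Thinc model tree and pull out each HashEmbed table.
for node in tok2vec.model.walk():
    if node.name == "hashembed" and node.has_param("E"):
        table = node.get_param("E")  # one row per hash bucket
        print(node.name, table.shape)
```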
The vectors are trained by backpropagating from task-specific heads like the tagger and parser. This is configurable. They are not just compressing word vectors, though word vectors can be used as features.
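To illustrate that wiring: in pipelines like en_core_web_sm, components such as the tagger connect to the shared tok2vec through a listener layer, which is also the path their gradients take back into the embedding weights during training. You can see this in the config; the exact key path below assumes the standard shared-tok2vec layout:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The tagger's sub-network receives features from the shared tok2vec
# via a listener rather than embedding tokens itself.
print(nlp.config["components"]["tagger"]["model"]["tok2vec"]["@architectures"])
# Expected: "spacy.Tok2VecListener.v1"
```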