Where does EN_CORE_WEB_SM Store (Multi)HashEmbed Weights? #11723
-
spaCy's en_core_web_sm uses MultiHashEmbed as part of its default Tok2Vec architecture. I understand from the documentation, various articles, and great responses by @honnibal how MultiHashEmbed works at prediction time, but I don't fully understand how this layer was (pre-)trained and included in pipelines like en_core_web_sm, if it was (pre-)trained at all. Now, the questions are:
In case (Multi)HashEmbed is pre-trained: where are its weights stored in the pipeline, and how can I access them? Thanks, and I appreciate your clarification.
-
MultiHashEmbed is an architecture, and pretrained English pipelines like en_core_web_sm contain a trained instance of that architecture.
You generally don't want to re-use tok2vecs when training. It's possible but doesn't usually offer an advantage.
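That said, if you did want to re-use one, here's a minimal sketch of how sourcing a trained tok2vec into a new pipeline could look, using the source argument of Language.add_pipe:

```python
import spacy

# Load the pipeline that already contains a trained tok2vec.
source_nlp = spacy.load("en_core_web_sm")

# Copy the trained component (weights included) into a fresh pipeline.
nlp = spacy.blank("en")
nlp.add_pipe("tok2vec", source=source_nlp)
```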
The tok2vec component is serialized in the tok2vec directory of the pipeline. Details like the number of rows are specified in the model config (nlp.config). The embedding tables are accessible by walking deep into the tok2vec model structure, but they are not exposed for easy access. There is normally no reason for you to access them directly, since the model handles the embedding process. Is there something you want to access them for?
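If you do need to look at them, here's a rough sketch. It assumes the Thinc HashEmbed layers are named "hashembed" internally and store their table in a parameter called "E"; those internal names may differ across versions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
tok2vec = nlp.get_pipe("tok2vec")

# The embedding settings (attrs, rows, width) recorded in the config:
print(nlp.config["components"]["tok2vec"]["model"]["embed"])

# Walk the Thinc model tree and pull out each HashEmbed table.
for node in tok2vec.model.walk():
    if node.name == "hashembed" and node.has_param("E"):
        table = node.get_param("E")  # one row per hash bucket
        print(node.name, table.shape)
```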
The vectors are trained by backpropagating from task-specific heads like the tagger and parser. This is configurable. They are not just compressing word vectors, though word vectors can be used as features.
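To illustrate that wiring: in pipelines like en_core_web_sm, components such as the tagger connect to the shared tok2vec through a listener layer, which is also the path their gradients take back into the embedding weights during training. You can see this in the config; the exact key path below assumes the standard shared-tok2vec layout:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The tagger's sub-network receives features from the shared tok2vec
# via a listener rather than embedding tokens itself.
print(nlp.config["components"]["tagger"]["model"]["tok2vec"]["@architectures"])
# Expected: "spacy.Tok2VecListener.v1"
```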