Where does EN_CORE_WEB_SM Store (Multi)HashEmbed Weights? #11723

MultiHashEmbed is an architecture, and pretrained English pipelines like en_core_web_sm contain a trained instance of that architecture.

Is MultiHashEmbed pre-trained and ready for use without any further training, right after spacy.load("en_core_web_sm")?

You generally don't want to re-use tok2vecs when training. It's possible but doesn't usually offer an advantage.
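If you do want to re-use the trained tok2vec anyway, spaCy v3's config system supports sourcing a component from an installed pipeline. A minimal sketch (the surrounding training config is assumed; freezing the component keeps its weights fixed during training):

```ini
[components.tok2vec]
source = "en_core_web_sm"

[training]
frozen_components = ["tok2vec"]
```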

Where are the vector weights of the (small) embedding table stored? Are they accessible through the EN_CORE_WEB_SM instance? What is the size (number of rows) of the embedding table?

The tok2vec component is serialized in the tok2vec directory of the pipeline. Details like the number of rows are specified in the model config (nlp.config).

Answer selected by ashammad
Labels: docs (Documentation and website), lang / en (English language data and models)