
Custom suggester based on POS tags: Shape mismatch for blis.gemm #13887

@qacollective

Description


Background

I noticed a fairly concrete pattern in the spans I was trying to categorize using Spancat, so I built a custom suggester function based on POS tags and noun chunks. It works really well, reducing the predicted spans to only good candidates.

I am now trying to train my pipeline with that suggester, having understood that I need to include more components in the pipeline so that records reach the suggester with the requisite annotations. However, I can't get past an error raised deep inside spaCy's libraries, and I can't easily see how my code and the error are connected. I understand abstractly that a layer's dimensions don't match what was expected, but I'm not sure how those expectations get set.
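For reference, the suggester contract (as I understand it from the spaCy docs) is a function that takes a batch of Docs and returns a Ragged array of (start, end) token offsets plus a per-doc span count. Here is a simplified sketch of just that data layout in plain numpy, using made-up span lists instead of real Docs (the real implementation wraps this in a thinc Ragged):

```python
import numpy as np

def flatten_spans(per_doc_spans):
    """Flatten per-document lists of (start, end) token offsets into
    the layout a spancat suggester returns: an (n_spans, 2) integer
    array of offsets plus a per-doc length array."""
    lengths = np.array([len(spans) for spans in per_doc_spans], dtype="int32")
    flat = [pair for spans in per_doc_spans for pair in spans]
    data = np.array(flat, dtype="int32").reshape(-1, 2)
    return data, lengths

# Two fake docs: one with two candidate spans, one with a single span.
data, lengths = flatten_spans([[(0, 2), (2, 5)], [(1, 3)]])
```

My custom suggester produces this same shape of output, just with the (start, end) pairs taken from noun chunks and POS-tag patterns rather than hard-coded lists.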

Error

The error I'm getting is: ValueError: Shape mismatch for blis.gemm: (40, 384), (1024, 384).
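If it helps, my reading of the error is just the usual matrix-multiplication dimension rule; here is a numpy sketch (numpy standing in for blis, and the variable roles are my guess at what these shapes correspond to):

```python
import numpy as np

# The two shapes from the error message.
x = np.zeros((40, 384))    # e.g. 40 candidate-span vectors of width 384
w = np.zeros((1024, 384))  # a layer apparently expecting width-1024 input

# A plain matrix product needs the inner dimensions to agree,
# so (40, 384) @ (1024, 384) fails just like blis.gemm does:
try:
    _ = x @ w
    ok = True
except ValueError:
    ok = False

# Only a transposed product lines up: (40, 384) @ (384, 1024) -> (40, 1024)
out = x @ w.T
```

So somewhere a layer was built expecting 1024-wide input but is being fed 384-wide vectors; what I can't work out is where those widths come from in my config.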

I have tried:

  • replacing my custom suggester with the standard n-gram suggester, which gave much the same error: ValueError: Shape mismatch for blis.gemm: (670, 384), (1024, 384)
  • changing the order of the components in the pipeline; I've tried ["tok2vec", "tagger", "parser", "attribute_ruler", "spancat"], ["tagger", "parser", "attribute_ruler", "tok2vec", "spancat"] and ["tok2vec", "tagger", "attribute_ruler", "parser", "spancat"]
  • removing the replace_listeners setting from each of the statically sourced components upstream of spancat
  • not sourcing the tok2vec component at all, i.e. starting with a new trainable one and unfreezing it, as per https://github.com/adrianeboyd/workshop-dh2023/blob/main/litbank/configs/spancat_subtree_lg.cfg
  • removing each of the pipeline components individually, though of course each is there for a good reason, with the custom suggester needing the POS tagging
  • reading everything on the internet

As this is driving me a bit nuts, I would appreciate even a whack over the head telling me what I'm doing wrong! I'm happy to post more code if needed; I'm trying to get this working for an urgent workplace project.

How to reproduce the behaviour

[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger", "attribute_ruler", "parser", "spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = 256
upstream = "*"

[components.spancat.suggester]
@misc = "my_custom_suggester.v1"

[components.tok2vec]
source = "en_core_web_lg"

[components.tagger]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]

[components.parser]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]

[components.attribute_ruler]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["tok2vec","tagger","parser","attribute_ruler"]
annotating_components = ["tok2vec","tagger","parser","attribute_ruler"]
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Your Environment

  • spaCy version: 3.8.4
  • Platform: Windows-11-10.0.22631-SP0
  • Python version: 3.12.9
  • Pipelines: en_core_web_lg (3.8.0), en_core_web_trf (3.8.0)
