Adding many special cases to Tokenizer greatly degrades startup performance #12523

This was indeed related to the internal caches and should be fixed by #12553 (to be published in the next release, probably v3.6.0).
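For context, the pattern that triggered the slowdown is registering many tokenizer special cases. A minimal sketch of what that looks like (the terms and the loop here are illustrative, not from the original report):

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")

# A special case maps an exact string to a fixed token sequence.
nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])

# Registering many such cases (e.g. thousands of abbreviations that should
# keep their trailing period) is the scenario from this discussion's title.
for i in range(1000):
    term = f"abbr{i}."  # hypothetical terms, for illustration only
    nlp.tokenizer.add_special_case(term, [{ORTH: term}])

doc = nlp("see abbr5. now")
tokens = [t.text for t in doc]
```

With the special case registered, `abbr5.` survives as a single token instead of being split at the period.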

I also tried out a custom retokenizing span ruler out of curiosity, but at runtime it was a lot slower than the tokenizer (something like 4x slower). In case anyone is interested in doing something similar:

from spacy.language import Language
from spacy.pipeline import SpanRuler
from spacy.util import filter_spans


@Language.factory("retokenizing_span_ruler")
def make_retokenizing_span_ruler(
    nlp: Language,
    name: str,
):
    return RetokenizingSpanRuler(nlp, name)


class RetokenizingSpanRuler(SpanRuler):
    def set_annotations(self, doc, matches):
        ...
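The `set_annotations` body is elided above. One plausible completion, assuming the goal is to merge each matched span into a single token via `doc.retokenize()` (my guess at the intent, not the author's actual code; the factory name here is changed to mark it as a sketch):

```python
import spacy
from spacy.language import Language
from spacy.pipeline import SpanRuler
from spacy.util import filter_spans


class RetokenizingSpanRuler(SpanRuler):
    def set_annotations(self, doc, matches):
        # Drop overlapping matches (filter_spans keeps the longest span),
        # then merge each remaining span into one token. Doing this on
        # every doc at runtime is what makes this approach slower than
        # tokenizer special cases.
        with doc.retokenize() as retokenizer:
            for span in filter_spans(matches):
                retokenizer.merge(span)


@Language.factory("retokenizing_span_ruler_sketch")  # hypothetical name
def make_retokenizing_span_ruler(nlp: Language, name: str):
    return RetokenizingSpanRuler(nlp, name)


nlp = spacy.blank("en")
ruler = nlp.add_pipe("retokenizing_span_ruler_sketch")
ruler.add_patterns([{"label": "MERGED", "pattern": "New York"}])
doc = nlp("She moved to New York last year.")
```

After processing, the two tokens matched by the `"New York"` phrase pattern are merged into a single token.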

Answer selected by adrianeboyd