Adding many special cases to Tokenizer greatly degrades startup performance #12523
-
Hey folks, I wasn't sure whether to flag this as a bug or a question, so to play it safe I opened it as a discussion. Recently, after running some benchmarks, I've noticed that adding a large number of special cases to the spaCy tokenizer (in this case over 200k) severely impacts the time it takes to load the pipeline. For clarity, I'm attempting to add compound English phrases to the tokenizer (like "lay out" or "garbage man") so they are preserved as single tokens when processing text.
I would have thought that the last case would have been the most performant, since it writes the tokenizer to disk with all of the special cases already contained in it, so I was surprised to see it perform so poorly. For the second case the latency would make sense, since it has to iterate over 200k phrases individually and add each to the tokenizer via tokenizer.add_special_case. The reason I am filing this as a discussion and not a bug is that I'm not sure whether this is the best way to achieve what I'm hoping to, or whether there is something I can do on my end to improve performance. I can provide code snippets as needed, though right now it's all pretty straightforward (loading a pipeline via spacy.load and adding each special case in a loop).
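For reference, the loop is roughly the following (the model and the file name are just placeholders for what we actually use):

import spacy

nlp = spacy.load("en_core_web_sm")

# "compound_phrases.txt" is a placeholder: one compound phrase per line,
# e.g. "lay out", "garbage man", ...
with open("compound_phrases.txt", encoding="utf8") as f:
    for line in f:
        phrase = line.strip()
        if phrase:
            # Register the phrase as a single-token special case so the
            # tokenizer keeps it together instead of splitting on the space.
            nlp.tokenizer.add_special_case(phrase, [{"ORTH": phrase}])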
-
Currently the special cases are not saved in a precompiled form with to_disk, so they still have to be re-added to the tokenizer when the saved pipeline is loaded. This is a pretty large number of special cases, though, and I'm trying to think if there's a better way to handle it. The tokenizer updates its internal caches for each special case that's added. I'll initially take a look to see if I can reproduce the issue with slow loading times for special cases.
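Something along these lines should show the effect; the model, the synthetic phrases, and the output path below are just placeholders:

import time
import spacy

nlp = spacy.load("en_core_web_sm")
# Synthetic stand-in for a large list of compound phrases.
phrases = [f"word{i} word{i + 1}" for i in range(200_000)]

start = time.perf_counter()
for phrase in phrases:
    nlp.tokenizer.add_special_case(phrase, [{"ORTH": phrase}])
print(f"add_special_case loop: {time.perf_counter() - start:.1f}s")

nlp.to_disk("/tmp/pipeline_with_cases")

start = time.perf_counter()
nlp2 = spacy.load("/tmp/pipeline_with_cases")
print(f"spacy.load on saved pipeline: {time.perf_counter() - start:.1f}s")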
-
Hey @adrianeboyd, Thanks so much for the response + assistance! Would love to be able to get these load times under control, as I've since realized that loading these special cases is greatly impacting the wait times of our production app 😅 If it helps with testing, I've attached the list of special cases I'm using. As I mentioned, I'm just looping through them line-by-line and loading them into the tokenizer. Happy to open an issue for this as well if you think that's a better way of tracking it. One idea I had was to write a custom pipeline component that merges these phrases back into single tokens after tokenization.
-
This was indeed related to the internal caches and should be fixed by #12553 (to be published in the next release, probably v3.6.0). I also tried out a custom retokenizing span ruler just to see, but it was a lot slower than the tokenizer at runtime (something like ~4x slower?). In case anyone is interested in doing something similar:

from spacy.language import Language
from spacy.pipeline import SpanRuler
from spacy.util import filter_spans


@Language.factory("retokenizing_span_ruler")
def make_retokenizing_span_ruler(
    nlp: Language,
    name: str,
):
    return RetokenizingSpanRuler(nlp, name)


class RetokenizingSpanRuler(SpanRuler):
    def set_annotations(self, doc, matches):
        # Instead of saving the matched spans to doc.spans, merge each
        # matched phrase into a single token. filter_spans removes
        # overlapping matches, preferring the longest span.
        spans = filter_spans(matches)
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        return doc
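For completeness, usage would look roughly like this (the label, model, and example phrases are placeholders, and it assumes the factory above has already been registered in the current session):

import spacy

nlp = spacy.load("en_core_web_sm")
# Run the ruler right after the tokenizer so downstream components see
# the merged tokens.
ruler = nlp.add_pipe("retokenizing_span_ruler", first=True)
ruler.add_patterns([
    {"label": "PHRASE", "pattern": "lay out"},
    {"label": "PHRASE", "pattern": "garbage man"},
])

doc = nlp("The garbage man will lay out the bins.")
print([t.text for t in doc])
# ['The', 'garbage man', 'will', 'lay out', 'the', 'bins', '.']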