Adding many special cases to Tokenizer greatly degrades startup performance #12523
-
Hey folks, I wasn't sure whether to flag this as a bug or a question, so to play it safe I opened it as a discussion. Recently, after running some benchmarks, I've noticed that adding a large number of special cases to the spaCy tokenizer (in this case over 200k) severely impacts the time it takes to load the pipeline. For clarity, I'm attempting to add compound English phrases to the tokenizer (like "lay out" or "garbage man") so they are preserved as single tokens when processing text.
I would have thought that the last case would have been the most performant, since it writes the tokenizer to disk with all of the special cases already contained in it, so I was surprised to see it perform so poorly. For the second case the latency would make sense, since it has to iterate over 200k phrases individually and add each to the tokenizer via tokenizer.add_special_case. The reason I am filing this as a discussion and not a bug is that I'm not sure whether this is the best way to achieve what I'm hoping to, or whether there is something I can do on my end to improve performance. I can provide code snippets as needed, though right now it's all pretty straightforward (loading a pipeline via spacy.load and adding each special case in a loop).
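For reference, the loop is roughly the following (the model and the file name are just placeholders for what we actually use):

import spacy

nlp = spacy.load("en_core_web_sm")

# "compound_phrases.txt" is a placeholder: one compound phrase per line,
# e.g. "lay out", "garbage man", ...
with open("compound_phrases.txt", encoding="utf8") as f:
    for line in f:
        phrase = line.strip()
        if phrase:
            # Register the phrase as a single-token special case so the
            # tokenizer keeps it together instead of splitting on the space.
            nlp.tokenizer.add_special_case(phrase, [{"ORTH": phrase}])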
-
Currently the special cases are not saved in a precompiled form with to_disk, so they still have to be re-added to the tokenizer when the saved pipeline is loaded. This is a pretty large number of special cases, though, and I'm trying to think if there's a better way to handle it. The tokenizer updates its internal caches for each special case that's added. I'll initially take a look to see if I can reproduce the issue with slow loading times for special cases.
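Something along these lines should show the effect; the model, the synthetic phrases, and the output path below are just placeholders:

import time
import spacy

nlp = spacy.load("en_core_web_sm")
# Synthetic stand-in for a large list of compound phrases.
phrases = [f"word{i} word{i + 1}" for i in range(200_000)]

start = time.perf_counter()
for phrase in phrases:
    nlp.tokenizer.add_special_case(phrase, [{"ORTH": phrase}])
print(f"add_special_case loop: {time.perf_counter() - start:.1f}s")

nlp.to_disk("/tmp/pipeline_with_cases")

start = time.perf_counter()
nlp2 = spacy.load("/tmp/pipeline_with_cases")
print(f"spacy.load on saved pipeline: {time.perf_counter() - start:.1f}s")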
-
Hey @adrianeboyd, Thanks so much for the response + assistance! Would love to be able to get these load times under control, as I've since realized that loading these special cases is greatly impacting the wait times of our production app 😅 If it helps with testing, I've attached the list of special cases I'm using. As I mentioned, I'm just looping through them line-by-line and loading them into the tokenizer. Happy to open an issue for this as well if you think that's a better way of tracking it. One idea I had was to write a custom pipeline component that merges these phrases back into single tokens after tokenization.
-
This was indeed related to the internal caches and should be fixed by #12553 (to be published in the next release, probably v3.6.0). I also tried out a custom retokenizing span ruler just to see, but it was a lot slower than the tokenizer at runtime (something like ~4x slower?). In case anyone is interested in doing something similar:

from spacy.language import Language
from spacy.pipeline import SpanRuler
from spacy.util import filter_spans


@Language.factory("retokenizing_span_ruler")
def make_retokenizing_span_ruler(
    nlp: Language,
    name: str,
):
    return RetokenizingSpanRuler(nlp, name)


class RetokenizingSpanRuler(SpanRuler):
    def set_annotations(self, doc, matches):
        # Instead of saving the matched spans to doc.spans, merge each
        # matched phrase into a single token. filter_spans removes
        # overlapping matches, preferring the longest span.
        spans = filter_spans(matches)
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        return doc
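For completeness, usage would look roughly like this (the label, model, and example phrases are placeholders, and it assumes the factory above has already been registered in the current session):

import spacy

nlp = spacy.load("en_core_web_sm")
# Run the ruler right after the tokenizer so downstream components see
# the merged tokens.
ruler = nlp.add_pipe("retokenizing_span_ruler", first=True)
ruler.add_patterns([
    {"label": "PHRASE", "pattern": "lay out"},
    {"label": "PHRASE", "pattern": "garbage man"},
])

doc = nlp("The garbage man will lay out the bins.")
print([t.text for t in doc])
# ['The', 'garbage man', 'will', 'lay out', 'the', 'bins', '.']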