What is the official API for specifying a fully custom tokenizer? #12579
I am trying to find out how to completely replace the tokenization step in spaCy, so that I can pass in a pre-tokenized list of words. Apparently this used to be possible with
But now that doesn't work either, giving:
So apparently something was changed so that the tokenizer is no longer the only thing doing tokenization. Is there a documented, stable API for the part of the process that says "Listen spaCy, when I call
Replies: 1 comment
A tokenizer should have the signature `Callable[[str], Doc]`, so one option is to provide your input in some string format that you can process to split into a list. I wouldn't really recommend this for the case with a list of words, since the `Doc` API already supports this.

The `nlp` pipeline is `Callable[[Union[str, Doc]], Doc]`, and if you provide a `Doc` as input, then the pipeline skips the tokenizer, so you can do this:

```python
doc = Doc(nlp.vocab, words=["This", "is", "a", "sentence", "."])
doc = nlp(doc)
```

You can replace `doc = Doc(...)` with `doc = my_custom_function(something)` as long as it returns a doc.

Edited to add: as long as it returns a doc with the correct vocab, so it's probably `my_custom_f…`
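For reference, here is a minimal, self-contained sketch of both approaches described above, assuming a recent spaCy version. `whitespace_tokenizer` is an illustrative name, not a spaCy API; `spacy.blank("en")` is used so no trained model is needed:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank English pipeline; no trained model required

# Option 1: skip the tokenizer entirely by passing a pre-built Doc.
# A Doc input to nlp() bypasses nlp.tokenizer and goes straight to the pipeline.
doc = Doc(nlp.vocab, words=["This", "is", "a", "sentence", "."])
doc = nlp(doc)

# Option 2: replace nlp.tokenizer with any Callable[[str], Doc].
def whitespace_tokenizer(text: str) -> Doc:
    # Naive split on whitespace; the returned Doc must use the pipeline's vocab.
    return Doc(nlp.vocab, words=text.split())

nlp.tokenizer = whitespace_tokenizer
doc2 = nlp("This is a sentence .")

print([t.text for t in doc])
print([t.text for t in doc2])
```

Option 1 is the lighter-weight choice when the input is already a word list; option 2 is useful when the whole pipeline should consistently apply your tokenization to raw strings.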