What is the official API for specifying a fully custom tokenizer? #12579
I am trying to find out how to completely replace the tokenization step in spaCy, so that I can pass in a pre-tokenized list of words. Apparently this used to be possible with
But now that doesn't work either, giving:
So apparently something was changed so that the tokenizer is no longer the only thing doing tokenization. Is there a documented, stable API for the part of the process that says "Listen spaCy, when I call
Replies: 1 comment
A tokenizer should have the signature `Callable[[str], Doc]`, so one option is to provide your input in some string format that you can process to split into a list. I wouldn't really recommend this for the case with a list of words, since the `Doc` API already supports this.

The `nlp` pipeline is `Callable[[Union[str, Doc]], Doc]`, and if you provide a `Doc` as input, then the pipeline skips the tokenizer, so you can do this:

```python
doc = Doc(nlp.vocab, words=["This", "is", "a", "sentence", "."])
doc = nlp(doc)
```

You can replace `doc = Doc(...)` with `doc = my_custom_function(something)` as long as it returns a doc.

Edited to add: as long as it returns a doc with the correct vocab, so it's probably `my_custom_f…`
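For reference, here is a minimal, self-contained sketch of both approaches described above, assuming a recent spaCy version. `whitespace_tokenizer` is an illustrative name, not a spaCy API; `spacy.blank("en")` is used so no trained model is needed:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank English pipeline; no trained model required

# Option 1: skip the tokenizer entirely by passing a pre-built Doc.
# A Doc input to nlp() bypasses nlp.tokenizer and goes straight to the pipeline.
doc = Doc(nlp.vocab, words=["This", "is", "a", "sentence", "."])
doc = nlp(doc)

# Option 2: replace nlp.tokenizer with any Callable[[str], Doc].
def whitespace_tokenizer(text: str) -> Doc:
    # Naive split on whitespace; the returned Doc must use the pipeline's vocab.
    return Doc(nlp.vocab, words=text.split())

nlp.tokenizer = whitespace_tokenizer
doc2 = nlp("This is a sentence .")

print([t.text for t in doc])
print([t.text for t in doc2])
```

Option 1 is the lighter-weight choice when the input is already a word list; option 2 is useful when the whole pipeline should consistently apply your tokenization to raw strings.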