
What is the official API for specifying a fully custom tokenizer? #12579

A tokenizer should have the signature `Callable[[str], Doc]`, so one option is to encode your input in some string format that your tokenizer can parse back into a list of words. I wouldn't really recommend this for the case with a list of words, though, since the `Doc` API already supports pretokenized input directly.
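
If you do want a fully custom tokenizer, you can assign any `Callable[[str], Doc]` to `nlp.tokenizer`. Here's a minimal sketch along the lines of the whitespace tokenizer example in the spaCy docs (the class name and the splitting rule are just for illustration):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """A minimal custom tokenizer: a Callable[[str], Doc]."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # split() drops empty strings, so no zero-length tokens
        words = text.split()
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("This is a sentence .")
```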

The `nlp` pipeline itself is `Callable[[Union[str, Doc]], Doc]`, and if you provide a `Doc` as input, the pipeline skips the tokenizer, so you can do this:

```python
from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=["This", "is", "a", "sentence", "."])
doc = nlp(doc)  # tokenizer is skipped, the rest of the pipeline runs
```

You can replace `doc = Doc(...)` with `doc = my_custom_function(something)` as long as it returns a `Doc`.

Edited to add: as long as it returns a `Doc` created with the correct vocab, so it's probably `my_custom_function(nlp.vocab, …)` in practice.
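
A sketch of what such a function could look like — the function name comes from the comment above, but the pretokenized-input parameter is just an assumption for illustration:

```python
from spacy.tokens import Doc

def my_custom_function(vocab, pretokenized):
    # Build a Doc from already-tokenized input, using the pipeline's
    # own vocab so downstream components can process it.
    return Doc(vocab, words=pretokenized)

doc = nlp(my_custom_function(nlp.vocab, ["This", "is", "a", "sentence", "."]))
```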

Answer selected by adrianeboyd
Labels: feat / pipeline (Feature: Processing pipeline and components), feat / tokenizer (Feature: Tokenizer)