Tokenizer splits certain tokens but doesn't split other tokens of the same type. #12448
I'm facing a strange problem with the spaCy tokenizer. By default, spaCy splits hyphenated words, so I wrote the custom tokenizer below to handle that issue:
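A sketch of such a custom tokenizer, assuming the usual approach of rebuilding the default infix patterns without the letter-hyphen-letter rule (the exact code may differ):

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Rebuild the default infixes, leaving out the rule that splits on a hyphen
# between two letters, so hyphenated words stay as single tokens.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # The default letter-hyphen-letter rule is intentionally omitted here.
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("mother-in-law")])  # ['mother-in-law']
```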
So the issue with spaCy splitting hyphenated words like "ABC-123" is now resolved. But when I encounter a token like "DC-8T", it splits it into "DC-8" and "T", even though it should be treated as a single token. On the other hand, a token like "DC-8C" is not split. What causes this difference in spaCy's behaviour when I replace the "C" in "DC-8C" with a "T" to get "DC-8T"? I want both "DC-8C" and "DC-8T" (or any digit followed by a letter, or vice versa) to be treated as single tokens. How do I fix this? When I run nlp.tokenizer.explain() on my text, it shows the trailing "T" being split off "DC-8T" by a SUFFIX rule, while "DC-8C" comes back as a single TOKEN.
Can someone please help me with this issue? I don't want either of them to be split; both should be treated as single tokens.

How to reproduce the behaviour

Try the sentence "I don't know why DC-8T is split but DC-8C is not" with the custom tokenizer above, or even with the default spaCy tokenizer; a minimal reproduction sketch follows.
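A minimal way to reproduce and inspect the splits, assuming a plain en_core_web_sm pipeline (the comments describe the output I would expect, not a captured run):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I don't know why DC-8T is split but DC-8C is not"

# Final tokens: "DC-8T" should come out as two pieces, "DC-8C" as one.
print([t.text for t in nlp(text)])

# Ask the tokenizer which rule produced each piece; the trailing "T" of
# "DC-8T" should be attributed to a SUFFIX rule.
for pattern, piece in nlp.tokenizer.explain(text):
    print(pattern, repr(piece))
```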
Your Environment
Replies: 1 comment
tokenizer.explain is showing you where to adjust the settings: there is a suffix pattern that is splitting off "T" in this case. In this particular case it's related to the UNITS (like "T" for terabyte).