Tokenizer splits certain tokens but doesn't split other tokens of the same type. #12448
I'm facing a strange problem with the spaCy tokenizer. By default, spaCy splits hyphenated words, so I wrote the custom tokenizer below to handle that issue:
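A sketch of such a custom tokenizer, assuming the usual approach of rebuilding the default infix patterns without the letter-hyphen-letter rule (the exact code may differ):

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Rebuild the default infixes, leaving out the rule that splits on a hyphen
# between two letters, so hyphenated words stay as single tokens.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # The default letter-hyphen-letter rule is intentionally omitted here.
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("mother-in-law")])  # ['mother-in-law']
```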
So the issue with spaCy splitting hyphenated words like "ABC-123" is now resolved. But when I encounter a token like "DC-8T", it splits it into "DC-8" and "T", even though it should be treated as a single token. On the other hand, a token like "DC-8C" is not split. What causes this difference in spaCy's behaviour when I replace the "C" in "DC-8C" with a "T" to get "DC-8T"? I want both "DC-8C" and "DC-8T" (or any digit followed by a letter, or vice versa) to be treated as single tokens. How do I fix this? When I run nlp.tokenizer.explain() on my text, it shows the trailing "T" being split off "DC-8T" by a SUFFIX rule, while "DC-8C" comes back as a single TOKEN.
Can someone please help me with this issue? I don't want either of them to be split; both should be treated as single tokens.

How to reproduce the behaviour

Try the sentence "I don't know why DC-8T is split but DC-8C is not" with the custom tokenizer above, or even with the default spaCy tokenizer; a minimal reproduction sketch follows.
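A minimal way to reproduce and inspect the splits, assuming a plain en_core_web_sm pipeline (the comments describe the output I would expect, not a captured run):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I don't know why DC-8T is split but DC-8C is not"

# Final tokens: "DC-8T" should come out as two pieces, "DC-8C" as one.
print([t.text for t in nlp(text)])

# Ask the tokenizer which rule produced each piece; the trailing "T" of
# "DC-8T" should be attributed to a SUFFIX rule.
for pattern, piece in nlp.tokenizer.explain(text):
    print(pattern, repr(piece))
```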
Your Environment
Replies: 1 comment
tokenizer.explain is showing you where to adjust the settings: there is a suffix pattern that is splitting off "T" in this case. In this particular case it's related to the UNITS (like "T" for terabyte).