Infixes Update Not Applying Properly to Tokenizer #13779
Unanswered
Rayan-Allali
asked this question in Help: Coding & Implementations
Replies: 1 comment
-
I reproduced your issue here using `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Today is 06/24/2025. Why can't this work?"
doc = nlp(text)

infixes = nlp.Defaults.infixes + [r"'"] + [r'(?<=[0-9]{2})(?:/)(?=[0-9]{2,4})']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# explain() shows which pattern produced each split,
# so we can check whether the new infixes are applied
nlp.tokenizer.explain(text)
```

Output:

```
[('TOKEN', 'Today'),
 ('TOKEN', 'is'),
 ('TOKEN', '06'),
 ('INFIX', '/'),
 ('TOKEN', '24'),
 ('INFIX', '/'),
 ('TOKEN', '2025'),
 ('SUFFIX', '.'),
 ('TOKEN', 'Why'),
 ('SPECIAL-1', 'ca'),
 ('SPECIAL-2', "n't"),
 ('TOKEN', 'this'),
 ('TOKEN', 'work'),
 ('SUFFIX', '?')]
```

Your implementation is correct; however, spaCy's `SPECIAL-` tokens (the tokenizer's special-case rules) take precedence over all other patterns. We have to modify the rules as well:

```python
new_rules = nlp.tokenizer.rules
del new_rules["Can't"]  # this is only for demonstration purposes
del new_rules["can't"]  # you'll likely modify the rules differently
del new_rules["'"]      # without this, the apostrophe shows up as SPECIAL-1, not INFIX
nlp.tokenizer.rules = new_rules
nlp.tokenizer.explain(text)
```

Now the final output:

```
[('TOKEN', 'Today'),
 ('TOKEN', 'is'),
 ('TOKEN', '06'),
 ('INFIX', '/'),
 ('TOKEN', '24'),
 ('INFIX', '/'),
 ('TOKEN', '2025'),
 ('SUFFIX', '.'),
 ('TOKEN', 'Why'),
 ('TOKEN', 'can'),
 ('INFIX', "'"),
 ('TOKEN', 't'),
 ('TOKEN', 'this'),
 ('TOKEN', 'work'),
 ('SUFFIX', '?')]
```

You will likely have to modify the rule set differently to achieve your desired behavior and to avoid negative side effects. Hope this helps explain why the update wasn't being applied!
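If the goal is only the date split, a less invasive sketch is to add just the slash pattern and leave the special-case rules alone, so contractions keep working. This uses `spacy.blank("en")` purely so no model download is needed; the English tokenizer defaults are the same as in `en_core_web_sm`.

```python
import spacy

# blank("en") uses the same English tokenizer defaults as en_core_web_sm
nlp = spacy.blank("en")

# add only the date-slash infix; the special-case rules for
# contractions like "can't" are left untouched
infixes = nlp.Defaults.infixes + [r'(?<=[0-9]{2})/(?=[0-9]{2,4})']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

doc = nlp("Today is 06/24/2025. Why can't this work?")
print([t.text for t in doc])
# ['Today', 'is', '06', '/', '24', '/', '2025', '.', 'Why', 'ca', "n't", 'this', 'work', '?']
```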
-
Title: Infixes Update Not Applying Properly to Tokenizer
Description
I tried updating the infix patterns in spaCy, but the changes are not applying correctly to the tokenizer. Specifically, I'm trying to modify how apostrophes and other symbols (`'`) are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.
Steps to Reproduce
Here are the two approaches I tried:
1️⃣ Removing apostrophe-related rules from `infixes` and recompiling. Issue: even after modifying the infix rules, contractions like `"can't"` still split incorrectly.
2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs).
Expected Behavior
Actual Behavior
Updates to `nlp.tokenizer.infix_finditer` do not seem to take effect.
Question
Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
Thanks for your help!