Infixes Update Not Applying Properly to Tokenizer #13779
Unanswered
Rayan-Allali
asked this question in Help: Coding & Implementations
Replies: 1 comment
-
I reproduced your issue here using `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Today is 06/24/2025. Why can't this work?"
doc = nlp(text)

infixes = nlp.Defaults.infixes + [r"'"] + [r'(?<=[0-9]{2})(?:/)(?=[0-9]{2,4})']
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# explain() shows which pattern produced each split,
# so we can check whether the new infixes are applied
nlp.tokenizer.explain(text)
```

Output:

```
[('TOKEN', 'Today'),
 ('TOKEN', 'is'),
 ('TOKEN', '06'),
 ('INFIX', '/'),
 ('TOKEN', '24'),
 ('INFIX', '/'),
 ('TOKEN', '2025'),
 ('SUFFIX', '.'),
 ('TOKEN', 'Why'),
 ('SPECIAL-1', 'ca'),
 ('SPECIAL-2', "n't"),
 ('TOKEN', 'this'),
 ('TOKEN', 'work'),
 ('SUFFIX', '?')]
```

Your implementation is correct; however, spaCy's `SPECIAL-` tokens (the tokenizer's special-case rules) take precedence over all other patterns. We have to modify the rules as well:

```python
new_rules = nlp.tokenizer.rules
del new_rules["Can't"]  # this is only for demonstration purposes
del new_rules["can't"]  # you'll likely modify the rules differently
del new_rules["'"]      # without this, the apostrophe shows up as SPECIAL-1, not INFIX
nlp.tokenizer.rules = new_rules
nlp.tokenizer.explain(text)
```

Now the final output:

```
[('TOKEN', 'Today'),
 ('TOKEN', 'is'),
 ('TOKEN', '06'),
 ('INFIX', '/'),
 ('TOKEN', '24'),
 ('INFIX', '/'),
 ('TOKEN', '2025'),
 ('SUFFIX', '.'),
 ('TOKEN', 'Why'),
 ('TOKEN', 'can'),
 ('INFIX', "'"),
 ('TOKEN', 't'),
 ('TOKEN', 'this'),
 ('TOKEN', 'work'),
 ('SUFFIX', '?')]
```

You will likely have to modify the rule set differently to achieve your desired behavior and to avoid negative side effects. Hope this helps explain why the update wasn't being applied!
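If the goal is only the date split, a less invasive sketch is to add just the slash pattern and leave the special-case rules alone, so contractions keep working. This uses `spacy.blank("en")` purely so no model download is needed; the English tokenizer defaults are the same as in `en_core_web_sm`.

```python
import spacy

# blank("en") uses the same English tokenizer defaults as en_core_web_sm
nlp = spacy.blank("en")

# add only the date-slash infix; the special-case rules for
# contractions like "can't" are left untouched
infixes = nlp.Defaults.infixes + [r'(?<=[0-9]{2})/(?=[0-9]{2,4})']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

doc = nlp("Today is 06/24/2025. Why can't this work?")
print([t.text for t in doc])
# ['Today', 'is', '06', '/', '24', '/', '2025', '.', 'Why', 'ca', "n't", 'this', 'work', '?']
```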
-
Title: Infixes Update Not Applying Properly to Tokenizer
Description
I tried updating the infix patterns in spaCy, but the changes are not applying correctly to the tokenizer. Specifically, I'm trying to modify how apostrophes and other symbols (`'`) are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.
Steps to Reproduce
Here are the two approaches I tried:
1️⃣ Removing apostrophe-related rules from `infixes` and recompiling. Issue: even after modifying the infix rules, contractions like `"can't"` still split incorrectly.
2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs).
Expected Behavior
Actual Behavior
Updates to `nlp.tokenizer.infix_finditer` do not seem to take effect.
Question
Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
Thanks for your help!