anntoconll : fix word with accents being split by tokenization #1307

Marny30 · 2019-02-08T11:30:07Z

In the current version, the anntoconll tool splits a word containing accented characters into several tokens, isolating each accented character as if it were a word of its own. This is a problem when working with European languages such as French or Spanish, or with German words containing ß.

For example, the text "déjà fait" is split into the tokens "d", "é", "j", "à", "fait" instead of "déjà", "fait".

Adding the accented-character range À-ÿ to the tokenization regex fixes this: accented words are no longer split.
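To illustrate the effect, here is a minimal sketch of the before and after behaviour. The two patterns and the `tokenize` helper are assumptions written for this example, not a copy of the anntoconll source; only the À-ÿ range comes from this PR. It also assumes the text uses precomposed characters (NFC), since a decomposed "é" (an "e" followed by a combining accent) falls outside that range.

```python
import re

# Hypothetical patterns for illustration; the real regex in anntoconll may differ.
# Each one matches either a run of "word" characters or a single other character.
ASCII_ONLY = re.compile(r'([0-9a-zA-Z]+|[^0-9a-zA-Z])')
WITH_ACCENTS = re.compile(r'([0-9a-zA-ZÀ-ÿ]+|[^0-9a-zA-ZÀ-ÿ])')


def tokenize(text, pattern):
    """Split text on the pattern, dropping empty and whitespace-only pieces."""
    return [tok for tok in pattern.split(text) if tok.strip()]


print(tokenize("déjà fait", ASCII_ONLY))    # ['d', 'é', 'j', 'à', 'fait']
print(tokenize("déjà fait", WITH_ACCENTS))  # ['déjà', 'fait']
```

Note that À-ÿ covers U+00C0 to U+00FF, which includes ß but also the symbols × and ÷, so those two would be treated as word characters under this sketch.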

fix word with accents being split by tokenization

af2e57d

Marny30 changed the title ~~fix word with accents being split by tokenization~~ anntoconnl : fix word with accents being split by tokenization Feb 8, 2019

Marny30 changed the title ~~anntoconnl : fix word with accents being split by tokenization~~ anntoconll : fix word with accents being split by tokenization Feb 8, 2019
