When a newline character `\n` immediately follows a word with no space in between, the `\n` and the space after it (`"\n "`) are detected as a single token, tagged as SPACE, and skipped because `ignore_space_tokens=True` in some pipelines. As a result, after normalization the word before `\n` and the word after it are concatenated, and pipelines can no longer detect the word after. In the reproduction below, the `eds.diabetes()` pipeline does not detect the word "diabète", and the subsequent call to `get_text()` shows why.
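To make the mechanism concrete, here is a minimal pure-Python sketch of the concatenation. It is a simplified model and not EDS-NLP's actual implementation: the `tokens` list and the `join_tokens` helper are hypothetical. It assumes each token carries its own trailing whitespace and that text is rebuilt by joining the tokens that are kept.

```python
# Simplified model of the failure, NOT EDS-NLP's actual code.
# Each token is (text, trailing_whitespace, is_space_token).
tokens = [
    ("problematique", "", False),  # no trailing space: "\n" follows directly
    ("\n ", "", True),             # newline + space merged into one SPACE token
    ("diabete", " ", False),
    ("de", " ", False),
    ("type", " ", False),
    ("1", "", False),
]


def join_tokens(tokens, ignore_space_tokens):
    """Rebuild the text from tokens, optionally dropping SPACE tokens."""
    return "".join(
        text + ws
        for text, ws, is_space in tokens
        if not (ignore_space_tokens and is_space)
    )


kept = join_tokens(tokens, ignore_space_tokens=False)
skipped = join_tokens(tokens, ignore_space_tokens=True)
print(repr(kept))     # 'problematique\n diabete de type 1'
print(repr(skipped))  # 'problematiquediabete de type 1'
```

In this model, the only whitespace between the two words lives inside the SPACE token, so dropping that token leaves nothing to separate them.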
## How to reproduce the bug
    import edsnlp, edsnlp.pipes as eds

    txt = "problématique\n Diabète de type 1 depuis 5 ans chez une enfant de 7 ans"
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    nlp.add_pipe(eds.normalizer())
    nlp.add_pipe(eds.diabetes())
    doc = nlp(txt)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # --> Nothing in the terminal

    from edsnlp.utils.doc_to_text import get_text
    get_text(doc, attr="NORM", ignore_excluded=True, ignore_space_tokens=True)
    # --> 'problematiquediabete de type 1 depuis 5 ans chez une enfant de 7 ans'

## Your Environment

- Operating System: Windows 11
- Python Version Used: 3.10.16
- EDS-NLP Version Used: 0.17.1