Space tokens are skipped when \n directly follows a word without a space, so pipelines fail to detect some features. #422

@ilyesdegardin

Description

When the newline character \n immediately follows a word, with no space in between, the \n and the space after it ("\n ") are tokenized as a single token tagged SPACE. That token is then skipped because ignore_space_tokens=True in some pipelines. As a result, after normalization the word before \n and the word after it are concatenated, and pipelines can no longer detect the second word. In the example below, the eds.diabetes() pipeline does not detect the word "Diabète", and the get_text() call that follows shows why.

How to reproduce the bug

import edsnlp
import edsnlp.pipes as eds

# "\n" directly follows "problématique", with no space in between
txt = "problématique\n Diabète de type 1 depuis 5 ans chez une enfant de 7 ans"

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.diabetes())

doc = nlp(txt)

# Expected: a diabetes entity; observed: the loop prints nothing
for ent in doc.ents:
    print(ent.text, ent.label_)

--> Nothing is printed in the terminal

from edsnlp.utils.doc_to_text import get_text

get_text(doc, attr="NORM", ignore_excluded=True, ignore_space_tokens=True)

--> 'problematiquediabete de type 1 depuis 5 ans chez une enfant de 7 ans'
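The concatenation above can be reproduced without EDS-NLP using a simplified model of the tokens. The sketch below is not the library's implementation: the `Token` class, the `join_tokens` function, and the `keep_boundary` flag are hypothetical names introduced for illustration, and the token layout (a word with no trailing whitespace, followed by a "\n " SPACE token) is an assumption based on the description in this issue. It shows why dropping the SPACE token glues the two words together, and one possible way to preserve the word boundary.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a token: text, trailing whitespace, SPACE flag."""
    text: str
    whitespace: str
    is_space: bool = False


def join_tokens(tokens, ignore_space_tokens=True, keep_boundary=False):
    """Rebuild a string from tokens, optionally skipping SPACE tokens.

    keep_boundary is a hypothetical option: when a SPACE token is skipped and
    the text so far does not end with a space, insert one so that the
    neighbouring words stay separated.
    """
    parts = []
    for tok in tokens:
        if ignore_space_tokens and tok.is_space:
            if keep_boundary and parts and not parts[-1].endswith(" "):
                parts.append(" ")
            continue
        parts.append(tok.text + tok.whitespace)
    return "".join(parts)


# Assumed layout: "problematique" has no trailing whitespace because the
# "\n " that follows it is a separate SPACE token.
tokens = [
    Token("problematique", ""),
    Token("\n ", "", is_space=True),
    Token("diabete", " "),
    Token("de", " "),
    Token("type", " "),
    Token("1", ""),
]

buggy = join_tokens(tokens)
fixed = join_tokens(tokens, keep_boundary=True)
print(repr(buggy))
print(repr(fixed))
```

In this model the only separator between the two words lives inside the skipped SPACE token, so removing it leaves nothing between them. Whether a fix belongs in the tokenizer, in get_text(), or in the matchers is a design question for the maintainers.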

Your Environment

- Operating System: Windows 11
- Python Version Used: 3.10.16
- EDS-NLP Version Used: 0.17.1
