Open
Description
Sentence segmentation doesn't seem to handle guillemets (« and »). When the text contains dialogue, I end up with very long sentences in which several sentences are merged together. I see an old PR that added this for German (https://github.com/explosion/spaCy/pull/4237/files) and some handling in functions like is_quote, but perhaps the segmenter doesn't take this into account somehow.
How to reproduce the behaviour
Process a document that uses guillemet quotes, like
Léa dit : « Bonjour ! Je suis Léa. Et toi ? » Marc répond : « Salut ! Je suis Marc. » Léa demande : « Où es-tu ? »
and check the values of the is_sent_start flags.
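A minimal sketch of that check, assuming the fr_core_news_sm pipeline from the list below is installed. `TEXT`, `sent_start_flags` and `main` are illustrative names, not part of spaCy, and other pipelines may segment the text differently.

```python
# Example text from this report, with guillemet-quoted dialogue.
TEXT = (
    "Léa dit : « Bonjour ! Je suis Léa. Et toi ? » "
    "Marc répond : « Salut ! Je suis Marc. » "
    "Léa demande : « Où es-tu ? »"
)


def sent_start_flags(doc):
    """Return (token text, is_sent_start) pairs for every token in doc."""
    return [(token.text, token.is_sent_start) for token in doc]


def main(model_name="fr_core_news_sm"):
    # Import and load lazily so a missing package or model gives a clear message.
    try:
        import spacy

        nlp = spacy.load(model_name)
    except (ImportError, OSError) as err:
        print(f"Could not load {model_name}: {err}")
        return

    doc = nlp(TEXT)

    # One line per token: the token text and its sentence-start flag.
    for text, flag in sent_start_flags(doc):
        print(f"{text!r:12} {flag}")

    # The resulting sentences, to see which ones were merged.
    print()
    for i, sent in enumerate(doc.sents):
        print(i, sent.text)


if __name__ == "__main__":
    main()
```

Passing another pipeline name to `main` (for example `main("fr_core_news_md")`) runs the same check against that model.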
Your Environment
- spaCy version: 3.8.7
- Platform: macOS-26.0.1-arm64-arm-64bit
- Python version: 3.12.11
- Pipelines: uk_core_news_md (3.8.0), pl_core_news_md (3.8.0), ca_core_news_md (3.8.0), it_core_news_md (3.8.0), ko_core_news_md (3.8.0), da_core_news_md (3.8.0), el_core_news_md (3.8.0), fr_core_news_md (3.8.0), en_core_web_md (3.8.0), es_core_news_md (3.8.0), fr_core_news_sm (3.8.0), ja_core_news_md (3.8.0), de_core_news_md (3.8.0), nl_core_news_md (3.8.0), sv_core_news_md (3.8.0), ro_core_news_md (3.8.0), pt_core_news_md (3.8.0), zh_core_web_md (3.8.0), fi_core_news_md (3.8.0), ru_core_news_md (3.8.0), hu_core_news_md (3.8.0)