sentence segmentation handling of guillemets

Sentence segmentation doesn't seem to handle guillemets (`«` / `»`). When a text contains dialogue, I end up with very long sentences that have been merged together. I see an old PR that added this for German (https://github.com/explosion/spaCy/pull/4237/files) and some handling in functions like `is_quote`, but perhaps the segmenter doesn't take this into account somehow.

## How to reproduce the behaviour

Process a document that uses guillemets as quotation marks, such as

> Léa dit : « Bonjour ! Je suis Léa. Et toi ? » Marc répond : « Salut ! Je suis Marc. » Léa demande : « Où es-tu ? »

and check the values of the `is_sent_start` flags.
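The steps above can be sketched as a short script. This is a minimal sketch, assuming the `fr_core_news_sm` pipeline from the environment list below is installed; the helper `sentence_starts` and the `main` wrapper are names made up for illustration and are not part of spaCy.

```python
# Sample text from this report: three quoted turns of dialogue.
TEXT = (
    "Léa dit : « Bonjour ! Je suis Léa. Et toi ? » "
    "Marc répond : « Salut ! Je suis Marc. » "
    "Léa demande : « Où es-tu ? »"
)


def sentence_starts(doc):
    """Return the text of every token flagged as a sentence start.

    Hypothetical helper: works on any iterable of objects that expose
    `text` and `is_sent_start`, such as a spaCy Doc.
    """
    return [token.text for token in doc if token.is_sent_start]


def main():
    try:
        import spacy

        # Assumption: any of the listed pipelines could be used here.
        nlp = spacy.load("fr_core_news_sm")
    except (ImportError, OSError):
        print("spaCy or fr_core_news_sm is not installed; skipping.")
        return

    doc = nlp(TEXT)

    # One line per predicted sentence, to make merged sentences visible.
    for i, sent in enumerate(doc.sents):
        print(i, repr(sent.text))

    # Tokens that the pipeline marked as sentence starts.
    print(sentence_starts(doc))


if __name__ == "__main__":
    main()
```

Swapping the pipeline name in `spacy.load` for one of the others listed below should show whether the behaviour is specific to one language.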

## Your Environment

- **spaCy version:** 3.8.7
- **Platform:** macOS-26.0.1-arm64-arm-64bit
- **Python version:** 3.12.11
- **Pipelines:** uk_core_news_md (3.8.0), pl_core_news_md (3.8.0), ca_core_news_md (3.8.0), it_core_news_md (3.8.0), ko_core_news_md (3.8.0), da_core_news_md (3.8.0), el_core_news_md (3.8.0), fr_core_news_md (3.8.0), en_core_web_md (3.8.0), es_core_news_md (3.8.0), fr_core_news_sm (3.8.0), ja_core_news_md (3.8.0), de_core_news_md (3.8.0), nl_core_news_md (3.8.0), sv_core_news_md (3.8.0), ro_core_news_md (3.8.0), pt_core_news_md (3.8.0), zh_core_web_md (3.8.0), fi_core_news_md (3.8.0), ru_core_news_md (3.8.0), hu_core_news_md (3.8.0)

