English sentence segmentation misses splits, but finds them when run recursively #10398

The sentence segmentation is performed by the statistical parser in this pipeline, so if you provide different input texts, you may get different sentence segmentation even if one text is a substring of the other.

However, if you have a Doc object with sentence boundaries annotated for all tokens (every token has token.is_sent_start set to True or False) and you run it through the parser, the parser should preserve the existing sentence boundaries. If the Doc already has a parse, that parse may be modified somewhat within each sentence, but the sentence boundaries should stay the same.

doc = nlp(text)
sent_docs = [sent.as_doc() for sent in doc.sents]
for sent_doc2 in nlp.pipe(sent_docs):
    a…
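
The setup described above (per-sentence Docs whose tokens all carry is_sent_start) can be tried without a trained model. The following is a minimal sketch, not the code from the truncated snippet: the helper name sentence_counts_after_rerun is hypothetical, the boundaries are set by hand, and the blank pipeline has no parser, so it only checks that the annotation survives Span.as_doc() and a second pass through nlp.pipe. It assumes spaCy 3.2 or later, where nlp.pipe accepts Doc objects. To observe the parser behaviour itself, pass in a trained English pipeline instead of the blank one.

```python
import spacy
from spacy.tokens import Doc


def sentence_counts_after_rerun(nlp, doc):
    """Split `doc` into one Doc per sentence, run each through `nlp` again,
    and return how many sentences each re-processed Doc contains.

    Hypothetical helper for illustration; not part of spaCy.
    """
    sent_docs = [sent.as_doc() for sent in doc.sents]
    return [len(list(d.sents)) for d in nlp.pipe(sent_docs)]


# Assumption: a blank pipeline stands in for a trained one so that the
# sketch runs without downloading a model. It contains no parser.
nlp = spacy.blank("en")

words = ["This", "is", "one", ".", "This", "is", "two", "."]
doc = Doc(nlp.vocab, words=words)

# Annotate every token explicitly: True at sentence starts, False elsewhere.
for i, token in enumerate(doc):
    token.is_sent_start = i in (0, 4)

# True only if no token is left with an unset (None) boundary value.
complete = doc.has_annotation("SENT_START", require_complete=True)

counts = sentence_counts_after_rerun(nlp, doc)
print(complete, counts)
```

With a trained pipeline, a count other than 1 for any per-sentence Doc would mean the boundaries were not preserved on the second pass.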

Replies: 2 comments

Answer selected by polm
Labels
feat / parser (Feature: Dependency Parser), feat / pipeline (Feature: Processing pipeline and components), feat / doc (Feature: Doc, Span and Token objects)
2 participants
This discussion was converted from issue #10397 on March 01, 2022 11:55.