English sentence segmentation misses splits, but finds them when run recursively #10398

kuchenrolle · 2022-02-28T22:16:03Z

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_trf')
text = "Had an ingrowing toenails at 16, suffered for a number of months, chopping and ripping it away. Blood drying on socks daily, very painful. Get appointment at the hospital to have them removed under a general. Have op on Friday go home no problems. Booked to have dressings changed by on following Monday. So gets to Doctors Monday and take socks off, nurse takes outer dressing off fine no issue, then she touches the main dressing slightly and i fucking scream in pain. The dressing is hessian gauze and has dried like super glue to my raw nail bed, I mean this is like a second skin, she touches the corner again and it feels like a million parallel bikini waxes.\n\nEdit will finish later phone is messing about\n\nEdit 2\n\nSo yeah I'm a guy, but I presume they hurt, so I'm in the doctors, the nurse has touched it slightly twice and I'm in agony, (I have compared this to the scene in Hostel when the guy comes round and is tied to the chair) the nurse walks back over slowly, bends down touches the gauze as light as a feather and fuck a surge of pain from being stabbed with some sort of half lava half broken glass half rusty nail weapon. By the third touch i'm soaked through with sweat, almost passing out, and she decides to soak my feet in warm water to loosen it up.\n\nAfter about 15 mins, I'm bricking myself, she holds on to a bit of a corner, and it is quite looser, a bit like pulling off a really sticky plaster, it stick hurts like fuck though as the nail bed is raw skin underneath. \n\nThe amount of sensitivity is off the chart, for example, 2 weeks this after this nightmare I tried to take a shower and the water drops hurt falling on it :|.\n\nThese weren't my only ingrown toenails either : / had 4 in total, but now I have to keep them too long so they never get that way again."
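# Segment the full text, then take the second-to-last segment, which came back as a single sentence.
sents = [sent.text for sent in nlp(text).sents]
multi_sentence = sents[-2]
# Run the pipeline again on that segment alone.
multi_sentence_sents = [sent.text for sent in nlp(multi_sentence).sents]
# Fails: the segment is split into more than one sentence the second time.
assert len(multi_sentence_sents) == 1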

Your Environment

spaCy version: 3.2.2
Platform: Linux-3.10.0-1160.24.1.el7.x86_64-x86_64-with-glibc2.17
Python version: 3.9.5
Pipelines: en_core_web_trf (3.2.0)

Expected Behaviour

I expected sentence segmentation to find all sentence boundaries and to be consistent, such that a span that has been segmented as one whole sentence will not be segmented as multiple sentences when run through sentence segmentation again.

Answered by adrianeboyd

adrianeboyd · 2022-03-01T11:54:19Z

The sentence segmentation is performed by the statistical parser in this pipeline, so if you provide different input texts, you may get different sentence segmentation even if one text is a substring of the other.

However, if you have any Doc object with annotated sentence boundaries for all tokens (all tokens have token.is_sent_start set to True or False) and you run it through the parser, the parser should preserve the existing sentence boundaries. If there's an existing parse, it might be modified somewhat within a sentence, but the sentence boundaries should stay the same.

doc = nlp(text)
sent_docs = [sent.as_doc() for sent in doc.sents]
for sent_doc2 in nlp.pipe(sent_docs):
    assert len(list(sent_doc2.sents)) == 1

(As a note, Span.as_doc() is something you only want to use in case you really need Doc objects vs. working with sentences as Span objects, which will be faster and fine for most tasks. It's just useful for this particular demo where speed isn't important.)
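
For reference, here is a minimal sketch of what "annotated sentence boundaries for all tokens" looks like, and of working with sentences as Span objects. It rests on some assumptions that are not from the thread: a blank English pipeline stands in for en_core_web_trf so that it runs without a model download, the words and sent_starts values are made-up toy input, and the parser step itself is not run here.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_trf so the sketch runs without downloading a model.
nlp = spacy.blank("en")

# Hypothetical toy input with a sentence-start flag for every token.
words = ["Have", "op", "on", "Friday", ".", "Go", "home", "."]
sent_starts = [True, False, False, False, False, True, False, False]
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)

# True only if every token has is_sent_start set to True or False.
fully_annotated = doc.has_annotation("SENT_START", require_complete=True)

# Each sentence is a Span, i.e. a view onto the original Doc.
sentences = list(doc.sents)
token_ranges = [(sent.start, sent.end) for sent in sentences]

# Span.as_doc() copies one sentence into a standalone Doc,
# for the cases where a Doc object is really needed.
first_as_doc = sentences[0].as_doc()
```

With a trained pipeline you would get doc from nlp(text) instead of building it by hand. The Span-based loop stays the same, and as_doc() is only needed when a component has to receive a Doc.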
