English sentence segmentation misses splits, but finds them when run recursively #10398
How to reproduce the behaviour

```python
import spacy

nlp = spacy.load('en_core_web_trf')
text = "Had an ingrowing toenails at 16, suffered for a number of months, chopping and ripping it away. Blood drying on socks daily, very painful. Get appointment at the hospital to have them removed under a general. Have op on Friday go home no problems. Booked to have dressings changed by on following Monday. So gets to Doctors Monday and take socks off, nurse takes outer dressing off fine no issue, then she touches the main dressing slightly and i fucking scream in pain. The dressing is hessian gauze and has dried like super glue to my raw nail bed, I mean this is like a second skin, she touches the corner again and it feels like a million parallel bikini waxes.\n\nEdit will finish later phone is messing about\n\nEdit 2\n\nSo yeah I'm a guy, but I presume they hurt, so I'm in the doctors, the nurse has touched it slightly twice and I'm in agony, (I have compared this to the scene in Hostel when the guy comes round and is tied to the chair) the nurse walks back over slowly, bends down touches the gauze as light as a feather and fuck a surge of pain from being stabbed with some sort of half lava half broken glass half rusty nail weapon. By the third touch i'm soaked through with sweat, almost passing out, and she decides to soak my feet in warm water to loosen it up.\n\nAfter about 15 mins, I'm bricking myself, she holds on to a bit of a corner, and it is quite looser, a bit like pulling off a really sticky plaster, it stick hurts like fuck though as the nail bed is raw skin underneath. \n\nThe amount of sensitivity is off the chart, for example, 2 weeks this after this nightmare I tried to take a shower and the water drops hurt falling on it :|.\n\nThese weren't my only ingrown toenails either : / had 4 in total, but now I have to keep them too long so they never get that way again."
sents = [sent.text for sent in nlp(text).sents]
multi_sentence = sents[-2]
multi_sentence_sents = [sent.text for sent in nlp(multi_sentence).sents]
# Fails: the span segmented as one sentence splits further when parsed on its own
assert len(multi_sentence_sents) == 1
```

Your Environment
Expected Behaviour

I expected sentence segmentation to find all sentence boundaries and to be consistent: a span that has been segmented as one whole sentence should not be segmented as multiple sentences when it is run through sentence segmentation again.
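The consistency property described here can be checked for any segmenter with a small helper. The sketch below is library-agnostic and is not part of spaCy: `find_unstable_sentences` and `toy_segment` are hypothetical names, and the toy segmenter is only a stand-in for a context-sensitive model, so the check can be shown without loading a pipeline.

```python
import re
from typing import Callable, List, Tuple

# A segmenter is any callable that maps a text to a list of sentence strings.
Segmenter = Callable[[str], List[str]]


def find_unstable_sentences(
    segment: Segmenter, text: str
) -> List[Tuple[str, List[str]]]:
    """Return (sentence, re_split) pairs for every sentence that is split
    into more than one piece when it is segmented again on its own."""
    unstable = []
    for sentence in segment(text):
        re_split = segment(sentence)
        if len(re_split) != 1:
            unstable.append((sentence, re_split))
    return unstable


def toy_segment(text: str) -> List[str]:
    """Hypothetical, purely illustrative segmenter whose behaviour depends on
    the whole input: long inputs only split before an uppercase letter,
    short inputs split after every full stop."""
    if len(text) > 60:
        pattern = r"(?<=\.)\s+(?=[A-Z])"
    else:
        pattern = r"(?<=\.)\s+"
    return [part for part in re.split(pattern, text) if part]


sample = (
    "I went home. it was late. "
    "Then I slept well and woke up early the next day."
)
print(find_unstable_sentences(toy_segment, sample))
# [('I went home. it was late.', ['I went home.', 'it was late.'])]
```

With a loaded pipeline, the same helper could be called as `find_unstable_sentences(lambda t: [s.text for s in nlp(t).sents], text)`, assuming `nlp` and `text` are defined as in the reproduction above.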
Replies: 2 comments
The sentence segmentation is performed by the statistical parser in this pipeline, so if you provide different input texts, you may get different sentence segmentation even if one text is a substring of the other.

However, if you have any `Doc` object with annotated sentence boundaries for all tokens (all tokens have `token.is_sent_start` set to `True` or `False`) and you run it through the parser, the parser should preserve the existing sentence boundaries. If there is an existing parse, it might be modified somewhat within the sentence, but the sentence boundaries should stay the same.

```python
doc = nlp(text)
# Each sentence becomes its own Doc and keeps its sentence-boundary annotation
sent_docs = [sent.as_doc() for sent in doc.sents]
for sent_doc2 in nlp.pipe(sent_docs):
    assert len(list(sent_doc2.sents)) == 1
```