julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
" This is a sentence.Laugh Out Loud."
"Keep coding."
"No."
"Yes!"
"True!"
"ohh!ya!"
"me too."
I observed that a sentence with no space after the delimiter (which is, of course, grammatically incorrect) is not split into two separate sentences (e.g. ".Laugh Out Loud." and "ohh!ya!"). Can this be considered an issue?
Originally posted by @RohitPingale in #32 (comment)
RohitPingale commented on Oct 14, 2019
>>> from nltk.tokenize import sent_tokenize
>>> text = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "
>>> sent_tokenize(text)
[' This is a sentence.Laugh Out Loud.', 'Keep coding.', 'No.', 'Yes!', 'True!', 'ohh!ya!', 'me too.']
I tried the same example in Python and it gives the same output. Should we treat this behavior as the benchmark, or should we split those sentences anyway?
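For comparison, a naive regex-based splitter (just an illustrative sketch, not part of WordTokenizers or NLTK) would break on the terminal punctuation even when no space follows, at the cost of also splitting things like abbreviations:

```python
import re

text = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "

# Grab runs of non-terminal characters followed by one or more
# sentence-ending marks; a following space is not required.
parts = [m.strip() for m in re.findall(r'[^.!?]+[.!?]+', text)]
print(parts)
# → ['This is a sentence.', 'Laugh Out Loud.', 'Keep coding.', 'No.',
#    'Yes!', 'True!', 'ohh!', 'ya!', 'me too.']
```

Note this splits "ohh!ya!" into two sentences, which may or may not be the desired behavior, and it would wrongly split "Dr. Smith" or "e.g." as well, which is presumably why trained tokenizers like Punkt require the trailing space as evidence of a boundary.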
oxinabox commented on Oct 14, 2019
@ninjin thoughts?