Sentence splitting of sentences without whitespace after period #38

Open
@oxinabox

Description

@oxinabox
Member

julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
" This is a sentence.Laugh Out Loud."
"Keep coding."
"No."
"Yes!"
"True!"
"ohh!ya!"
"me too."
I observed that a sentence with no space after the delimiter (which is admittedly grammatically incorrect) is not split into two separate sentences (e.g. "This is a sentence.Laugh Out Loud." and "ohh!ya!"). Should this be considered an issue?

Originally posted by @RohitPingale in #32 (comment)

Activity

RohitPingale commented on Oct 14, 2019

>>> from nltk.tokenize import sent_tokenize
>>> text = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "
>>> sent_tokenize(text)
[' This is a sentence.Laugh Out Loud.', 'Keep coding.', 'No.', 'Yes!', 'True!', 'ohh!ya!', 'me too.']
I tried the same example in Python with NLTK, and it gives the same output. Should we treat that as the benchmark, or should we split those sentences anyway?
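For illustration, here is a minimal sketch of what "splitting anyway" could look like. It is not how NLTK or WordTokenizers.jl work internally; the function name `split_sentences_loose` and both regular expressions are hypothetical, chosen only to show the idea. It assumes a sentence boundary also exists wherever `.`, `!` or `?` is followed directly by an ASCII letter.

```python
import re


def split_sentences_loose(text):
    """Hypothetical splitter: also breaks at '.', '!' or '?' when a letter
    follows immediately, with no whitespace in between."""
    # Step 1: insert a space after terminal punctuation that is glued to a letter.
    # Digits are deliberately excluded so that decimals such as "3.14" survive.
    normalized = re.sub(r"([.!?])(?=[A-Za-z])", r"\1 ", text)
    # Step 2: split on whitespace that directly follows terminal punctuation.
    parts = re.split(r"(?<=[.!?])\s+", normalized.strip())
    # Drop empty pieces (for example when the input is empty or only whitespace).
    return [p for p in parts if p]


text = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "
for sentence in split_sentences_loose(text):
    print(repr(sentence))
```

With this rule the example yields nine sentences instead of seven, because "This is a sentence.Laugh Out Loud." and "ohh!ya!" are each split in two. The trade-off is that abbreviations written without spaces, such as "e.g." or "U.S.A.", would also be split, which may be one reason existing tokenizers keep such text together.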

oxinabox commented on Oct 14, 2019

@ninjin thoughts?

Sentence splitting of sentences without whitespace after period · Issue #38 · JuliaText/WordTokenizers.jl