Rake.split_sentences(text) uses 'u' as separator #30

Closed
xyutech opened this issue Nov 10, 2017 · 3 comments

@xyutech

xyutech commented Nov 10, 2017

Hello,
I ran into an issue where the split_sentences(text) function uses 'u' as a separator. For instance:
text: "is an incredibly popular library and for good reason it s powerful fast"
sentences list: [u'is an incredibly pop', u'lar library and for good reason it s powerf', u'l fast']
I can certainly fix it in my environment, but I wonder what I did wrong and why nobody has run into this issue before.
My environment is Python 2.7; python-rake is installed with pip.

@jkterry1
Collaborator

That just means the strings are being represented as unicode strings. In Python 2.7, '' is a byte (ASCII) string and u'' is a unicode string. They work the same as normal strings; details here:
https://docs.python.org/2/howto/unicode.html

That idiosyncrasy is one of the things cleaned up in Python 3.x, by the way, and one major reason Python 3 is recommended over Python 2.7. I used unicode strings specifically because they're more robust and notably support more languages, and this is a multilingual library. Tell me if these are actually causing problems for you, but they shouldn't. Closed.

@xyutech
Author

xyutech commented Nov 11, 2017

Thank you for your reply.
Let me add some more info to make sure that we are on the same page. I was not asking about the notation
u'is an incredibly pop'
That is clear. My issue is that the input string was split on the letter 'u'. So the input is:
is an incredibly popular library and for good reason it s powerful fast
and the resulting split is
is an incredibly pop | lar library and for good reason it s powerf | l fast

@klockeph
Contributor

Got the same problem: 'restaurant' is being split into 'resta' and 'rant'...

fabianvf pushed a commit that referenced this issue Nov 20, 2017
The regex string is not a unicode string, so the \u... escape sequences have unexpected behaviour.
Just try split_sentences("restaurant"): it returns ["resta", "rant"], which is obviously bad.

Adding a u prefix to the regex string forces Python to interpret it as unicode and fixes this issue.

Tested with python2.7
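
The effect described in the commit message can be sketched roughly as follows. This is an illustration, not the library's code: the two delimiter patterns are simplified, hypothetical stand-ins, and the code points U+2019 and U+2013 are assumed examples of the `\u...` escapes involved. Because Python 3 resolves `\u` escapes itself, the "degraded" pattern spells out what Python 2.7's `re` effectively saw when the pattern was a plain byte string: the letter `u` followed by digits, all treated as ordinary characters of the set.

```python
import re

# Hypothetical, simplified delimiter sets (NOT the library's exact pattern).

# What a non-unicode pattern effectively became under Python 2.7:
# "\u2019" was never turned into a code point, so the character class
# contained the literal characters 'u', '2', '0', '1', '9' instead.
degraded = re.compile("[.!?,;:u2019u2013]")

# What was intended: the code points U+2019 and U+2013 themselves.
intended = re.compile("[.!?,;:\u2019\u2013]")


def split_sentences(text, delimiters):
    """Split text on every match of the given delimiter pattern."""
    return delimiters.split(text)


print(split_sentences("restaurant", degraded))  # ['resta', 'rant']
print(split_sentences("restaurant", intended))  # ['restaurant']
```

With the degraded set, any `u` in the input (and any of the digits 0, 1, 2, 3, 9) acts as a sentence boundary, which matches the splits reported above.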