Tokenizing "Hello-world" #2
Comments
Which |
Tried with |
Can you run this UnitTest? |
The Unfortunately, it also split Nevertheless, your efforts are commendable. I think I am asking too much at this moment. |
Splitting digits with commas can be added easily. You are welcome to implement it and submit a PR. |
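This project is really practical, but there is very little documentation, and my English is not good either. How can I learn about it in more detail? I have it running successfully; I just do not know how to get the results I want. |
Please refer to the unit tests. |
Thanks, boss. Could you share your contact information? I have run all the methods in the unit tests and almost all of them work, but I do not know what functionality each one actually implements. My English is poor, so I cannot really guess. Also, I was not able to download the wordvec_enu.bin file. I have looked at a lot of NLP code, and yours is the most powerful, the most complete, and the best fit for me. I am eager to understand all of it, but without documentation I cannot work it out in a short time. |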
Hi All,
I am comparing the tokenization of the sentence Hello-world with other NLP libraries. I am just trying to get to know more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP-EnglishRuleBasedTokenizer?
CherubNLP
I get back a single token: Hello-world
OpenNLP
I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer and this gave me 3 tokens.
Google NLP
https://cloud.google.com/natural-language/
Google gives me 3 tokens.
nltk
nltk WordPunctTokenizer
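For reference, the 3-token split can be sketched in plain Python using the regular expression that nltk documents for WordPunctTokenizer (\w+|[^\w\s]+). This is only an illustration of the behaviour being asked about, not CherubNLP code; the helper name word_punct_tokenize is made up for this example.

```python
import re

# Pattern documented for nltk's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Hypothetical helper: split text into alphanumeric and punctuation runs."""
    return WORD_PUNCT.findall(text)


# Punctuation-aware split: the hyphen becomes its own token.
print(word_punct_tokenize("Hello-world"))  # ['Hello', '-', 'world']

# Whitespace-only split keeps the hyphenated word as a single token.
print("Hello-world".split())  # ['Hello-world']
```

The second call shows the single-token result that a purely whitespace-based split would give, which matches the single token reported above for CherubNLP; whether CherubNLP actually works that way is an assumption.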