
Tokenizing "Hello-world" #2

Open
sdg002 opened this issue Sep 13, 2019 · 8 comments

Comments


sdg002 commented Sep 13, 2019

Hi All,
I am comparing the tokenization of the sentence "Hello-world" across other NLP libraries:

  1. OpenNLP
  2. Google Natural Language (Cloud)
  3. nltk(default)
  4. nltk(WordPunctTokenizer)

I am just trying to learn more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer?

CherubNLP

I get back a single token, "Hello-world".

OpenNLP

I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, which gives me 3 tokens:

  • "Hello"
  • "-"
  • "world"

Google NLP

https://cloud.google.com/natural-language/
Google gives me 3 tokens.


nltk

nltk.word_tokenize("Hello-world")
['Hello-world']

nltk WordPunctTokenizer

nltk.tokenize.WordPunctTokenizer().tokenize("Hello-world")
['Hello', '-', 'world']
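For reference, nltk's WordPunctTokenizer is a RegexpTokenizer built on the pattern \w+|[^\w\s]+ (runs of word characters, or runs of punctuation), so the hyphen split above can be reproduced with a plain regular expression:

```python
import re

# The pattern underlying WordPunctTokenizer: alternating runs of
# word characters (\w+) and non-word, non-space characters ([^\w\s]+).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORD_PUNCT.findall("Hello-world"))  # ['Hello', '-', 'world']
```

This is why WordPunctTokenizer splits on the hyphen while word_tokenize, which uses Treebank-style rules, leaves "Hello-world" intact.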
Oceania2018 (Member) commented:

Which tokenizer are you using, RegexTokenizer or TreebankTokenizer?
https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize

sdg002 (Author) commented Sep 13, 2019

I tried TreebankTokenizer.
RegexTokenizer throws an ArgumentNullException; I am probably not using it the right way.

@sdg002 sdg002 changed the title Tokening "Hello-world" Tokenizng "Hello-world" Sep 13, 2019

sdg002 (Author) commented Sep 16, 2019

The RegexTokenizer was able to parse "hello-world".

Unfortunately, it also split "50,000" in the sentence "this will cost 50,000" into "50" and "000".

Nevertheless, your efforts are commendable. I think I am asking too much at this moment.

Oceania2018 (Member) commented:

Splitting that keeps comma-grouped digits together can be added easily. You could do it and open a PR.
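One way to express such a rule, sketched here in plain Python regex rather than CherubNLP's actual API (the pattern below is a hypothetical illustration, not code from the library): match comma-grouped numbers before the general word alternative, so "50,000" survives as one token while hyphens still split.

```python
import re

# Hypothetical tokenization pattern. Order matters: the comma-grouped
# number alternative must come before \w+, otherwise "50,000" would be
# broken into "50" and "000" at the comma.
TOKEN = re.compile(r"\d{1,3}(?:,\d{3})+|\w+|[^\w\s]")

print(TOKEN.findall("this will cost 50,000"))  # ['this', 'will', 'cost', '50,000']
print(TOKEN.findall("Hello-world"))            # ['Hello', '-', 'world']
```

Plain numbers without commas still fall through to the \w+ alternative, so only the grouping behavior changes.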

13653415686 commented:

This project is very useful, but there is very little documentation, and my English is not good. How can I learn about it in detail? I have already got it running successfully; I just don't know how to achieve the result I want.

Oceania2018 (Member) commented:

Please refer to the unit tests.

13653415686 commented:

> Please refer to the unit tests.

Thanks, boss. Could you share a way to contact you? I ran all the methods in the unit tests and most of them work, but I don't know what each one actually implements. My English is not good, so I can't really infer it. Also, I could not download the wordvec_enu.bin file. I have looked at a lot of NLP code, and yours is the most powerful, most complete, and best suited to me. I am eager to understand all of it, but without documentation I cannot figure it out in a short time.
