
Tokenizing "Hello-world" #2

Open
sdg002 opened this issue Sep 13, 2019 · 8 comments

Comments


sdg002 commented Sep 13, 2019

Hi All,
I am comparing the tokenization of the sentence "Hello-world" across other NLP libraries:

  1. OpenNLP
  2. Google Natural Language (Cloud)
  3. nltk(default)
  4. nltk(WordPunctTokenizer)

I am just trying to learn more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer?

CherubNLP

I get back a single token, "Hello-world".

OpenNLP

I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, which gives me 3 tokens:

  • "Hello"
  • "-"
  • "world"

Google NLP

https://cloud.google.com/natural-language/
Google gives me 3 tokens.


nltk

nltk.word_tokenize("Hello-world")
['Hello-world']

nltk WordPunctTokenizer

nltk.tokenize.WordPunctTokenizer().tokenize("Hello-world")
['Hello', '-', 'world']
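For reference, nltk's WordPunctTokenizer is a RegexpTokenizer built on the pattern \w+|[^\w\s]+ (runs of word characters, or runs of punctuation), so the hyphen split above can be reproduced with a plain regular expression:

```python
import re

# The pattern underlying WordPunctTokenizer: alternating runs of
# word characters (\w+) and non-word, non-space characters ([^\w\s]+).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORD_PUNCT.findall("Hello-world"))  # ['Hello', '-', 'world']
```

This is why WordPunctTokenizer splits on the hyphen while word_tokenize, which uses Treebank-style rules, leaves "Hello-world" intact.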
Oceania2018 (Member) commented:

Which tokenizer are you using, RegexTokenizer or TreebankTokenizer?
https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize

sdg002 (Author) commented Sep 13, 2019

I tried TreebankTokenizer.
RegexTokenizer throws an ArgumentNullException; I am probably not using it the right way.

@sdg002 sdg002 changed the title Tokening "Hello-world" Tokenizng "Hello-world" Sep 13, 2019

sdg002 (Author) commented Sep 16, 2019

The RegexTokenizer was able to parse "hello-world".

Unfortunately, it also split "50,000" in the sentence "this will cost 50,000" into "50" and "000".

Nevertheless, your efforts are commendable. I think I am asking too much at this moment.

Oceania2018 (Member) commented:

Splitting that keeps comma-grouped digits together can be added easily. You could do it and open a PR.
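One way to express such a rule, sketched here in plain Python regex rather than CherubNLP's actual API (the pattern below is a hypothetical illustration, not code from the library): match comma-grouped numbers before the general word alternative, so "50,000" survives as one token while hyphens still split.

```python
import re

# Hypothetical tokenization pattern. Order matters: the comma-grouped
# number alternative must come before \w+, otherwise "50,000" would be
# broken into "50" and "000" at the comma.
TOKEN = re.compile(r"\d{1,3}(?:,\d{3})+|\w+|[^\w\s]")

print(TOKEN.findall("this will cost 50,000"))  # ['this', 'will', 'cost', '50,000']
print(TOKEN.findall("Hello-world"))            # ['Hello', '-', 'world']
```

Plain numbers without commas still fall through to the \w+ alternative, so only the grouping behavior changes.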

13653415686 commented:

This project is very useful, but there is very little documentation, and my English is not good. How can I learn about it in detail? I have already got it running successfully; I just don't know how to achieve the result I want.

Oceania2018 (Member) commented:

Please refer to the unit tests.

13653415686 commented:

> Please refer to the unit tests.

Thanks, boss. Could you share a way to contact you? I ran all the methods in the unit tests and most of them work, but I don't know what each one actually implements. My English is not good, so I can't really infer it. Also, I could not download the wordvec_enu.bin file. I have looked at a lot of NLP code, and yours is the most powerful, most complete, and best suited to me. I am eager to understand all of it, but without documentation I cannot figure it out in a short time.
