Support request: tokenization #13
It would be good to make the pipeline more independent.
Yes I do. I usually use org.apache.lucene.analysis for various languages.
Great to know. Does the current pipeline actually get something out that is not garbage for those languages? (I've only played with a few of the most notable European languages.)
I'll go check, but for languages that are a mix of Asian and English (e.g., Wikipedia), the smart Chinese tokenizer from Lucene usually works well, and it's pretty fast and scalable.
@nick-magnini any recommendation on which tokenizer to use for this particular task:
Based on my experience, for wiki pages, since they are a mix of English and Chinese, SmartChineseAnalyzer works better. In addition, Jieba is one of the best Chinese segmenters (tokenizers). For converting Chinese between traditional and simplified script and vice versa, OpenCC is recommended.
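For anyone following along, here is a minimal sketch of tokenizing mixed Chinese/English text with Lucene's SmartChineseAnalyzer. It assumes the lucene-analyzers-smartcn artifact is on the classpath (the constructor signature varies by Lucene version; recent versions take no arguments); the field name and sample sentence are placeholders:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SmartChineseDemo {
    // Run the analyzer over a string and collect the emitted tokens.
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SmartChineseAnalyzer();
        // Mixed Chinese/English input, as found on zh.wikipedia.org pages.
        System.out.println(tokenize(analyzer, "Lucene是一个开源的全文检索工具包"));
    }
}
```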
So then the best options are:
I haven't done much with Asian languages, but you are suggesting the pipeline to be:
Now the choice is between Lucene and Jieba. In terms of scalability and efficiency, I'll vote for Lucene since it has CJK support.
Alright, let's start with tokenization, so I'm going to run the tool on
@nick-magnini generating the model now. Here is a sample of the tokenization using SmartChineseAnalyzer. It would be worth knowing if it looks alright:
Training: dimensions: 300, min threshold: 10, window: 10
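For readers who want to reproduce a comparable setup, here is a sketch of those same hyperparameters using deeplearning4j's Word2Vec. This is not the tool used in this thread, just an illustration; the corpus path is a placeholder, and the corpus is assumed to be pre-tokenized with one sentence per line:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class TrainWord2Vec {
    public static void main(String[] args) throws Exception {
        // Placeholder path: one pre-tokenized (whitespace-separated) sentence per line.
        SentenceIterator sentences = new BasicLineIterator("tokenized-wiki.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory(); // whitespace split

        Word2Vec vec = new Word2Vec.Builder()
                .layerSize(300)       // dimensions: 300
                .minWordFrequency(10) // min threshold: 10
                .windowSize(10)       // window: 10
                .iterate(sentences)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();
    }
}
```

Once trained, a call like `vec.wordsNearest("北京", 10)` gives the kind of quick entity-similarity sanity check mentioned in the next comment.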
@nick-magnini the model is trained, and some basic entity-similarity examples give what seem to be good results.
Since it looks like you are trying to build models with several tools, I will share the corpus plus the model.
Sorry for jumping so late into this discussion, but it might be a good call to implement something more generic, no? The good thing about using Lucene Analyzers is that you could just use the analyzer for the corresponding locale and the job would be done. This would work for Chinese, but also for Czech and other languages.
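A hypothetical sketch of that generic approach, mapping an ISO language code to a matching Lucene analyzer and falling back to StandardAnalyzer. The analyzer classes are real Lucene analyzers (each lives in its own analysis module); the factory itself is made up for illustration:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public final class AnalyzerForLocale {
    // Choose a language-aware analyzer; fall back to the generic StandardAnalyzer.
    public static Analyzer forLanguage(String isoCode) {
        switch (isoCode) {
            case "zh": return new SmartChineseAnalyzer();
            case "cs": return new CzechAnalyzer();
            case "en": return new EnglishAnalyzer();
            default:   return new StandardAnalyzer();
        }
    }
}
```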
@nick-magnini any chance you can evaluate the generated model before I jump into a refactor?
@nick-magnini any news on reviewing the given branch? Otherwise I will close this issue.
Thanks. Let me take a look and explore. Thanks again.
If you are a Chinese speaker and could generate a dataset similar to https://github.com/arfon/word2vec/blob/master/questions-words.txt, that would be great.
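For context, the linked English file consists of section headers followed by four-token analogy lines (a is to b as c is to d); a Chinese version would follow the same layout:

```
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
```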
It would be great to add org.apache.lucene.analysis for smarter tokenization across all languages. That way, processing other languages such as Chinese would be more sensible with your library.