This project already supports Chinese, Japanese, and Korean typography, but there is still a deficiency in full-text search: it cannot process Japanese and Korean sentences when indexing the search database.
I have researched some points of Japanese text processing, and its features are completely different from both Latin-script languages and Chinese:
1. Japanese words have two writing forms, kanji and kana, so the same word effectively needs to be indexed twice. For example, "Japanese" has the two forms 「にほんご」 and 「日本語」.
2. Japanese words have more tense and morphological inflections.
3. Some Japanese sentences written purely in Chinese characters also follow Chinese phrasing, so it is hard to tell whether they are Japanese or Chinese sentences.
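As a quick probe of the first two points, here is a small sketch assuming kuromoji.js with its bundled IPADIC dictionary: each token it produces carries a katakana reading (which bridges kanji and kana spellings) and a dictionary form (which normalizes inflections). The example values in the comments are what I would expect, not verified output:

```ts
import kuromoji from "kuromoji";

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    for (const token of tokenizer.tokenize("ご飯を食べました")) {
      // e.g. surface_form "食べ" should have reading "タベ" and
      // basic_form "食べる", so the inflected verb can be indexed
      // under its dictionary form.
      console.log(token.surface_form, token.reading, token.basic_form);
    }
  });
```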
I checked many tokenizers in the open-source community, but there seems to be very little work on this from the Japanese community, and the available information is sparse.
I only found tiny-segmenter.js and kuromoji.js, and both have gone largely unmaintained for years.
I'm looking for other ideas now...
Considering that a Japanese word has two forms, kanji and kana, what the two forms have in common is that both can be expressed in Roman letters, which can serve as a unique identifier when indexing text by words.
My idea is to first segment the Japanese text into words, then convert each resulting word, whether kana or kanji, into its romanization, and index the text on those romanized forms.
This solves the problem of matching the two written forms of a Japanese word; a sketch follows below.
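Here is a minimal sketch of that pipeline in TypeScript. It assumes kuromoji.js for segmentation (unmaintained, but workable for a prototype) and the wanakana package for the kana-to-romaji step; the `romanizedIndexTerms` name and both library choices are placeholders for illustration, not a settled design:

```ts
import kuromoji from "kuromoji";
import { toRomaji } from "wanakana";

let tokenizerPromise: Promise<kuromoji.Tokenizer<kuromoji.IpadicFeatures>> | null = null;

// Dictionary loading is slow, so build the tokenizer once and reuse it.
function getTokenizer() {
  if (!tokenizerPromise) {
    tokenizerPromise = new Promise((resolve, reject) => {
      kuromoji
        .builder({ dicPath: "node_modules/kuromoji/dict" })
        .build((err, tokenizer) => (err ? reject(err) : resolve(tokenizer)));
    });
  }
  return tokenizerPromise;
}

// Segment a sentence and map each token to a romanized index key.
async function romanizedIndexTerms(text: string): Promise<string[]> {
  const tokenizer = await getTokenizer();
  return tokenizer.tokenize(text).map((token) => {
    // Dictionary words carry a katakana reading; unknown words do not,
    // so fall back to the surface form before romanizing.
    const kana = token.reading ?? token.surface_form;
    return toRomaji(kana);
  });
}

// Both written forms of a word collapse to one key, e.g.
//   await romanizedIndexTerms("日本語")  // -> ["nihongo"]
// so a query in kana can match a document written in kanji, and vice versa.
```

A nice side effect is that the index keys become plain ASCII, so Japanese terms should be able to flow through the same indexing path as Latin-script terms.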