
[NEXT STEPS] Support for full-text search in multiple East Asian languages (Japanese, Korean and Chinese). #3

Open
PrinOrange opened this issue Apr 14, 2024 · 1 comment


@PrinOrange (Owner) commented Apr 14, 2024

This project already supports Chinese, Japanese and Korean typography, but its full-text search is still deficient: it cannot process Japanese and Korean sentences when indexing the search database.

I have researched some aspects of Japanese processing; its features are completely different from both Latin-script languages and Chinese:

  • Japanese words have two writing forms: kanji and kana. In other words, the same word effectively has to be indexed twice (see the sketch after this list). For example, "Japanese" has the two forms 「にほんご」 and 「日本語」.
  • Japanese words have more tense and morphological changes (verbs and adjectives inflect heavily).
  • Some Japanese sentences written purely in Chinese characters also conform to Chinese expression, so it is difficult to tell whether they are Japanese or Chinese sentences.
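To make the first point concrete, here is a minimal sketch of how indexing on the literal token string fails to match the two spellings (the Map-based index is hypothetical, purely for illustration):

```js
// A naive inverted index keyed on the literal token string
// (hypothetical, just to illustrate the mismatch).
const index = new Map();
index.set("日本語", [1, 7]); // posting list: ids of documents containing the word

// The kana spelling of the same word never matches the kanji key.
console.log(index.get("日本語"));   // [1, 7]
console.log(index.get("にほんご")); // undefined (no hit)
```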

I checked many tokenizers in the open-source community, but there seems to be very little work in this area from the Japanese community, and documentation is scarce.

I only found tiny-segmenter.js and kuromoji.js, and both have gone largely unmaintained for years.
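For what it's worth, kuromoji.js does expose a per-token katakana reading, which is exactly the piece needed to bridge the two writing forms. A minimal sketch, assuming the npm package with its bundled IPADIC dictionary under node_modules/kuromoji/dict:

```js
const kuromoji = require("kuromoji");

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    for (const t of tokenizer.tokenize("日本語を勉強します")) {
      // surface_form is the text as written; reading is the katakana
      // reading, which is the same for both spellings of a word.
      console.log(t.surface_form, t.basic_form, t.reading);
      // e.g. 日本語 日本語 ニホンゴ
    }
  });
```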

I'm seeking other ideas now...

@PrinOrange PrinOrange pinned this issue Apr 18, 2024
@PrinOrange changed the title from "[NEXT STEPS] The supports for multi sia languages (Japanese, Korean and Chinese) full text search." to "[NEXT STEPS] The supports for multi East-Asia languages (Japanese, Korean and Chinese) full text search." Sep 7, 2024
@PrinOrange (Owner, Author) commented

A Japanese word has two forms, kanji and kana, but what they have in common is that both can be represented in Roman letters, which can therefore serve as a unique identifier when indexing text by words.
My idea is to first segment the Japanese sentences, then convert each resulting word, whether kana or kanji, into its romanization, and finally index the text on that romanization.
This solves the problem of matching the two forms of a Japanese word.
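A rough sketch of that pipeline, assuming kuromoji.js for segmentation and readings plus wanakana's toRomaji for the kana-to-romaji step (any equivalent mapping would do). The last two calls show the intended behavior; whether the all-kana spelling really segments to the same reading depends on the dictionary:

```js
const kuromoji = require("kuromoji");
const wanakana = require("wanakana");

// Turn a sentence into romanized index keys:
// segment, take each token's katakana reading, romanize it.
function toIndexKeys(tokenizer, sentence) {
  return tokenizer.tokenize(sentence).map((t) =>
    // Fall back to the surface form when the dictionary gives
    // no reading (unknown words).
    wanakana.toRomaji(t.reading || t.surface_form)
  );
}

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    // Intended behavior: both spellings of "Japanese" yield the key
    // "nihongo", so a romaji-keyed index matches either form.
    console.log(toIndexKeys(tokenizer, "日本語"));
    console.log(toIndexKeys(tokenizer, "にほんご"));
  });
```

One caveat with this scheme: different words can share a reading, so romaji keys trade some precision for recall.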
