
[NEXT STEPS] Support for full-text search in multiple East Asian languages (Japanese, Korean and Chinese). #3

Open
PrinOrange opened this issue Apr 14, 2024 · 1 comment


@PrinOrange (Owner) commented Apr 14, 2024

This project already supports Chinese, Japanese and Korean typography, but its full-text search is still deficient: it cannot process Japanese and Korean sentences when indexing the search database.

I have researched some aspects of Japanese processing; its features are completely different from both Latin-script languages and Chinese:

  • Japanese words have two writing forms: kanji and kana. In other words, the same word effectively has to be indexed twice (see the sketch after this list). For example, "Japanese" has the two forms 「にほんご」 and 「日本語」.
  • Japanese words have more tense and morphological changes (verbs and adjectives inflect heavily).
  • Some Japanese sentences written purely in Chinese characters also conform to Chinese expression, so it is difficult to tell whether they are Japanese or Chinese sentences.
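To make the first point concrete, here is a minimal sketch of how indexing on the literal token string fails to match the two spellings (the Map-based index is hypothetical, purely for illustration):

```js
// A naive inverted index keyed on the literal token string
// (hypothetical, just to illustrate the mismatch).
const index = new Map();
index.set("日本語", [1, 7]); // posting list: ids of documents containing the word

// The kana spelling of the same word never matches the kanji key.
console.log(index.get("日本語"));   // [1, 7]
console.log(index.get("にほんご")); // undefined (no hit)
```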

I checked many tokenizers in the open-source community, but there seems to be very little work in this area from the Japanese community, and documentation is scarce.

I only found tiny-segmenter.js and kuromoji.js, and both have gone largely unmaintained for years.
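For what it's worth, kuromoji.js does expose a per-token katakana reading, which is exactly the piece needed to bridge the two writing forms. A minimal sketch, assuming the npm package with its bundled IPADIC dictionary under node_modules/kuromoji/dict:

```js
const kuromoji = require("kuromoji");

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    for (const t of tokenizer.tokenize("日本語を勉強します")) {
      // surface_form is the text as written; reading is the katakana
      // reading, which is the same for both spellings of a word.
      console.log(t.surface_form, t.basic_form, t.reading);
      // e.g. 日本語 日本語 ニホンゴ
    }
  });
```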

I'm seeking other ideas now...

@PrinOrange PrinOrange pinned this issue Apr 18, 2024
@PrinOrange changed the title from "[NEXT STEPS] The supports for multi sia languages (Japanese, Korean and Chinese) full text search." to "[NEXT STEPS] The supports for multi East-Asia languages (Japanese, Korean and Chinese) full text search." Sep 7, 2024
@PrinOrange (Owner, Author) commented

A Japanese word has two forms, kanji and kana, but what they have in common is that both can be represented in Roman letters, which can therefore serve as a unique identifier when indexing text by words.
My idea is to first segment the Japanese sentences, then convert each resulting word, whether kana or kanji, into its romanization, and finally index the text on that romanization.
This solves the problem of matching the two forms of a Japanese word.
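A rough sketch of that pipeline, assuming kuromoji.js for segmentation and readings plus wanakana's toRomaji for the kana-to-romaji step (any equivalent mapping would do). The last two calls show the intended behavior; whether the all-kana spelling really segments to the same reading depends on the dictionary:

```js
const kuromoji = require("kuromoji");
const wanakana = require("wanakana");

// Turn a sentence into romanized index keys:
// segment, take each token's katakana reading, romanize it.
function toIndexKeys(tokenizer, sentence) {
  return tokenizer.tokenize(sentence).map((t) =>
    // Fall back to the surface form when the dictionary gives
    // no reading (unknown words).
    wanakana.toRomaji(t.reading || t.surface_form)
  );
}

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    // Intended behavior: both spellings of "Japanese" yield the key
    // "nihongo", so a romaji-keyed index matches either form.
    console.log(toIndexKeys(tokenizer, "日本語"));
    console.log(toIndexKeys(tokenizer, "にほんご"));
  });
```

One caveat with this scheme: different words can share a reading, so romaji keys trade some precision for recall.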
