Persian language support for normalization and segmentation #304
Comments
@curquiza @Kerollmops I have an issue with the implementation: whatlang doesn't support a Persian script. Persian uses many Unicode characters that Arabic does not, for example: https://www.unicode.org/charts/PDF/U0600.pdf Because whatlang has no Persian script, I can't pass the normalization test for this issue. The whatlang repository is old and shows no activity toward adding a Persian script. I think it would be better for Meilisearch to fork whatlang and update the crate.
Hello @Ja7ad, thank you for all the details!
Some Persian characters are not supported in Arabic. Please see the attached screenshot.
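To make the point about Persian-specific characters concrete, here is a minimal sketch in Rust. It lists six letters that Persian orthography uses and standard Arabic orthography does not, and checks that they all sit in the Unicode Arabic block (U+0600 to U+06FF), which is why detection at the script level alone cannot tell the two apart. The names `PERSIAN_SPECIFIC`, `in_arabic_block`, and `looks_persian` are hypothetical and are not part of whatlang or Charabia. The heuristic is only illustrative, since other languages such as Urdu and Kurdish use some of these letters too.

```rust
/// Letters used in Persian but not in standard Arabic orthography.
/// Illustrative list only; it is an assumption of this sketch, not an official set.
const PERSIAN_SPECIFIC: [char; 6] = [
    '\u{067E}', // ARABIC LETTER PEH
    '\u{0686}', // ARABIC LETTER TCHEH
    '\u{0698}', // ARABIC LETTER JEH
    '\u{06AF}', // ARABIC LETTER GAF
    '\u{06A9}', // ARABIC LETTER KEHEH (Persian form of kaf)
    '\u{06CC}', // ARABIC LETTER FARSI YEH (Persian form of yeh)
];

/// Hypothetical helper: true if `c` lies in the Unicode Arabic block.
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical heuristic: true if the text contains at least one
/// of the Persian-specific letters listed above.
fn looks_persian(text: &str) -> bool {
    text.chars().any(|c| PERSIAN_SPECIFIC.contains(&c))
}
```

Every letter in the list passes `in_arabic_block`, so a detector that only looks at code-point ranges would report the Arabic script for both languages.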
Yes, I understood that. For instance, Chinese and Japanese are completely different languages but share some characters, the kanji. This forces Charabia to have a "virtual" script. The real question on my side is: should we normalize some Persian characters that are also used in Arabic, even though they shouldn't be normalized in an Arabic context? If yes, is Persian a language or a script?
Yes, it's the Persian language.
I agree with this issue. Arabic and Persian use many of the same letters, but they are quite different languages. They belong to different language families ( https://en.wikipedia.org/wiki/Language_family ). Persian grammar is closer to that of European languages than to Arabic. As a Japanese person, I feel that the difference between Persian and Arabic is similar to the one between Japanese and Chinese.
Hello
Thank you for your continuous efforts in maintaining and improving Charabia. I’m writing to request support for the Persian language in your normalization and segmentation modules, similar to the existing support for Arabic.
Background
Persian (Farsi) is a widely spoken language, using the same script as Arabic with some additional letters. Although Persian shares many similarities with Arabic, there are important differences in orthography, morphology, and syntax that require distinct handling for proper text processing, especially in tasks like tokenization, normalization, and segmentation.
Language codes: per, fa
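As an illustration of the kind of orthographic difference involved, here is a minimal sketch of one normalization step that Persian text processing often applies: mapping the Arabic forms of yeh and kaf to the forms used in Persian. The function names `to_persian_char` and `normalize_persian` are hypothetical, and the choice of mappings is an assumption of this sketch, not Charabia's existing behavior.

```rust
/// Hypothetical mapping from Arabic letter forms to the forms
/// preferred in Persian text. The set of mappings is an assumption.
fn to_persian_char(c: char) -> char {
    match c {
        '\u{064A}' => '\u{06CC}', // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        '\u{0643}' => '\u{06A9}', // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
        other => other,           // everything else is left unchanged
    }
}

/// Applies the mapping above to every character of `text`.
fn normalize_persian(text: &str) -> String {
    text.chars().map(to_persian_char).collect()
}
```

Applying such a mapping to Arabic text would alter it, so a step like this only makes sense once the text is known to be Persian.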
Feature Request
I would like to request the addition of Persian language support for:
Normalization:
Segmentation:
References
To aid in this implementation, here are the links to the current normalization and segmentation implementations for Arabic, which can serve as a starting point for Persian:
Conclusion
Implementing Persian language support would greatly benefit users who need to process Persian text accurately. Persian is distinct enough from Arabic that this feature would significantly improve text processing capabilities for Persian-speaking users. I’m happy to contribute in any way I can to support this effort.