Persian language support for normalization and segmentation #304

Ja7ad · 2024-08-12T09:32:51Z

Hello

Thank you for your continuous efforts in maintaining and improving Charabia. I’m writing to request support for the Persian language in your normalization and segmentation modules, similar to the existing support for Arabic.

Background

Persian (Farsi) is a widely spoken language, using the same script as Arabic with some additional letters. Although Persian shares many similarities with Arabic, there are important differences in orthography, morphology, and syntax that require distinct handling for proper text processing, especially in tasks like tokenization, normalization, and segmentation.

ISO language codes: fa (ISO 639-1), per (ISO 639-2)

Feature Request

I would like to request the addition of Persian language support for:

Normalization:
- Handling Persian-specific characters, such as "گ", "چ", "پ", "ژ".
- Differentiating between Arabic and Persian diacritics and letters where applicable (e.g., "ی" vs. "ي", "ک" vs. "ك").
- Normalizing Persian numerals (۰-۹) and ensuring compatibility with Arabic numerals where necessary.
Segmentation:
- Properly segmenting Persian text based on its unique grammatical structure.
- Handling word boundaries and tokenization in the context of Persian, considering the language's syntax and morphology.
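
The normalization items above could be sketched roughly as follows. This is only an illustration using the Rust standard library, not Charabia's normalizer API: the function names are hypothetical, and folding the Arabic forms to the Persian ones and mapping both digit sets to ASCII are assumptions about the desired behavior.

```rust
/// Hypothetical helper: turn a digit offset (0..=9) into an ASCII digit.
fn ascii_digit(offset: u32) -> char {
    char::from_digit(offset, 10).expect("offset is always in 0..=9")
}

/// Hypothetical per-character normalization for Persian text.
/// Assumption: Arabic YEH and KAF are folded to the Persian forms,
/// and both Persian and Arabic-Indic digits are mapped to ASCII digits.
fn normalize_persian_char(c: char) -> char {
    match c {
        // U+064A ARABIC LETTER YEH -> U+06CC ARABIC LETTER FARSI YEH
        '\u{064A}' => '\u{06CC}',
        // U+0643 ARABIC LETTER KAF -> U+06A9 ARABIC LETTER KEHEH
        '\u{0643}' => '\u{06A9}',
        // U+06F0..U+06F9: Extended Arabic-Indic (Persian) digits
        '\u{06F0}'..='\u{06F9}' => ascii_digit(c as u32 - 0x06F0),
        // U+0660..U+0669: Arabic-Indic digits
        '\u{0660}'..='\u{0669}' => ascii_digit(c as u32 - 0x0660),
        // Everything else is left untouched.
        _ => c,
    }
}

/// Apply the per-character normalization to a whole string.
fn normalize_persian(text: &str) -> String {
    text.chars().map(normalize_persian_char).collect()
}
```

Whether folding should go toward the Persian forms or the Arabic ones, and whether digits should become ASCII at all, would depend on how the index is expected to match queries.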

References

To aid in this implementation, here are the links to the current normalization and segmentation implementations for Arabic, which can serve as a starting point for Persian:

Conclusion

Implementing Persian language support would greatly benefit users who need to process Persian text accurately. Persian is distinct enough from Arabic that this feature would significantly improve text processing capabilities for Persian-speaking users. I’m happy to contribute in any way I can to support this effort.


Ja7ad · 2024-08-18T09:31:40Z

@curquiza @Kerollmops I have run into an issue with the implementation: whatlang doesn't support the Persian script.

Persian uses many Unicode characters that Arabic doesn't. For example, see:

https://www.unicode.org/charts/PDF/U0600.pdf

Because of this, I can't get the normalization test to pass, since whatlang doesn't support the Persian script.

That repo is old and shows no activity toward adding the Persian script.

Ja7ad@f9b58e0

Ja7ad@029423a

I think it would be better for Meilisearch to fork whatlang and update the crate.
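
To make the point about Persian-specific code points concrete, here is a minimal std-only Rust sketch. It does not use whatlang, and the names are hypothetical. It checks for the four letters listed in the feature request (peh, tcheh, jeh, gaf), all of which sit in the Arabic block covered by the chart linked above. These letters also occur in other Arabic-script languages such as Urdu and Kurdish, so their presence is only a hint, not a language detector.

```rust
/// Letters from the feature request that standard Arabic does not use:
/// U+067E PEH, U+0686 TCHEH, U+0698 JEH, U+06AF GAF.
const PERSIAN_SPECIFIC: [char; 4] = ['\u{067E}', '\u{0686}', '\u{0698}', '\u{06AF}'];

/// True if the character lies in the Unicode Arabic block (U+0600..U+06FF).
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical heuristic: does the text contain any of the letters above?
fn has_persian_specific_letters(text: &str) -> bool {
    text.chars().any(|c| PERSIAN_SPECIFIC.contains(&c))
}
```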

ManyTheFish · 2024-08-27T07:35:48Z

Hello @Ja7ad,
WhatLang doesn't support the Persian script, but which script is assigned to Persian instead? Arabic?
If I understand correctly, Arabic and Persian share a lot of characters; if that's the case, I'd like to consider them the same script for Charabia. This would avoid splitting words into parts.
If everything were considered Arabic, would it be relevant to apply this normalization to any Arabic language?

Thank you for all the clarifications!

Ja7ad · 2024-08-27T07:38:34Z

Some Persian characters are not supported in Arabic; please see the attached screenshot.

ManyTheFish · 2024-08-28T06:48:54Z

Yes, I understood that.
However, Charabia's technical approach is a simplification of the actual linguistic reality of languages.
For instance, the characters you listed before are considered Arabic by Charabia, even if that's not strictly accurate.
But treating Persian and Arabic as the same script is convenient if they share a lot of common characters.

For instance, Chinese and Japanese are completely different but share some characters, the kanji. This forces Charabia to have a "virtual" script, Cj, containing both scripts, which avoids splitting a word in two because it contains different scripts.

The real question on my side is: should we normalize some Persian characters that are also used in the Arabic language and that shouldn't be normalized when they appear in an Arabic context?

If yes, is Persian a language or a script?
If no, normalizing your characters anyway should work.
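
As a rough illustration of why a shared script avoids splitting words, here is a std-only Rust sketch. The `Script` enum and both functions are hypothetical and are not Charabia's actual types; the only assumption carried over from the discussion is that every character in the Arabic block, including the Persian-specific letters, is classified as one script.

```rust
/// Hypothetical, simplified script classification (not Charabia's enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Arabic, // also covers Persian-specific letters such as U+06AF GAF
    Latin,
    Other,
}

/// Classify a character; the whole Arabic block maps to one script.
fn script_of(c: char) -> Script {
    match c {
        '\u{0600}'..='\u{06FF}' => Script::Arabic,
        'a'..='z' | 'A'..='Z' => Script::Latin,
        _ => Script::Other,
    }
}

/// Split text into maximal runs of characters that share a script.
fn script_runs(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let script = script_of(c);
        // Extend the current run when the script is unchanged.
        if let Some((last, run)) = runs.last_mut() {
            if *last == script {
                run.push(c);
                continue;
            }
        }
        // Otherwise start a new run.
        runs.push((script, c.to_string()));
    }
    runs
}
```

With this classification, a word that mixes a Persian-specific letter with letters shared with Arabic stays in a single run. If Persian were a separate script, the same word would be cut at every switch between the two sets of letters.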

Ja7ad · 2024-08-28T07:03:18Z

Yes, Persian is a language.
Even its segmentation is different.

kamiyn · 2024-08-28T07:22:07Z

I agree with this issue.

Arabic and Persian use many of the same letters, but they are quite different languages. They belong to different language families ( https://en.wikipedia.org/wiki/Language_family ), and their grammar is completely different.

Persian grammar is closer to that of European languages than to that of Arabic.

As a Japanese person, I feel that the difference between Persian and Arabic is similar to the difference between Japanese and Chinese.

curquiza added the "good first issue" (Good for newcomers) label on Aug 12, 2024
