Implement Arabic transliteration and a "fully-pointed Arabic" form #309

Open
ronaldtse opened this issue Jun 8, 2020 · 3 comments
ronaldtse commented Jun 8, 2020

Some written scripts do not usually specify all phonemic elements in a word. These include Arabic, Syriac, and Hebrew.

When Arabic script is transliterated to other scripts, the three short vowels and the shadda (the consonant-doubling mark) are usually missing from the Arabic source. The fully expressed form of Arabic script is called "pointed script", but it still does not necessarily represent all linguistic elements (e.g. shadda). Most importantly, in the GNDB, most Arabic-script entries are written in unpointed form.

The only trusted way today to fill in the three short vowels and the shadda is learned experience of the conventions of the language and culture (or machine learning).

To transliterate an Arabic word with different transliteration systems, we first need to extract the missing linguistic information from existing transliterations (from the GNDB) and generate the "fully pointed Arabic" (Arabic script with full linguistic information). Once we have that, we can transliterate these words using any transliteration system.

Therefore the approach is:

  1. For every existing Arabic/transliteration pair, generate the "fully pointed Arabic", e.g.
    (Makkah, َمكة) in the BGN system
    We need to extract the two short "a" vowels, and the 'kk' shadda.

  2. Using the "fully pointed Arabic", we can generate transliterations.

  3. Using the "fully pointed Arabic" and the original "unpointed Arabic", we can feed this into machine-learning (per language) to potentially allow a mapping from "unpointed Arabic" to "fully pointed Arabic".
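For step 3, one possible way to build the (unpointed, pointed) training pairs is to strip the points from each generated fully pointed form. This is only a minimal Python sketch: it assumes "points" means the combining marks U+064B to U+0652, and `strip_points` and `training_pairs` are hypothetical names, not part of any existing code.

```python
# Arabic combining marks U+064B..U+0652: tanwin, short vowels, shadda, sukun.
FIRST_POINT, LAST_POINT = "\u064B", "\u0652"


def strip_points(pointed: str) -> str:
    """Return the word with all points in the assumed range removed."""
    return "".join(ch for ch in pointed if not FIRST_POINT <= ch <= LAST_POINT)


def training_pairs(pointed_words):
    """Build (unpointed, pointed) pairs for a learner to train on."""
    return [(strip_points(word), word) for word in pointed_words]


# meem + fatha, kaf + shadda + fatha, ta marbuta
MAKKAH_POINTED = "\u0645\u064E\u0643\u0651\u064E\u0629"
print(training_pairs([MAKKAH_POINTED]))
```

Note that this does not undo a hamza that was added on an initial alif, so the stripped form may differ slightly from the original unpointed spelling in such cases.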

Some rules:

  • If a transliteration character is written twice in a row in a word, add a shadda over the corresponding letter in the fully pointed Arabic.

  • If there are short vowels following consonants in the transliteration, add those short vowels to those consonants in the fully pointed Arabic.

  • If there are short vowels at the beginning of words, and the hamza is lacking in the Arabic, we need to add the hamza in the fully pointed Arabic.
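The three rules above could be sketched roughly as follows in Python. This is an illustration only, not the interscript implementation: the `CONSONANTS` table is a deliberately tiny, hypothetical mapping, the function name `point` is made up, and the sketch assumes every Arabic letter corresponds to exactly one Latin consonant.

```python
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"
SHORT_VOWELS = {"a": FATHA, "u": DAMMA, "i": KASRA}
ALIF, ALIF_HAMZA_ABOVE, ALIF_HAMZA_BELOW = "\u0627", "\u0623", "\u0625"

# Hypothetical, incomplete table (Arabic letter -> Latin consonant).
CONSONANTS = {
    "\u0645": "m",  # meem
    "\u0643": "k",  # kaf
    "\u0628": "b",  # ba
    "\u062F": "d",  # dal
    "\u0629": "h",  # ta marbuta, assumed to be written "h" as in "Makkah"
}


def point(arabic: str, latin: str) -> str:
    """Add short vowels, shadda and initial hamza to unpointed Arabic."""
    latin = latin.lower()
    out, pos = [], 0
    for index, letter in enumerate(arabic):
        if index == 0 and letter == ALIF and latin[:1] in SHORT_VOWELS:
            # Rule 3: word-initial short vowel on a bare alif gets a hamza.
            vowel = latin[0]
            out.append(ALIF_HAMZA_BELOW if vowel == "i" else ALIF_HAMZA_ABOVE)
            out.append(SHORT_VOWELS[vowel])
            pos = 1
            continue
        sound = CONSONANTS.get(letter)
        if sound is None or not latin.startswith(sound, pos):
            raise ValueError(f"cannot align {arabic!r} with {latin!r}")
        out.append(letter)
        pos += len(sound)
        if latin.startswith(sound, pos):
            # Rule 1: consonant written twice in a row -> shadda.
            out.append(SHADDA)
            pos += len(sound)
        vowel = latin[pos:pos + 1]
        if vowel in SHORT_VOWELS:
            # Rule 2: short vowel after the consonant -> vowel mark.
            out.append(SHORT_VOWELS[vowel])
            pos += 1
    if pos != len(latin):
        raise ValueError(f"unconsumed transliteration in {latin!r}")
    return "".join(out)


# (Makkah, meem-kaf-ta marbuta): two short "a" vowels and the "kk" shadda.
print(point("\u0645\u0643\u0629", "Makkah"))
```

A pair that cannot be aligned raises `ValueError` instead of guessing, so a caller can skip the row and continue with the rest of the data.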

This is a blocker to:
#244 #236 #219 #33 #32 #26 #25 #12 #11 #7

@ronaldtse ronaldtse added the enhancement New feature or request label Jun 8, 2020
@ronaldtse

The goal of this task is to transliterate Arabic place names using different transliteration systems. e.g. some systems write "Mekkah", some "Mecca"

Arabic, in my very simplistic understanding, does not usually write out short vowels or the shadda.

In this screenshot, you can see that the Arabic does not write out all short vowels:
image

One would only be able to fill in those short vowels if they know the language and context well. The GNDB, as extracted into https://github.com/interscript/geonames-transliteration-data, contains a database of human-transliterated place names.

This place name database therefore already contains the short vowels and shadda information in the transliteration columns.

We wish to reverse-transliterate this Latin script back to a "fully pointed form of Arabic", such as k => ك, kk => كّ.

With the "fully pointed form", we can then transliterate (forward) this "fully pointed form" using any Arabic transliteration system, such as ALA (https://www.loc.gov/catdir/cpso/romanization/arabic.docx)
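As a rough illustration of the forward step, the Python sketch below runs one fully pointed word through two different mapping tables. Both tables (`system_a`, `system_b`) are made up for this example and are not the BGN or ALA rules; the sketch also assumes the shadda is stored directly after its consonant.

```python
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"

# Made-up base table; real systems have many more rules and contexts.
BASE = {
    "\u0645": "m",  # meem
    "\u0643": "k",  # kaf
    "\u0629": "h",  # ta marbuta
    FATHA: "a",
    DAMMA: "u",
    KASRA: "i",
}

SYSTEMS = {
    "system_a": BASE,
    "system_b": {**BASE, FATHA: "e"},  # hypothetical variant spelling fatha as "e"
}


def transliterate(pointed: str, system: str) -> str:
    """Map each character of a fully pointed word through the chosen table."""
    table = SYSTEMS[system]
    out = []
    for ch in pointed:
        if ch == SHADDA:
            # Assumes the previous output item is the consonant to double.
            if out:
                out.append(out[-1])
        else:
            out.append(table[ch])
    return "".join(out).capitalize()


# meem + fatha, kaf + shadda + fatha, ta marbuta
MAKKAH_POINTED = "\u0645\u064E\u0643\u0651\u064E\u0629"
for name in SYSTEMS:
    print(name, transliterate(MAKKAH_POINTED, name))
```

The point of the design is that only the table changes between systems, while the pointed input stays the same.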

(In the given database, each row is an Arabic / Latin transliterated pair.)

For some other languages, it is not so complex: https://github.com/interscript/interscript/pull/304

Some are more complex: https://github.com/interscript/interscript/pull/258

This task is to:

  1. make the framework "work" with Arabic,
  2. enable the generation of fully pointed Arabic
  3. implement the Arabic => Latin transliteration systems

@ronaldtse

Ping @AhMohsen46

@ronaldtse

@AhMohsen46 Feel free to continue on #33 , however:

  1. You will want to work on the transliteration system with a backing data file.

For Arabic, you can see that:

  • ara_Arab2Latn_BGN_1956 has 26.6 MB
  • ara_Arab2Latn_ALA_1997 has 14 KB
  • the other systems have only one or a few rows, which makes them hard to test


For Persian,

  • fas_Arab2Latn_BGN_1958 has 31.2 MB
  • fas_Arab2Latn_ALA_1997 has 24 KB
  • fas_Arab2Latn_AMMI_1959 has 4 KB
  • fas_Arab2Latn_NCO_2004 has 5 KB


  2. You will also need to implement reverse transliteration. Right now, the transliteration systems implemented cannot be used in a reversible way. We currently don't have a method of indicating that a rule can be performed in reverse.

  3. Also note that not all transliteration data in geonames-transliteration-data are correct -- there are some mislabeled entries or wrongly transliterated entries (they were done by humans). So the script you create should take that into consideration (i.e. don't fail!)
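One way to keep the script from failing on bad rows is to collect rejected entries instead of raising. The Python sketch below is only an assumption about how that could look: `process_rows` and `toy_convert` are hypothetical names, and the check in `toy_convert` stands in for whatever real conversion is used.

```python
def process_rows(rows, convert):
    """Convert each (arabic, latin) row; collect failures instead of stopping."""
    converted, rejected = [], []
    for row_number, (arabic, latin) in enumerate(rows, start=1):
        try:
            converted.append((arabic, latin, convert(arabic, latin)))
        except (ValueError, KeyError) as error:
            # Keep the row number and reason so bad data can be reviewed later.
            rejected.append((row_number, arabic, latin, str(error)))
    return converted, rejected


def toy_convert(arabic, latin):
    """Stand-in converter: rejects rows whose Latin column holds Arabic script."""
    if any("\u0600" <= ch <= "\u06FF" for ch in latin):
        raise ValueError("Latin column contains Arabic script")
    return f"{arabic}|{latin.lower()}"


rows = [
    ("\u0645\u0643\u0629", "Makkah"),              # plausible row
    ("\u0645\u0643\u0629", "\u0645\u0643\u0629"),  # mislabeled row
]
converted, rejected = process_rows(rows, toy_convert)
print(len(converted), "converted;", len(rejected), "rejected")
```

The rejected list can be written to a separate file so that mislabeled or wrongly transliterated entries are reported without interrupting the run.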

Thanks!
