A technology to design transliteration systems from scratch, integrative of language experts workflow with a graphical editor.
Transliteration is the process of transforming a language into another script, transforming the letters such as to maintain the sounds.
No corpora are available with Farsi at the time of the creation of this library, and so we needed to build the transliteration strategy from scratch.
The building of a proper library needs the composition of mappings of language roots, lemmas and affixes into constructions via a number of rules. Additional NLP components are involved: Part of Speech tagger, lemmatizer and stemming.
In order to make the code easier to edit, reusable and with less dependency on the developer, we looked for some way to empower the linguist and encapsulate both dev.'s' and linguist’s workflows.
We also needed to make production code independant of some python libraries and details and easy to build. Our codes are generated from data in ways explained below.
The Library/strategy can be tried on a simple, QUICKSTART. A more complete version can be found under documentation.
We have summarised the reflection and results into a short ARTICLE. and also a PRESENTATION.
For achieving this, we propose the following approach:
-
Graphical Editor
-
allowing a linguist to easily design and reshape complex rules
-
-
"Learn" Graphs
-
software capable to convert above mentionned graphical rules into code.
-
-
Transliteration of massive Farsi dataset
-
We have put together massive dataset for this purpose.
-
-
Training of neural network architecture
-
transliteration is approached as translation and with similar tools.
-
there are 2 repositories with codes at the char or word levels.
-
More detailed resources can be found in the form of a blogpost, as well as a short article.
The Library is structured as described below. More details are provided within the subfolders.
The workflow is also sketched.
-
-
code: creates customised transliteration code from expert diagrams and mappings code currently support lucidcharts, whereas basic format has to be followed for diagrams design
-
tests and benchmarks: basics benchmarks can run from a simple data format, allowing for easy evaluation of new diagrams or tweaks in the code.
-
data + transliteration for learning: We provide with farsi datasets that can be transliterated by our code.
-
-
-
training of transformer architecture, transliteration learned as transliteration
-
export and decomposition of best models into onnx format, so that they can be used in production
-
-
-
production code in ruby
-
importing and combining transformer models trained in python to transliterate samples or files
-
Notice that two versions were developed for the encoding and are on different branches: * farsi word level * char level encoding The char level encoding was made possible using a simplified version inspired and modified from Andrej Karpathy.