transliteration-learner-from-graphs

A technology to design transliteration systems from scratch, integrative of language experts workflow with a graphical editor.

Motivation & Approach

Motivation

Transliteration is the process of transforming a language into another script, transforming the letters such as to maintain the sounds.

No corpora are available with Farsi at the time of the creation of this library, and so we needed to build the transliteration strategy from scratch.

The building of a proper library needs the composition of mappings of language roots, lemmas and affixes into constructions via a number of rules. Additional NLP components are involved: Part of Speech tagger, lemmatizer and stemming.

In order to make the code easier to edit, reusable and with less dependency on the developer, we looked for some way to empower the linguist and encapsulate both dev.'s' and linguist’s workflows.

We also needed to make production code independant of some python libraries and details and easy to build. Our codes are generated from data in ways explained below.

Quickstart & Documentation

The Library/strategy can be tried on a simple, QUICKSTART. A more complete version can be found under documentation.

Paper and Presentation

We have summarised the reflection and results into a short ARTICLE. and also a PRESENTATION.

Approach

For achieving this, we propose the following approach:

Graphical Editor
- allowing a linguist to easily design and reshape complex rules
"Learn" Graphs
- software capable to convert above mentionned graphical rules into code.
Transliteration of massive Farsi dataset
- We have put together massive dataset for this purpose.
Training of neural network architecture
- transliteration is approached as translation and with similar tools.
- there are 2 repositories with codes at the char or word levels.

More detailed resources can be found in the form of a blogpost, as well as a short article.

Library + Workflow

The Library is structured as described below. More details are provided within the subfolders.

The workflow is also sketched.

learn-graph
- code: creates customised transliteration code from expert diagrams and mappings code currently support lucidcharts, whereas basic format has to be followed for diagrams design
- tests and benchmarks: basics benchmarks can run from a simple data format, allowing for easy evaluation of new diagrams or tweaks in the code.
- data + transliteration for learning: We provide with farsi datasets that can be transliterated by our code.
python-nnets-torch
- training of transformer architecture, transliteration learned as transliteration
- export and decomposition of best models into onnx format, so that they can be used in production
lib
- production code in ruby
- importing and combining transformer models trained in python to transliterate samples or files

Notice that two versions were developed for the encoding and are on different branches: * farsi word level * char level encoding The char level encoding was made possible using a simplified version inspired and modified from Andrej Karpathy.

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
config		config
docs		docs
learn-graph		learn-graph
lib		lib
python-nnets-char_level		python-nnets-char_level
python-nnets-demo-word-level		python-nnets-demo-word-level
resources		resources
DOCUMENTATION.adoc		DOCUMENTATION.adoc
QUICKSTART.adoc		QUICKSTART.adoc
README.adoc		README.adoc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

transliteration-learner-from-graphs

Motivation & Approach

Motivation

Quickstart & Documentation

Paper and Presentation

Approach

Library + Workflow

About

Releases

Packages

Contributors 2

Languages

interscript/transliteration-learner-from-graphs

Folders and files

Latest commit

History

Repository files navigation

transliteration-learner-from-graphs

Motivation & Approach

Motivation

Quickstart & Documentation

Paper and Presentation

Approach

Library + Workflow

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages