HTTYB 0.13: How To Train Your Bicleaner
(For Bicleaner v0.13 and above)
In this article we'll develop an example to illustrate the recommended way to train Bicleaner from scratch. Of course you can follow your own way, but let us unveil our secrets on Bicleaner training (trust us, we have done this a zillion times before).
If after reading this guide you still have questions or need clarification, please don't hesitate to open a new issue.
Let's assume you'd like to train a Bicleaner for English-Icelandic (en-is). To do this, you will need:
- A probabilistic dictionary for English->Icelandic (is_word en_word prob)
- A probabilistic dictionary for Icelandic->English (en_word is_word prob)
- A training corpus (ideally, around 100K very clean en-is parallel sentences)
After training, you will get:
- An English-Icelandic classifier
- A character model for English
- A character model for Icelandic
- A yaml file with metadata
If you already have all the ingredients (training corpus and dictionaries) beforehand, you won't need to do anything else before running the training command. If not, don't worry: below we'll show you how to get them all.
Good news: You can build everything needed to train Bicleaner from a single parallel corpus.
- If you don't have a corpus large enough to build probabilistic dictionaries (a few million lines), you can download smaller corpora from Opus and cat them together to get a larger one (see the sketch after this list).
- If you have TMXs, you can convert them to plain text by using tmxt:
python3.7 tmxt/tmxt.py --codelist en,is smallcorpus.en-is.tmx smallcorpus.en-is.txt
- If your corpus happens to be pre-tokenized (this sometimes happens when downloading from Opus), you need to detokenize it:
cut -f1 smallcorpus.en-is.txt > smallcorpus.en-is.en
cut -f2 smallcorpus.en-is.txt > smallcorpus.en-is.is
moses/tokenizer/detokenizer.perl -l en < smallcorpus.en-is.en > smallcorpus.en-is.detok.en
moses/tokenizer/detokenizer.perl -l is < smallcorpus.en-is.is > smallcorpus.en-is.detok.is
paste smallcorpus.en-is.detok.en smallcorpus.en-is.detok.is > smallcorpus.en-is
- If you do not have enough sentences in your source or target languages, you can try translating from another language by using Apertium. For example, if you want to turn an English-Swedish corpus into English-Icelandic:
cut -f1 corpus.en-sv > corpus.en-sv.en
cut -f2 corpus.en-sv > corpus.en-sv.sv
cat corpus.en-sv.sv | apertium-destxt -i | apertium -f none -u swe-isl | apertium-retxt > corpus.en-is.is
paste corpus.en-sv.en corpus.en-is.is > corpus.en-is
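A minimal sketch of the concatenation mentioned in the first bullet above; the file names are just placeholders for whatever en-is corpora you actually downloaded from Opus:
cat opuscorpus1.en-is.txt opuscorpus2.en-is.txt opuscorpus3.en-is.txt > smallcorpus.en-is.txt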
To build the probabilistic dictionaries, you need a parallel corpus of several million sentences. First, tokenize and lowercase the corpus:
cat bigcorpus.en-is | cut -f1 > bigcorpus.en-is.en
cat bigcorpus.en-is | cut -f2 > bigcorpus.en-is.is
moses/tokenizer/tokenizer.perl -l en -no-escape < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
moses/tokenizer/tokenizer.perl -l is -no-escape < bigcorpus.en-is.is > bigcorpus.en-is.tok.is
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is
mv bigcorpus.en-is.tok.low.en bigcorpus.en-is.clean.en
mv bigcorpus.en-is.tok.low.is bigcorpus.en-is.clean.is
And then, build the probabilistic dictionaries:
mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /your/working/directory --corpus bigcorpus.en-is.clean -e en -f is --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /your/path/here/mgiza/mgizapp/bin/
Your probabilistic dictionaries should contain entries like these:
- lex.e2f:
Probability of an English word translating into a given Icelandic word. In this example, rediscover can be translated as enduruppgötva or verðskuldið with the same probability (0.5):
...
enduruppgötva rediscover 0.5000000
verðskuldið rediscover 0.5000000
...
- lex.f2e:
Probability of an Icelandic word translating into a given English word. In this example, rediscover can be the translation of enduruppgötva with a probability of 0.33, or the translation of verðskuldið with a probability of 0.12:
...
rediscover enduruppgötva 0.3333333
rediscover verðskuldið 0.1250000
...
At this point, you could just gzip your dictionaries:
gzip lex.e2f -c > dict-en.gz
gzip lex.f2e -c > dict-is.gz
and you'll have the two required probabilistic dictionaries, fully compatible with Bicleaner. However, we recommend pruning them to remove very uncommon entries (for example, those whose probability is more than 10 times lower than the highest one for that word):
python3.7 bicleaner/utils/dict_pruner.py lex.e2f dict-en.gz -n 10 -g
python3.7 bicleaner/utils/dict_pruner.py lex.f2e dict-is.gz -n 10 -g
Please note that both target and source words in probabilistic bilingual dictionaries must be single words.
If you have a super clean parallel corpus, containing around 100K parallel sentences, you can skip this part. If not, you can build a cleaner corpus from a not-so-clean parallel corpus by using Bifixer and the Bicleaner Hardrules.
First, apply Bifixer:
python3.7 bifixer/bifixer/bifixer.py --scol 1 --tcol 2 --ignore_duplicates corpus.en-is corpus.en-is.bifixed en is
Then, apply the hardrules:
python3.7 bicleaner/bicleaner/bicleaner_hardrules.py corpus.en-is.bifixed corpus.en-is.annotated -s en -t is --scol 1 --tcol 2 --annotated_output --disable_lm_filter
If any of your source or target languages is easily mistaken for other similar languages (for example, Norwegian and Danish, or Galician and Portuguese), you may need to use the --disable_lang_ident flag when running Hardrules. You can detect whether this is happening by running:
cat corpus.en-is.annotated | awk -F'\t' '{print $4}' | sort | uniq -c | sort -nr
If the counts for language-related annotations are high (c_different_language, c_reliable_long_language(right, targetlang) and/or c_reliable_long_language(left, sourcelang)), you are probably experiencing this issue (so you really want to use the --disable_lang_ident flag for training as well).
Once you have an annotated version of your corpus, you can keep the cleanest parallel sentences and use them as a training corpus (100K sentences is a good number):
cat corpus.en-is.annotated | grep "keep$" | shuf -n 100000 | cut -f1,2 > trainingcorpus.en-is
The most commonly used command (and the one you probably want to use) is the following:
python3.7 bicleaner/bicleaner_train.py \
trainingcorpus.en-is \
--treat_oovs --normalize_by_length \
-s en -t is \
-d dict-en.gz -D dict-is.gz \
-b 1000 -c en-is.classifier \
-g 50000 -w 50000 \
-m en-is.yaml \
--classifier_type random_forest \
--lm_training_file_sl lmtrain.en-is.en --lm_training_file_tl lmtrain.en-is.is \
--lm_file_sl model.en-is.en --lm_file_tl model.en-is.is
Remember to check all the available options in the README and choose those that are most useful for you.
Tip: If you plan to distribute your 'language pack', you'll want to modify en-is.yaml to use relative paths instead of absolute ones.
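A minimal, hedged sketch of that change, assuming all the paths written into the metadata file share a single training-directory prefix (replace /your/working/directory/ with your actual path; the exact keys in the yaml depend on your Bicleaner version):
sed -i 's|/your/working/directory/||g' en-is.yaml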
At this point, you probably want to try your freshly trained Bicleaner to clean an actual corpus. Just run:
python3.7 bicleaner/bicleaner/bicleaner_classifier_full.py testcorpus.en-is testcorpus.en-is.classified en-is.yaml --scol 1 --tcol 2
After running Bicleaner, you'll have a new file (testcorpus.en-is.classified) with the same content as the input file (testcorpus.en-is) plus an extra column. This new column contains the score given by the classifier to each pair of parallel sentences. If the score is 0, the sentence pair was discarded by the Hardrules filter or the language model. If the score is above 0, the pair made it to the classifier, and the closer the score is to 1, the better the pair. For most languages (and distributed language packs), we consider a sentence pair very likely to be good when its score is above 0.7.
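As an illustration (assuming your input corpus has exactly two columns, so the classifier score lands in column 3), you could keep only the pairs scoring 0.7 or higher like this:
awk -F'\t' '$3 >= 0.7' testcorpus.en-is.classified > testcorpus.en-is.filtered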
Bicleaner is a tool in Python that aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near to 1) or not (with a value near to 0). Sentence pairs considered very noisy are scored with 0.
Bicleaner works with Python 3.6+ and can be installed with pip:
python3.7 -m pip install bicleaner
You can also download it from GitHub:
git clone https://github.com/bitextor/bicleaner
cd bicleaner
python3.7 -m pip install -r requirements.txt
It also requires KenLM with support for 7-gram language models:
git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build
cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install
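An optional quick check that the KenLM Python module is now importable:
python3.7 -c "import kenlm"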
Mgiza is a word alignment tool that we use to build probabilistic dictionaries.
git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake .
make
make install
export PATH=$PATH:/your/path/here/mgiza/mgizapp/bin
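An optional quick check that the MGIZA binaries are reachable after the export (assuming a standard mgizapp build, which produces a binary named mgiza):
which mgiza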
Moses is a statistical machine translation system. We use it for tokenization and (together with Mgiza) for probabilistic dictionary building.
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j32
Finally, copy the merge_alignment.py script next to the MGIZA binaries, since Moses' train-model.perl needs it when aligning with mgiza:
cp /your/path/here/mgiza/experimental/alignment-enabled/MGIZA/scripts/merge_alignment.py /your/path/here/mgiza/mgizapp/bin/
We use KenLM to build the character language models needed in Bicleaner.
git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build
cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/python/env/path/here/
make -j all install
tmxt is a tool that extracts plain text parallel corpora from TMX files.
git clone http://github.com/sortiz/tmxt
python3.7 -m pip install -r tmxt/requirements.txt
Two tools are available: tmxplore.py, which determines the language codes available inside a TMX file, and tmxt.py, which transforms the TMX into a tab-separated text file.
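As a hedged usage sketch (the exact output format may vary between versions), you can run tmxplore.py on a TMX file to list the language codes to pass to tmxt.py's --codelist option:
python3.7 tmxt/tmxplore.py smallcorpus.en-is.tmx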
Apertium is a platform for developing rule-based machine translation systems. It can be useful to translate to a given language when you do not have enough parallel text.
In Ubuntu and other Debian-like operating systems:
wget http://apertium.projectjj.com/apt/install-nightly.sh
sudo bash install-nightly.sh
sudo apt-get update
sudo apt-get install apertium-LANGUAGE-PAIR
(choose your appropriate apertium-LANGUAGE-PAIR from the list shown by apt search apertium)
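For instance, for the Swedish-Icelandic example used earlier, the package would presumably be named as follows (confirm the exact name with apt search apertium before installing):
sudo apt-get install apertium-swe-isl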
For other systems, please read Apertium documentation.
Bifixer is a tool that fixes bitexts and tags near-duplicates for removal. It's useful for fixing errors in your training corpus.
git clone https://github.com/bitextor/bifixer.git
python3.7 -m pip install -r bifixer/requirements.txt