HTTYB 0.13: How To Train Your Bicleaner
(For Bicleaner v0.13 and above)
In this article we'll develop an example to illustrate the recommended way to train Bicleaner from scratch. Of course you can follow your own way, but let us unveil our secrets on Bicleaner training (trust us, we have done this a zillion times before).
If after reading this guide you still have questions or need clarification, please don't hesitate to open a new issue.
Let's assume you'd like to train a Bicleaner for English-Icelandic (en-is). To do this, you will need:
- A probabilistic dictionary for English->Icelandic (is_word en_word prob)
- A probabilistic dictionary for Icelandic->English (en_word is_word prob)
- A training corpus (ideally, around 100K very clean en-is parallel sentences)
After training, you will get:
- An English-Icelandic classifier
- A character model for English
- A character model for Icelandic
- A yaml file with metadata
If you already have all the ingredients (training corpus and dictionaries) beforehand, you won't need to do anything else before running the training command. If not, don't worry: below we'll show you how to get them all.
Good news: You can build everything needed to train Bicleaner from a single parallel corpus.
- If you don't have a corpus large enough to build probabilistic dictionaries (a few million lines), you can download smaller corpora from Opus and cat them together to get a larger one (see the sketch after this list).
- If you have TMXs, you can convert them to plain text by using tmxt:
python3.7 tmxt/tmxt.py --codelist en,is smallcorpus.en-is.tmx smallcorpus.en-is.txt
- If your corpus happens to be pre-tokenized (this sometimes happens when downloading from Opus), you need to detokenize it:
cut -f1 smallcorpus.en-is.txt > smallcorpus.en-is.en
cut -f2 smallcorpus.en-is.txt > smallcorpus.en-is.is
moses/tokenizer/detokenizer.perl -l en < smallcorpus.en-is.en > smallcorpus.en-is.detok.en
moses/tokenizer/detokenizer.perl -l is < smallcorpus.en-is.is > smallcorpus.en-is.detok.is
paste smallcorpus.en-is.detok.en smallcorpus.en-is.detok.is > smallcorpus.en-is
- If you do not have enough sentences in your source or target languages, you can try translating from another language by using Apertium. For example, if you want to turn an English-Swedish corpus into English-Icelandic:
cut -f1 corpus.en-sv > corpus.en-sv.en
cut -f2 corpus.en-sv > corpus.en-sv.sv
cat corpus.en-sv.sv | apertium-destxt -i | apertium -f none -u swe-isl | apertium-retxt > corpus.en-is.is
paste corpus.en-sv.en corpus.en-is.is > corpus.en-is
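A minimal sketch of the concatenation mentioned in the first bullet above; the file names are just placeholders for whatever en-is corpora you actually downloaded from Opus:
cat opuscorpus1.en-is.txt opuscorpus2.en-is.txt opuscorpus3.en-is.txt > smallcorpus.en-is.txt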
To build the probabilistic dictionaries, you need a parallel corpus of several million sentences. First, tokenize and lowercase the corpus:
cat bigcorpus.en-is | cut -f1 > bigcorpus.en-is.en
cat bigcorpus.en-is | cut -f2 > bigcorpus.en-is.is
moses/tokenizer/tokenizer.perl -l en -no-escape < bigcorpus.en-is.en > bigcorpus.en-is.tok.en
moses/tokenizer/tokenizer.perl -l is -no-escape < bigcorpus.en-is.is > bigcorpus.en-is.tok.is
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.en > bigcorpus.en-is.tok.low.en
sed 's/[[:upper:]]*/\L&/g' < bigcorpus.en-is.tok.is > bigcorpus.en-is.tok.low.is
mv bigcorpus.en-is.tok.low.en bigcorpus.en-is.clean.en
mv bigcorpus.en-is.tok.low.is bigcorpus.en-is.clean.is
And then, build the probabilistic dictionaries:
mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir /your/working/directory --corpus bigcorpus.en-is.clean -e en -f is --mgiza -mgiza-cpus=16 --parallel --first-step 1 --last-step 4 --external-bin-dir /your/path/here/mgiza/mgizapp/bin/
Your probabilistic dictionaries should contain entries like these:
- lex.e2f:
Probability of an English word translating into a given Icelandic word. In this example, rediscover can be translated as enduruppgötva or verðskuldið with the same probability (0.5):
...
enduruppgötva rediscover 0.5000000
verðskuldið rediscover 0.5000000
...
- lex.f2e:
Probability of an Icelandic word translating into a given English word. In this example, rediscover can be the translation of enduruppgötva with a probability of 0.33, or the translation of verðskuldið with a probability of 0.12:
...
rediscover enduruppgötva 0.3333333
rediscover verðskuldið 0.1250000
...
At this point, you could just gzip your dictionaries:
gzip lex.e2f -c > dict-en.gz
gzip lex.f2e -c > dict-is.gz
and you'll have the two required probabilistic dictionaries, fully compatible with Bicleaner. However, we recommend pruning them to remove very uncommon entries (for example, those whose probability is more than 10 times lower than the highest one for that word):
python3.7 bicleaner/utils/dict_pruner.py lex.e2f dict-en.gz -n 10 -g
python3.7 bicleaner/utils/dict_pruner.py lex.f2e dict-is.gz -n 10 -g
Please note that both target and source words in probabilistic bilingual dictionaries must be single words.
If you have a super clean parallel corpus, containing around 100K parallel sentences, you can skip this part. If not, you can build a cleaner corpus from a not-so-clean parallel corpus by using Bifixer and the Bicleaner Hardrules.
First, apply Bifixer:
python3.7 bifixer/bifixer/bifixer.py --scol 1 --tcol 2 --ignore_duplicates corpus.en-is corpus.en-is.bifixed en is
Then, apply the hardrules:
python3.7 bicleaner/bicleaner/bicleaner_hardrules.py corpus.en-is.bifixed corpus.en-is.annotated -s en -t is --scol 1 --tcol 2 --annotated_output --disable_lm_filter
If any of your source or target languages is easily mistaken for other similar languages (for example, Norwegian and Danish, or Galician and Portuguese), you may need to use the --disable_lang_ident flag when running Hardrules. You can detect whether this is happening by running:
cat corpus.en-is.annotated | awk -F'\t' '{print $4}' | sort | uniq -c | sort -nr
If the counts for language-related annotations are high (c_different_language, c_reliable_long_language(right, targetlang) and/or c_reliable_long_language(left, sourcelang)), you are probably experiencing this issue (so you really want to use the --disable_lang_ident flag for training as well).
Once you have an annotated version of your corpus, you can keep the cleanest parallel sentences and use them as a training corpus (100K sentences is a good number):
cat corpus.en-is.annotated | grep "keep$" | shuf -n 100000 | cut -f1,2 > trainingcorpus.en-is
The most commonly used command (and the one you probably want to use) is the following:
python3.7 bicleaner/bicleaner_train.py \
trainingcorpus.en-is \
--treat_oovs --normalize_by_length \
-s en -t is \
-d dict-en.gz -D dict-is.gz \
-b 1000 -c en-is.classifier \
-g 50000 -w 50000 \
-m en-is.yaml \
--classifier_type random_forest \
--lm_training_file_sl lmtrain.en-is.en --lm_training_file_tl lmtrain.en-is.is \
--lm_file_sl model.en-is.en --lm_file_tl model.en-is.is
Remember to check all the available options in the README and choose those that are most useful for you.
Tip: If you plan to distribute your 'language pack', you'll want to modify en-is.yaml to use relative paths instead of absolute ones.
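A minimal, hedged sketch of that change, assuming all the paths written into the metadata file share a single training-directory prefix (replace /your/working/directory/ with your actual path; the exact keys in the yaml depend on your Bicleaner version):
sed -i 's|/your/working/directory/||g' en-is.yaml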
At this point, you probably want to try your freshly trained Bicleaner to clean an actual corpus. Just run:
python3.7 bicleaner/bicleaner/bicleaner_classifier_full.py testcorpus.en-is testcorpus.en-is.classified en-is.yaml --scol 1 --tcol 2
After running Bicleaner, you'll have a new file (testcorpus.en-is.classified) with the same content as the input file (testcorpus.en-is) plus an extra column. This new column contains the score given by the classifier to each pair of parallel sentences. If the score is 0, the sentence pair was discarded by the Hardrules filter or the language model. If the score is above 0, the pair made it to the classifier, and the closer the score is to 1, the better the pair. For most languages (and distributed language packs), we consider a sentence pair very likely to be good when its score is above 0.7.
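As an illustration (assuming your input corpus has exactly two columns, so the classifier score lands in column 3), you could keep only the pairs scoring 0.7 or higher like this:
awk -F'\t' '$3 >= 0.7' testcorpus.en-is.classified > testcorpus.en-is.filtered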
Bicleaner is a tool in Python that aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near to 1) or not (with a value near to 0). Sentence pairs considered very noisy are scored with 0.
Bicleaner works with Python 3.6+ and can be installed with pip:
python3.7 -m pip install bicleaner
You can also download it from GitHub:
git clone https://github.com/bitextor/bicleaner
cd bicleaner
python3.7 -m pip install -r requirements.txt
It also requires KenLM with support for 7-gram language models:
git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build
cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install
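An optional quick check that the KenLM Python module is now importable:
python3.7 -c "import kenlm"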
Mgiza is a word alignment tool that we use to build probabilistic dictionaries.
git clone https://github.com/moses-smt/mgiza.git
cd mgiza/mgizapp
cmake .
make
make install
export PATH=$PATH:/your/path/here/mgiza/mgizapp/bin
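An optional quick check that the MGIZA binaries are reachable after the export (assuming a standard mgizapp build, which produces a binary named mgiza):
which mgiza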
Moses is a statistical machine translation system. We use it for tokenization and (together with Mgiza) for probabilistic dictionary building.
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j32
Finally, copy the merge_alignment.py script next to the MGIZA binaries, since Moses' train-model.perl needs it when aligning with mgiza:
cp /your/path/here/mgiza/experimental/alignment-enabled/MGIZA/scripts/merge_alignment.py /your/path/here/mgiza/mgizapp/bin/
We use KenLM to build the character language models needed in Bicleaner.
git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build
cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/python/env/path/here/
make -j all install
tmxt is a tool that extracts plain text parallel corpora from TMX files.
git clone http://github.com/sortiz/tmxt
python3.7 -m pip install -r tmxt/requirements.txt
Two tools are available: tmxplore.py, which determines the language codes available inside a TMX file, and tmxt.py, which transforms the TMX into a tab-separated text file.
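As a hedged usage sketch (the exact output format may vary between versions), you can run tmxplore.py on a TMX file to list the language codes to pass to tmxt.py's --codelist option:
python3.7 tmxt/tmxplore.py smallcorpus.en-is.tmx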
Apertium is a platform for developing rule-based machine translation systems. It can be useful to translate to a given language when you do not have enough parallel text.
In Ubuntu and other Debian-like operating systems:
wget http://apertium.projectjj.com/apt/install-nightly.sh
sudo bash install-nightly.sh
sudo apt-get update
sudo apt-get install apertium-LANGUAGE-PAIR
(choose your appropriate apertium-LANGUAGE-PAIR from the list shown by apt search apertium)
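For instance, for the Swedish-Icelandic example used earlier, the package would presumably be named as follows (confirm the exact name with apt search apertium before installing):
sudo apt-get install apertium-swe-isl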
For other systems, please read Apertium documentation.
Bifixer is a tool that fixes bitexts and tags near-duplicates for removal. It's useful for fixing errors in your training corpus.
git clone https://github.com/bitextor/bifixer.git
python3.7 -m pip install -r bifixer/requirements.txt