Datasets and Adaptation
The existing general acoustic and language models do not perform well, and training another general Greek model is difficult, since only very small Greek speech datasets are available. So, the success of the whole project depends on the personalization of the dictation system: the acoustic model will be adapted to the user's dictations and the language model will be enhanced by taking advantage of the user's sent emails. In order to verify that adaptation increases the accuracy of an ASR system, tests were run on the available datasets.
All datasets have been uploaded to Dropbox. Each one follows the structure below, based on the Sphinx requirements:
- train: Contains the ids, the recordings and the corresponding transcriptions of the train set (usually 70% of the dataset).
- test: Contains the ids, the recordings and the corresponding transcriptions of the test set (usually 30% of the dataset).
- hypothesis: Contains the hypothesis for the test set of each model.
- language-models: Contains all the language models that were created based on the train set.
- specific: Developed using only the transcriptions of the dataset.
- merged: Developed using both the transcriptions of the dataset and the default language model.
- adaptation: Contains all the files used for the acoustic model adaptation (both the MLLR and MAP methods).
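For example, a dataset folder might look like the following (file and folder names are illustrative, not taken from the actual archives):

```
dataset/
├── train/              # train.fileids, train.transcription, recordings (*.wav)
├── test/               # test.fileids, test.transcription, recordings (*.wav)
├── hypothesis/         # test.hyp files produced by each model
├── language-models/
│   ├── specific.lm
│   └── merged.lm
└── adaptation/         # files used for MLLR and MAP adaptation
```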
The adaptation of the default language model to domain-specific datasets follows the corresponding Sphinx tutorial and uses the SRILM toolkit.
For every dataset, we first create a train-text.txt file that contains only the Greek alphabetic characters of the transcriptions, i.e. we remove punctuation, non-alphabetic tokens and non-Greek words. This procedure is described extensively in the Email Fetching page. Then, a domain-specific language model is created using the following command:
ngram-count -kndiscount -interpolate -text train-text.txt -lm specific.lm
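For reference, the cleaning step described above boils down to keeping only Greek alphabetic tokens. A minimal sketch is shown below; it is not the script actually used in the project, and the file names are assumed:

```python
# clean_transcriptions.py -- illustrative sketch of the cleaning step,
# not the project's actual script; file names are assumed.
import re

# Lowercase Greek letters, including accented vowels and the final sigma
GREEK_WORD = re.compile(r'^[α-ωάέήίόύώϊϋΐΰ]+$')

with open('raw-transcriptions.txt', encoding='utf-8') as src, \
        open('train-text.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # Split on anything that is not a word character,
        # then keep only the tokens made of Greek letters
        tokens = re.split(r'\W+', line.lower())
        greek = [t for t in tokens if GREEK_WORD.match(t)]
        if greek:
            dst.write(' '.join(greek) + '\n')
```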
Although the domain-specific language model is adapted to our dataset, it will not perform well, since it is based on a relatively small number of words, so there are many out-of-vocabulary words. In order to resolve this, we merge the domain-specific language model with the default one, as follows:
ngram -lm el-gr.dic -mix-lm specific.lm -lambda 0.5 -write-lm merged.lm
where lambda controls the interpolation weight of the two models: conceptually, P_merged(w | h) = lambda · P_default(w | h) + (1 − lambda) · P_specific(w | h), so a value of 0.5 gives both models equal weight.
The adaptation of the default acoustic model to domain-specific datasets again follows the corresponding Sphinx tutorial. Useful tools were developed in order to prepare a speech dataset for adaptation.
Usage:
$ python converter.py -h
usage: converter.py [-h] --input INPUT [--output OUTPUT]

Tool for converting sound files in Sphinx format (mono wav files with 16kHz
sample rate)

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --input INPUT    Input directory

optional arguments:
  --output OUTPUT  Output directory (default: Input directory)
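Under the hood this is just a down-mixing and resampling step. A rough equivalent using ffmpeg through subprocess is sketched below; this is illustrative only, not the actual converter.py, and it assumes ffmpeg is installed with made-up directory names:

```python
# convert_sketch.py -- illustrative sketch of the conversion step, not converter.py.
# Assumes ffmpeg is available on the PATH; directory names are made up.
import subprocess
from pathlib import Path

input_dir = Path('recordings')
output_dir = Path('recordings-sphinx')
output_dir.mkdir(exist_ok=True)

for src in input_dir.glob('*.wav'):
    dst = output_dir / src.name
    # Sphinx expects mono, 16 kHz, 16-bit PCM wav files
    subprocess.run(
        ['ffmpeg', '-y', '-i', str(src),
         '-ac', '1', '-ar', '16000', '-acodec', 'pcm_s16le', str(dst)],
        check=True,
    )
```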
When adapting an acoustic model, some words from the transcriptions may not be included in the default phonetic dictionary. If we simply ignore them, the adaptation will be poor and these words will remain unknown to the system.
A tool was developed that first searches the transcriptions for words that are not in the dictionary. Then, Phonetisaurus is used to generate phonemes for the missing words and, finally, each (word, phonemes) pair is added to the default dictionary. In fact, AltFstAligner (an alternative to Phonetisaurus) was used for training the model, because it requires much less memory.
Usage:
$ python findOOD.py -h
usage: findOOD.py [-h] --dict DICT --input INPUT --output OUTPUT

Tool that finds out of dictionary words from a given transcription

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --dict DICT      Path of dictionary
  --input INPUT    Path of input transcription (should be in Sphinx format)
  --output OUTPUT  File to write the missing words
$ python addOOD.py -h
usage: addOOD.py [-h] --model MODEL --input INPUT --dict DICT

Tool that generates phonemes for out of dictionary words and adds them in the
dictionary

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --model MODEL    Phonetisaurus model
  --input INPUT    Path of missing words file
  --dict DICT      Path of the dictionary
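The logic behind findOOD.py is essentially a set difference between the transcription words and the dictionary entries. A minimal sketch is shown below; the file names are assumed and this is not the actual implementation:

```python
# find_ood_sketch.py -- illustrative sketch of the out-of-dictionary search,
# not the actual findOOD.py; file names are assumed.

# Each line of a Sphinx .dic file starts with the word, followed by its phonemes.
with open('el-gr.dic', encoding='utf-8') as f:
    known = {line.split()[0] for line in f if line.strip()}

missing = set()
with open('test.transcription', encoding='utf-8') as f:
    for line in f:
        # Sphinx transcriptions end with the utterance id in parentheses, e.g. "(test)"
        text = line.rsplit('(', 1)[0]
        missing.update(w for w in text.split() if w not in known)

with open('missing', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(missing)) + '\n')
```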
The model trained on the default dictionary can be found here. An example of the usage of the scripts follows:
If we have the following transcription file:
καλησπέρα με λένε γιώργο μπαλαμώτη (test)
The word μπαλαμώτη is not included in the default dictionary, since it is a surname. Let's generate the phonemes for this out-of-dictionary word:
$ python findOOD.py --dict ../../cmusphinx-el-gr-5.2/el-gr.dic --input transcription --output missing --print True
Searching for transcription: (test)
μπαλαμώτη
$ python addOOD.py --model ../../cmusphinx-el-gr-5.2/phonetisaurus/el-gr.o8.fst --input missing --dict ../../cmusphinx-el-gr-5.2/el-gr.dic
Generating phonemes...
Copy generated phonemes to given dictionary...
OK
$ tail -n 1 ../../cmusphinx-el-gr-5.2/el-gr.dic
μπαλαμώτη b a0 l a0 m o1 t i0
After decoding the sound files to text using the pocketsphinx_batch tool from pocketsphinx, we evaluate a model using the word_align.pl script, which compares the transcriptions of the test set with the hypothesis that the model gave, as follows:
word_align.pl test.transcription test.hyp
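word_align.pl aligns each hypothesis against its reference transcription and reports the word error rate; the accuracy figures below are essentially 100% minus the error rate. A minimal sketch of the same idea is shown below (illustrative only, not the actual script):

```python
# wer_sketch.py -- illustrative word-error computation, not word_align.pl.
def word_errors(reference, hypothesis):
    """Edit distance (substitutions + insertions + deletions) between two word
    sequences, computed with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1], len(ref)

errors, total = word_errors('καλησπέρα με λένε γιώργο', 'καλησπέρα με λένε γιώργος')
print('accuracy: {:.2%}'.format(1 - errors / total))  # -> accuracy: 75.00%
```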
Note: Two methods of acoustic model adaptation were tested. MAP adaptation updates each parameter of the model individually, so it can adapt more precisely when enough data is available, while MLLR adaptation estimates a single generic transform of the parameters, which works better with limited adaptation data.
Multiple Greek speakers from the Department of Journalism read the news, about 1 hour in total (medium size, different speakers).
Link: https://www.dropbox.com/sh/a8dkcgchb3cxgnc/AAA-7uxX8embvJWPOW-yQFTGa?dl=0
Language model | Acoustic model | Accuracy |
---|---|---|
default | default | 53.28% |
specific | default | 53.92% |
merged | default | 66.03% |
merged | adapted (mllr) | 67.91% |
merged | adapted (map) | 50.03% |
A female Greek speaker reads a fairy tale, about 4 hours in total (large size, single speaker).
Link: https://www.dropbox.com/sh/87e87d78ykw96zi/AABoh1oHDjJrhv4BoNiEPs8qa?dl=0
Language model | Acoustic model | Accuracy |
---|---|---|
default | default | 59.55% |
specific | default | 51.99% |
merged | default | 65.04% |
merged | adapted (mllr) | 66.53% |
merged | adapted (map) | 71.68% |
Recordings of Greek people asking questions about the weather, nearest hospitals, and pharmacies. It was created for the purposes of this diploma thesis (medium size, different speakers, very specific domain).
Link: https://www.dropbox.com/sh/t7uwom0hxp7cehb/AAC5EEB18DSm8qGLXFfobquWa?dl=0
Language model | Acoustic model | Accuracy |
---|---|---|
default | default | 73.11% |
specific | default | 80.06% |
merged | default | 83.08% |
merged | adapted (mllr) | 84.59% |
merged | adapted (map) | 90.63% |
Recordings of my voice dictating 15 emails (small size). This dataset is representative of the data that our system will have to adapt to, but it should be extended, because the test set contains only 4 sentences.
Link: https://www.dropbox.com/sh/oguos83j7938q39/AABEd0I9CkXKfV91NsxZuSTZa?dl=0
Language model | Acoustic model | Accuracy |
---|---|---|
default | default | 75.71% |
specific | default | 25.71% |
merged | default | 77.14% |
merged | adapted (mllr) | 77.14% |
merged | adapted (map) | 50.00% |
- The merged language model improves accuracy for all types of datasets, since it adapts to the domain-specific data while, at the same time, containing a large number of words.
- MLLR adaptation performs better when limited data are available or when the speakers differ (personal emails, radio). On the other hand, MAP adaptation can increase accuracy considerably (pda, paramythi_horis_onoma) when more dictations are available.