# Evaluation
To evaluate all implemented models, an email speech dataset was created from some of my mentors' emails. It consists of 100 emails, 90% of which were used as the training set and the remaining 10% as the test set. The evaluation metric is Word Accuracy.
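One common definition of Word Accuracy is `(N - S - D - I) / N`, where `N` is the number of reference words and `S`, `D`, `I` are the substitutions, deletions and insertions in a minimum edit-distance alignment of the reference against the ASR hypothesis. A minimal sketch of that computation (the example sentences are illustrative, not from the dataset):

```python
def align_counts(ref, hyp):
    """Return (subs, dels, ins) from a minimum edit-distance alignment
    of two token lists, via dynamic programming."""
    n, m = len(ref), len(hyp)
    # Each cell holds (total_cost, subs, dels, ins).
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]                 # match, no cost
            else:
                c, s, d, ins = dp[i - 1][j - 1]
                cand = [(c + 1, s + 1, d, ins)]           # substitution
            c, s, d, ins = dp[i - 1][j]
            cand.append((c + 1, s, d + 1, ins))           # deletion
            c, s, d, ins = dp[i][j - 1]
            cand.append((c + 1, s, d, ins + 1))           # insertion
            dp[i][j] = min(cand)
    _, s, d, ins = dp[n][m]
    return s, d, ins

def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    s, d, ins = align_counts(ref, hyp)
    return (len(ref) - s - d - ins) / len(ref)

word_accuracy("good morning john", "good morning")  # → 2/3 (one deletion)
```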
In addition to the tools in the data/scripts directory, a tool for recording dictations was implemented. Its purpose is the fast recording of a speech dataset; its usage can be found here.
After applying general language model adaptation as described in Datasets and Adaptation, we obtain the following results:
Language model | Acoustic model | Accuracy
---|---|---
default | default | 69.07%
specific | default | 67.14%
merged | default | 79.79%
merged | adapted (mllr) | 80.32%
merged | adapted (map) | 76.98%
merged | adapted (mllr + map) | 76.63%
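The merged model above combines the default and the email-specific language models; a standard way to combine two n-gram models is linear interpolation, `P(w) = λ·P_specific(w) + (1 − λ)·P_default(w)`. A toy unigram sketch (the probabilities and λ are illustrative assumptions, not the project's actual values):

```python
def interpolate(p_specific, p_default, lam=0.5):
    """Linearly interpolate two language models given as word -> probability dicts."""
    words = set(p_specific) | set(p_default)
    return {w: lam * p_specific.get(w, 0.0) + (1 - lam) * p_default.get(w, 0.0)
            for w in words}

# Toy example: the domain-specific model boosts email vocabulary.
default_lm  = {"the": 0.6, "meeting": 0.1, "cat": 0.3}
specific_lm = {"the": 0.5, "meeting": 0.4, "regards": 0.1}
merged = interpolate(specific_lm, default_lm, lam=0.5)
# merged["meeting"] == 0.25, higher than in the default model alone
```

Since both inputs are proper distributions, the interpolated model still sums to 1.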
Various methods for clustering the emails have been implemented, as described here. To decide which of them to apply, we first evaluate some of them. The analysis follows:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | salutations | 8 | 100.0%
1 | other | 55 | 80.47%
2 | closings | 1 | 100.0%
- Elbow method: The chosen maximum number of clusters to test affects the result. Below are the sum-of-squared-error curves for 5, 8 and 10 maximum clusters; the selected number of clusters is at the 'knee' of each curve.
*(figures: SSE curves for 8 and 10 max clusters)*
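The elbow method can be sketched with scikit-learn: fit k-means for each candidate k, record the sum of squared errors (`KMeans.inertia_`), and pick the k where the curve bends. The toy embeddings below are an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy sentence embeddings: three well-separated groups in 2-D.
X = np.vstack([rng.normal(c, 0.05, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

max_clusters = 8  # upper bound on the number of clusters to test
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, max_clusters + 1)}
# The SSE drops sharply up to k=3 (the 'knee'), then flattens.
```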
- tfidf: Low accuracy and no semantic information.
- spacy: Good word representations; a minor issue is that sentence embeddings are the mean of their word embeddings.
- word2vec: Low accuracy.
  - cbow:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | closings | 17 | 86.79%
1 | other | 45 | 80.16%
2 | other | 2 | 70.37%
  - skipgram:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | other | 50 | 80.04%
1 | closings | 14 | 93.74%
- doc2vec: Not ready yet.
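In this project the cbow and skipgram vectors would be trained on the user's emails (e.g. with gensim's `Word2Vec`, where `sg=0` selects cbow and `sg=1` skipgram), while spacy ships pretrained vectors; in every case a sentence embedding is the mean of its word vectors. A minimal sketch with toy vectors (the vector values are illustrative assumptions):

```python
import numpy as np

# Toy word vectors; in the project these come from trained embeddings.
word_vectors = {
    "kind":    np.array([0.9, 0.1]),
    "regards": np.array([0.8, 0.2]),
    "meeting": np.array([0.1, 0.9]),
    "monday":  np.array([0.2, 0.8]),
}

def sentence_embedding(sentence, vectors):
    """Mean of the word vectors of the in-vocabulary words."""
    vecs = [vectors[w] for w in sentence.split() if w in vectors]
    return np.mean(vecs, axis=0)

closing = sentence_embedding("kind regards", word_vectors)    # ≈ [0.85, 0.15]
body    = sentence_embedding("meeting monday", word_vectors)  # ≈ [0.15, 0.85]
```

Averaging discards word order, which is the "minor issue" noted for spacy above.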
Here, cosine similarity performs better than Euclidean distance in all cases.
Based on the above analysis, we will use the maximum silhouette score to determine the number of clusters, and cosine similarity for cluster classification. Finally, spacy and skipgram word vectors yield the best accuracy; since the skipgram vectors are trained on the user's emails, we use them when the email corpus is fairly large (over 100 emails).
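Selecting the number of clusters by maximum silhouette, with cosine as the distance for scoring, can be sketched as follows (the toy embeddings are assumptions; k-means itself still clusters in Euclidean space):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy sentence embeddings pointing in two distinct directions
# (cosine distance separates by direction, not magnitude).
X = np.vstack([rng.normal([1.0, 0.1], 0.02, size=(25, 2)),
               rng.normal([0.1, 1.0], 0.02, size=(25, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="cosine")
best_k = max(scores, key=scores.get)  # the maximum is at k=2 here
```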
After domain-specific language model adaptation using the above techniques, the acoustic model should be adapted too. As the tables below show, MLLR adaptation performs better, since our acoustic model is continuous and its MAP adaptation requires over an hour of recordings.
- max silhouette with spacy embeddings:

Cluster id | No. sentences | Default | MAP | MLLR
---|---|---|---|---
0 | 1 | 100.0% | 66.67% | 100.0%
1 | 13 | 78.26% | 78.26% | 91.3%
2 | 50 | 80.85% | 79.37% | 81.4%
- max silhouette with skipgram-trained embeddings:

Cluster id | No. sentences | Default | MAP | MLLR
---|---|---|---|---
0 | 50 | 80.04% | 77.93% | 80.61%
1 | 14 | 93.75% | 91.67% | 93.75%
The adaptation works! The default acoustic and language models achieve 69.07% accuracy, with 7 insertions, 47 deletions and 122 substitutions. Using acoustic adaptation together with the domain-specific language model (max silhouette + spacy), the ASR reaches the accuracy shown above with only 4 insertions, 23 deletions and 76 substitutions.