# Evaluation
To evaluate all implemented models, an email speech dataset was created from some of my mentors' emails. It consists of 100 emails, 90% of which were used as the training set and the remaining 10% as the test set. The evaluation metric is Word Accuracy.
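One common definition of Word Accuracy is `(N - S - D - I) / N`, where `N` is the number of reference words and `S`, `D`, `I` are the substitutions, deletions and insertions in a minimum edit-distance alignment of the reference against the ASR hypothesis. A minimal sketch of that computation (the example sentences are illustrative, not from the dataset):

```python
def align_counts(ref, hyp):
    """Return (subs, dels, ins) from a minimum edit-distance alignment
    of two token lists, via dynamic programming."""
    n, m = len(ref), len(hyp)
    # Each cell holds (total_cost, subs, dels, ins).
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # delete everything
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)          # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]                 # match, no cost
            else:
                c, s, d, ins = dp[i - 1][j - 1]
                cand = [(c + 1, s + 1, d, ins)]           # substitution
            c, s, d, ins = dp[i - 1][j]
            cand.append((c + 1, s, d + 1, ins))           # deletion
            c, s, d, ins = dp[i][j - 1]
            cand.append((c + 1, s, d, ins + 1))           # insertion
            dp[i][j] = min(cand)
    _, s, d, ins = dp[n][m]
    return s, d, ins

def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    s, d, ins = align_counts(ref, hyp)
    return (len(ref) - s - d - ins) / len(ref)

word_accuracy("good morning john", "good morning")  # → 2/3 (one deletion)
```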
In addition to the tools in the data/scripts directory, a tool for recording dictations was implemented. Its purpose is the fast recording of a speech dataset; its usage can be found here.
After applying general language model adaptation as described in Datasets and Adaptation, we obtain the following results:
Language model | Acoustic model | Accuracy
---|---|---
default | default | 69.07%
specific | default | 67.14%
merged | default | 79.79%
merged | adapted (mllr) | 80.32%
merged | adapted (map) | 76.98%
merged | adapted (mllr + map) | 76.63%
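The merged model above combines the default and the email-specific language models; a standard way to combine two n-gram models is linear interpolation, `P(w) = λ·P_specific(w) + (1 − λ)·P_default(w)`. A toy unigram sketch (the probabilities and λ are illustrative assumptions, not the project's actual values):

```python
def interpolate(p_specific, p_default, lam=0.5):
    """Linearly interpolate two language models given as word -> probability dicts."""
    words = set(p_specific) | set(p_default)
    return {w: lam * p_specific.get(w, 0.0) + (1 - lam) * p_default.get(w, 0.0)
            for w in words}

# Toy example: the domain-specific model boosts email vocabulary.
default_lm  = {"the": 0.6, "meeting": 0.1, "cat": 0.3}
specific_lm = {"the": 0.5, "meeting": 0.4, "regards": 0.1}
merged = interpolate(specific_lm, default_lm, lam=0.5)
# merged["meeting"] == 0.25, higher than in the default model alone
```

Since both inputs are proper distributions, the interpolated model still sums to 1.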
Various methods for clustering the emails have been implemented, as described here. To decide which of them to apply, we first evaluate some of them. The analysis follows:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | salutations | 8 | 100.0%
1 | other | 55 | 80.47%
2 | closings | 1 | 100.0%
- Elbow method: The chosen maximum number of clusters to test affects the result. Below are the sum-of-squared-error curves for 5, 8 and 10 maximum clusters; the selected number of clusters is at the 'knee' of each curve.
*(figures: SSE curves for 8 and 10 max clusters)*
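The elbow method can be sketched with scikit-learn: fit k-means for each candidate k, record the sum of squared errors (`KMeans.inertia_`), and pick the k where the curve bends. The toy embeddings below are an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy sentence embeddings: three well-separated groups in 2-D.
X = np.vstack([rng.normal(c, 0.05, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

max_clusters = 8  # upper bound on the number of clusters to test
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, max_clusters + 1)}
# The SSE drops sharply up to k=3 (the 'knee'), then flattens.
```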
- tfidf: Low accuracy and no semantic information.
- spacy: Good word representations; a minor issue is that sentence embeddings are the mean of their word embeddings.
- word2vec: Low accuracy.
  - cbow:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | closings | 17 | 86.79%
1 | other | 45 | 80.16%
2 | other | 2 | 70.37%
  - skipgram:
Cluster id | Cluster semantics | No. sentences | Accuracy
---|---|---|---
0 | other | 50 | 80.04%
1 | closings | 14 | 93.74%
- doc2vec: Not ready yet.
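In this project the cbow and skipgram vectors would be trained on the user's emails (e.g. with gensim's `Word2Vec`, where `sg=0` selects cbow and `sg=1` skipgram), while spacy ships pretrained vectors; in every case a sentence embedding is the mean of its word vectors. A minimal sketch with toy vectors (the vector values are illustrative assumptions):

```python
import numpy as np

# Toy word vectors; in the project these come from trained embeddings.
word_vectors = {
    "kind":    np.array([0.9, 0.1]),
    "regards": np.array([0.8, 0.2]),
    "meeting": np.array([0.1, 0.9]),
    "monday":  np.array([0.2, 0.8]),
}

def sentence_embedding(sentence, vectors):
    """Mean of the word vectors of the in-vocabulary words."""
    vecs = [vectors[w] for w in sentence.split() if w in vectors]
    return np.mean(vecs, axis=0)

closing = sentence_embedding("kind regards", word_vectors)    # ≈ [0.85, 0.15]
body    = sentence_embedding("meeting monday", word_vectors)  # ≈ [0.15, 0.85]
```

Averaging discards word order, which is the "minor issue" noted for spacy above.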
Here, cosine similarity performs better than Euclidean distance in all cases.
Based on the above analysis, we will use the maximum silhouette score to determine the number of clusters, and cosine similarity for cluster classification. Finally, spacy and skipgram word vectors yield the best accuracy; since the skipgram vectors are trained on the user's emails, we use them when the email corpus is fairly large (over 100 emails).
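Selecting the number of clusters by maximum silhouette, with cosine as the distance for scoring, can be sketched as follows (the toy embeddings are assumptions; k-means itself still clusters in Euclidean space):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy sentence embeddings pointing in two distinct directions
# (cosine distance separates by direction, not magnitude).
X = np.vstack([rng.normal([1.0, 0.1], 0.02, size=(25, 2)),
               rng.normal([0.1, 1.0], 0.02, size=(25, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="cosine")
best_k = max(scores, key=scores.get)  # the maximum is at k=2 here
```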
After domain-specific language model adaptation using the above techniques, the acoustic model should be adapted too. As the tables below show, MLLR adaptation performs better, since our acoustic model is continuous and its MAP adaptation requires over an hour of recordings.
- max silhouette with spacy embeddings:

Cluster id | No. sentences | Default | MAP | MLLR
---|---|---|---|---
0 | 1 | 100.0% | 66.67% | 100.0%
1 | 13 | 78.26% | 78.26% | 91.3%
2 | 50 | 80.85% | 79.37% | 81.4%
- max silhouette with skipgram-trained embeddings:

Cluster id | No. sentences | Default | MAP | MLLR
---|---|---|---|---
0 | 50 | 80.04% | 77.93% | 80.61%
1 | 14 | 93.75% | 91.67% | 93.75%
The adaptation works! The default acoustic and language models achieve 69.07% accuracy, with 7 insertions, 47 deletions and 122 substitutions. Using acoustic adaptation together with the domain-specific language model (max silhouette + spacy), the ASR reaches the accuracy shown above with only 4 insertions, 23 deletions and 76 substitutions.