This is a case study of training the acoustic model of a deep-learning-based Speech-to-Text/ASR engine for a voice bot that takes orders for Indian food.
The Problem
The client approached Voicegain after experiencing very low speech recognition accuracy with a specific telephony-based voice bot for food ordering.
The voice bot had to recognize Indian food dishes with acceptable accuracy, so that the dialog could be conducted in a natural conversational manner rather than falling back to rigid call flows, e.g., enumerating through a list.
The spoken responses would be provided by speakers of South Asian Indian origin. This meant that in addition to having to recognize unique dish names, the recognizer would have to cope with the accent.
The out-of-the-box accuracy of Voicegain and other prominent ASR engines was considered too low. Our accuracy was particularly low because our training datasets did not contain any examples of Indian dish names spoken with heavy Indian accents.
With the use of hints, the results improved significantly and we achieved an accuracy of over 30%. However, 30% was still far from good enough.
The Approach
Voicegain first collected relevant training data (audio and transcripts) and then trained the acoustic model of our deep-learning-based ASR. We have had good success with this approach in the past, in particular with our latest DNN architecture; see, e.g., our post about recognition of UK postcodes.
We used a third-party data generation service to initially collect over 11,000 samples of Indian food utterances, 75 utterances per participant. The quality varied widely, but we consider that a good thing: it reflected the quality of the audio that would be encountered in a real application. Later we collected an additional 4,600 samples.
We trained two models:
- A "balanced" model - where the Food Dish training data was combine with our complete training set to train the model.
- A "focused" model - there the Food Dish data was combined with just a small subset of our other training data set.
We also trained in two stages: first on the 10k+ set, after which we collected benchmark results, and then on the additional 4,600 samples.
We randomly selected 12 sets of 75 utterances (894 in total, after some bad recordings were removed) as the benchmark set and used the remaining 10k+ utterances for training. We plan to share a link to the test data set here in a few days.
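To make the evaluation setup concrete, here is a minimal sketch of that per-participant holdout split. The dataset index and the seed are illustrative; the actual selection code used for the benchmark has not been published.

```python
import random

# Hypothetical dataset index: participant id -> list of recorded clips
# (75 utterances per participant, as in the data collection above).
utterances_by_participant = {
    f"participant_{i:03d}": [f"participant_{i:03d}_utt_{j:02d}.wav" for j in range(75)]
    for i in range(150)
}

random.seed(0)  # illustrative; the real selection procedure was not published
benchmark_ids = set(random.sample(sorted(utterances_by_participant), 12))

benchmark = [u for p in benchmark_ids for u in utterances_by_participant[p]]
training = [u for p, utts in utterances_by_participant.items()
            if p not in benchmark_ids for u in utts]

print(len(benchmark), len(training))  # 900 benchmark clips, before bad recordings are removed
```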
The Results - A 75% improvement in accuracy!
We compared our accuracy against Google and Amazon AWS, both before and after training; the results are presented in the chart below. The accuracy reported here is the accuracy of recognizing the whole dish name correctly: if one word of several in a dish name was misrecognized, it was counted as a failure to recognize the dish name. The same rule applied if an extra word was recognized, except for additional words that can safely be ignored, e.g., "a", "the", etc. We also allowed for reasonable spelling variations that do not introduce ambiguity, e.g., "biriyani" was considered a match for "biryani".
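As an illustration of this scoring rule, here is a minimal sketch; the word lists and function names are examples, not the exact tables used in our benchmark script:

```python
# Hypothetical scoring helper illustrating the methodology above.
IGNORABLE = {"a", "an", "the"}               # extra words that do not count as errors
SPELLING_VARIANTS = {"biriyani": "biryani"}  # unambiguous spelling variants

def normalize(text):
    words = [SPELLING_VARIANTS.get(w, w) for w in text.lower().split()]
    return [w for w in words if w not in IGNORABLE]

def dish_name_match(expected, recognized):
    # The whole dish name must match; a single wrong, missing, or extra
    # (non-ignorable) word makes the utterance count as a failure.
    return normalize(expected) == normalize(recognized)

assert dish_name_match("chicken biryani", "the chicken biriyani")
assert not dish_name_match("paneer tikka masala", "paneer tika masala")
```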
Note that the tests of the Voicegain recognizer were run with the following audio encodings:
- PCMU 8kHz - telephony-quality audio
- L16 16kHz - closer to the audio quality you would expect from most WebRTC applications; it delivers better accuracy
Also, the AWS test was done in offline mode (which generally delivers better accuracy), while the Google and Voicegain tests were done in streaming (real-time) mode.
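For readers unfamiliar with these two encodings, here is how a telephony capture could be expanded from PCMU to L16 and upsampled. This is a sketch using the standard-library audioop module (deprecated since Python 3.11 and removed in 3.13), with a hypothetical input file:

```python
import audioop  # standard library up to Python 3.12; removed in 3.13

# PCMU (G.711 u-law) stores 8-bit companded samples at 8 kHz; L16 is plain
# 16-bit linear PCM. Upsampling to 16 kHz cannot restore the frequencies
# the telephony channel already discarded.
with open("utterance.ulaw", "rb") as f:  # hypothetical raw u-law capture
    pcmu_8k = f.read()

l16_8k = audioop.ulaw2lin(pcmu_8k, 2)                         # expand to 16-bit PCM
l16_16k, _ = audioop.ratecv(l16_8k, 2, 1, 8000, 16000, None)  # resample to 16 kHz
```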
We ran a similar set of tests with the use of hints (we did not include AWS because our test script did not support AWS hints at the time).
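Hints are phrases the recognizer is asked to boost during decoding. Since this post does not show the Voicegain API, here is a sketch using Google Cloud Speech-to-Text's speech contexts as an illustration; the dish names and file name are made up:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,  # PCMU telephony audio
    sample_rate_hertz=8000,
    language_code="en-IN",
    # "Hints": phrases whose hypotheses the recognizer boosts during decoding.
    speech_contexts=[speech.SpeechContext(
        phrases=["chicken biryani", "palak paneer", "masala dosa"]
    )],
)

with open("order.ulaw", "rb") as f:  # hypothetical recording
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```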
These results show that huge benefits can be achieved by targeted model training for speech recognition. For this domain, which was new to our model, training increased accuracy by over 75 percentage points (from 10.18% to 86.24%).
As you can see, after training we exceeded Google's Speech-to-Text accuracy by over 45 percentage points (86.24% vs. 40.38%) when no hints were used. With hints, we were better than Google STT by about 26 percentage points (87.58% vs. 61.30%).
We examined the cases where mistakes were still made; they fell into three broad categories:
- Recordings missing the end of the last word, because the stop-record button was pressed while the last word was still being spoken. The recorded part of the last word is generally recognized correctly, e.g., instead of "curry" we recognize "cu". (We plan to manually review the benchmark set, modify the expected values to match what was actually said, and then recompute the accuracy numbers.)
- Really bad quality recordings, where the volume of the audio is barely above the background noise level. In these cases we usually missed some words or parts of words. This also explains why hints do not help more here: there are no sufficiently good partial hypotheses for the hints to boost.
- Loud background speech. In these cases we usually recognized additional words beyond what was expected.
We think the first type of problem can be overcome by training on additional data, which is what we plan to do, hoping to eventually reach accuracy close to 85% (for L16 16kHz audio). The second type could potentially be handled by post-processing in the application logic if we return the dB values of the recognized words, as sketched below.
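A sketch of that idea, assuming a hypothetical per-word dB field in the recognition result (the actual Voicegain response schema is not shown in this post):

```python
# Hypothetical per-word recognition output with dB levels.
NOISE_FLOOR_DB = -50.0  # illustrative estimate of the background-noise level
MARGIN_DB = 6.0         # words this close to the floor are treated as unreliable

words = [
    {"word": "chicken", "db": -22.0},
    {"word": "biryani", "db": -47.5},  # barely above the noise floor
]

if any(w["db"] < NOISE_FLOOR_DB + MARGIN_DB for w in words):
    print("Audio too quiet to trust; re-prompt the caller.")
else:
    print(" ".join(w["word"] for w in words))
```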
Interested?
If your speech application also suffers from low accuracy, and hints or text-based language models are not working well enough, then acoustic model training could be the answer. Send us an email at info@voicegain.ai and we can discuss a project to show how a Voicegain-trained model can achieve the best accuracy in your domain.