This article outlines how the modern Voicegain deep-learning-based Speech-to-Text/ASR can be a simple and affordable alternative for businesses that are looking for a quick and easy replacement for their on-premise Nuance Recognizer. Nuance has announced that it is going to end support for Nuance Recognizer, its grammar-based ASR which uses the MRCP protocol, sometime in 2026 or 2027. Organizations that have a speech-enabled IVR as the front door to their contact center therefore need to start planning now.
With the rise of Generative AI and highly accurate low latency speech-to-text models, the front door of the call center is poised for major transformation. The infamous and highly frustrating IVR phone menu will be replaced by Conversational AI Voicebots; but this will likely happen over the next 3-5 years. As enterprises start to plan their migration journey from these tree-based IVRs to an Agentic AI future, they would like to do this on their timelines. In other words, they do not want to be forced to do this under the pressure of a deadline because of EOL of their vendor.
In addition, the migration path proposed by Nuance is a multi-tenant cloud offering. While a cloud based ASR/Speech-to-Text engine is likely to make sense for most businesses, there are companies in regulated sectors that are prevented from sending their sensitive audio data to a multi-tenant cloud offering.
Beyond Nuance's EOL announcement for its on-premise ASR, Genesys, a major IVR platform vendor, has announced that its premise-based offerings - Genesys Engage and Genesys Connect - will approach EOL at the same time as the Nuance ASR.
So businesses that want a modern Gen AI powered Voice Assistant but want to keep the IVR on-premise in their datacenter or behind their firewall in a VPC will need to start planning very quickly what their strategy is going to be.
At Voicegain, we offer enterprises in this situation a modern Voicebot platform that lets them remain on-premise or in their VPC. This Voicebot platform runs on modern Kubernetes clusters and leverages the latest NVIDIA GPUs.
Rewriting the IVR Application logic to migrate from a tree-based IVR menu to a conversational Voice Assistant is a journey. It would require investments and allocation of resources. Hence a good first step is to simply replace the underlying Nuance ASR (and possibly the IVR platform too). This will guarantee that a company can migrate to a modern Gen-AI Voice Assistant on its timelines.
Voicegain offers a modern, highly accurate deep-learning-based Speech-to-Text engine trained on hundreds of thousands of hours of telephone conversations. It is integrated into our native modern telephony stack. It can also talk over the MRCP protocol with VoiceXML-based IVR platforms, and it supports the traditional speech grammars (SRGS, JSGF). Voicegain also supports a range of built-in grammars (like Zipcode, Dates, etc.).
As a result, it is a simple "drop-in" replacement for the Nuance Recognizer. There is no need to rewrite the current IVR application. Instead of pointing to the IP address of the Nuance Server, the VoiceXML platform just needs to be reconfigured to point to the IP address of the Voicegain ASR server. This should take no more than a couple of minutes.
In addition to the Voicegain ASR/STT engine, we also offer a Telephony Bot API. This is a callback-style API that includes our native IVR platform and ASR/STT engine and can be used to build Gen AI powered Voicebots. It integrates with leading LLMs - both cloud and open-source premise-based - to drive a natural language conversation with the callers.
If you would like to discuss your IVR migration journey, please email us at . At Voicegain, we have decades of experience in designing, building and launching conversational IVRs and Voice Assistants.
Here is also a link to more information. Please feel free to schedule a call directly with one of our Co-founders.
[UPDATE - October 31st, 2021: Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]
That is the question that we are frequently asked by our potential customers. Often we answer "that depends" and we get a feeling that the other side thinks "must be really bad if they do not give a straight answer". However, "that depends" is really the right answer. Accuracy of automated speech recognition (ASR) depends on the audio in many ways and the effect is not small. Basically, accuracy can be all over the place depending on factors like:
Because accuracy or Word Error Rate (WER) questions are somewhat meaningless without specifying the type of speech audio, it is important to do testing when choosing a speech recognizer. As a test set, one would choose a set of audio files that accurately represents the spectrum of speech that the recognizer will encounter in the expected use cases. For each speech audio file in the set, one would obtain a gold/reference transcript that is 100% accurate. After that, things can be automated: transcribe each file on the recognizers being evaluated, compute WER against the reference for each of the generated transcripts, and collate the results. The combined results will present a clear picture of how the recognizers perform on the specific speech audio that we care about. If you are going to repeat this process often, e.g., to evaluate new candidates on the recognizer market, it is good to standardize the test set, basically creating a repeatable benchmark that can be referenced in the future.
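The automated part of that process can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular benchmark suite: the function names `word_error_rate` and `collate` are made up for this example, and the only text normalization is lowercasing and whitespace splitting. A real benchmark would normally also normalize punctuation, numbers, and spelling variants before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def collate(results: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Average the per-file WER for each recognizer.

    `results` maps a recognizer name to (reference, hypothesis) pairs,
    one pair per audio file in the test set.
    """
    return {
        name: sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)
        for name, pairs in results.items()
    }
```

Averaging per-file WER treats every file equally regardless of length. Summing errors and reference words across all files is an equally valid choice and gives longer files more weight.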
The benchmark results that we are presenting here are somewhat different than the use-case driven tests or benchmarks. Because we are building a general recognizer for an unspecified use case, we intentionally decided to use a very broad set of audio files. Rather than collecting the test files ourselves, we decided to use the data set described in "Which Automatic Transcription Service is the Most Accurate? — 2018" from September 2018 by Jason Kincaid. The article presents a comparison of Speech Recognizers from various companies using a set of 48 YouTube videos (taking 5 minutes of audio from each of the videos). By the time we decided to do a retest of Jason's benchmark, 4 videos were no longer accessible, so our benchmark presented here uses data from only 44 videos.
We compared the results presented by Jason to the results from the big 3 - Google, Amazon, and Microsoft - recognizers as of June 2020. Of course, we also included our Voicegain recognizer, because we wanted to see how we stacked up against those. All the tested recognizers use Deep Neural Networks. The Voicegain speech recognizer ran on the Google Cloud Platform using Nvidia T4 GPUs. All recognizers were run with default settings, and no hints or user language models were used.
It is important to mention that none of the benchmark files are included in the training set that Voicegain uses. Neither is other audio from the speakers from the benchmark files, nor the same content but spoken by other speakers.
Again, asking which recognizer is the best is not the right question, because it all depends on the actual speech audio the recognizer is used on. But the key results from testing on the 44 files are as follows:
Here are our thoughts and some details:
We welcome anyone to test our platform and see how it performs on speech audio types that matter for your use cases.
We have Open Sourced the key component of our benchmark suite, the transcribe_compare python utility. It is available here: under MIT license.
It is useful for automatic benchmarking but it can also output data to an html file which can be viewed in a web browser. We use it often this way to do a manual review of the transcription errors or differences in errors between two recognizers or recognizer versions.
If you are building an app that requires transcription, sign up today for a developer account and get $50 in free credits (~5000 minutes of platform use). You can check out our accuracy and test our APIs. Instructions to sign up for a developer account are provided here.
3. If you want to make Voicegain your own AI Transcription Assistant, click here. You can take Voicegain to meetings, webinars, talks, lectures and more.
We are still in the middle of an extensive data collection effort and the training is not over yet. We are seeing continuing improvement in our recognizer, with new, improved versions of the acoustic model deployed to production about twice a month. We will report updated benchmark results on our blog in a few months.
We have another blog post planned that is going to quantify the benefit one can expect from using additional user data to train the acoustic model used in the recognizer. We have selected a large data set with a very specific English accent that currently has higher WER. We will report on the impact on WER of training on such a data set. We will quantify the improvement based on the size of the data set and the duration of training.
Voicegain provides easy to use tools that allow users to build their own custom acoustic models. This upcoming post will provide a clear insight as to what improvements to expect and how much data is needed to make a difference in reducing WER.
If you have any questions regarding this article or our platform and recognizer you can contact us at
The video below shows an example of Voicegain Live Transcribe used to provide transcription for an event streamed over video.
Here are some details about this particular setup:
The current enterprise speech-to-text market can be divided into 3 distinct groups of players. Note that we are focusing here on speech-to-text platforms rather than complete end-user products (so we do not include consumer products like Dragon NaturallySpeaking, etc.)
We consider ourselves one of the new players, as we started working on our own DNN-based speech-to-text engine at the end of 2016. However, we have been working with old-style ASRs since 2006, and as a result we knew their limitations very well. That was what motivated us to develop an ASR of our own.
We are also very familiar with employing ASRs in real-world large volume applications so we know which features the users of ASRs want - be it developers who build the applications, or IT personnel that has to host and maintain them.
All of this guided us in decisions we made when developing our speech-to-text platform.
Below we list what we think are 4 key differentiators of our speech-to-text platform compared to competition. Note that the competitive field is pretty broad, and we consider a particular feature a differentiator if it is not a common feature in the market.
By Edge Deployment we mean a deployment on customer premises (datacenter) or in a VPC. Moreover, the deployment is fully orchestrated and managed from the Cloud (for more information see our blog post about Benefits of Edge Deployment). The orchestration and built-in management make it essentially different from the old ASRs, which were also deployed on-prem but required Support Contracts to deploy them successfully and to maintain them over time.
We think that Edge Deployment is critical for a speech-to-text platform which is to replace many of the old ASRs in their applications.
Over the years when working with ASRs we noticed that there were cases where the ASR would show consistently higher error rates. Usually, this was related to IVR calls coming from customers in regions of the country with distinct accents.
In some of our use cases so far, the ability to customize models has allowed us to reduce WER very significantly (e.g. from 8% WER to 3%).
We are currently working on a rigorous experiment where we are customizing our model to support Irish English. We plan to report in detail on the results in April.
The Voicegain speech-to-text platform was developed specifically with IVR use cases in mind. Currently the platform supports the following 3 IVR use cases, and we are working on adding conversational NLU later this year.
a) ASR with support for legacy IVR Standards
In order to make our speech-to-text engine an attractive solution for replacement of old ASRs, we implemented it to support legacy standards like MRCP and GRXML. That support is not a mere add-on, simply tacking a Web API onto the back of an MRCP server, but is more integral - our core speech-to-text engine directly interprets a superset of MRCP protocol commands.
We also support GRXML and JSGF grammars - via MRCP, in IVR callbacks, and over Web API.
When used with grammars, a big advantage of the Voicegain recognizer is that at its core it is a large vocabulary recognizer. Grammars are used to constrain the recognized utterances to facilitate semantic mapping, but the recognizer can also recognize Out-of-Grammar utterances, which opens new possibilities for IVR tuning.
b) Web-hook IVR Support (without VXML)
Flow-based IVR systems have traditionally been built using two approaches - (i) either having the dialog interactions interpreted on a VXML platform (VXML browser), or (ii) using webhooks invoking application logic running on standard web back-end platforms (examples of the latter are offerings of e.g. Twilio, Plivo, or Tropo).
Our platform supports webhook style IVRs. Incoming calls can be interfaced via standard telephony SIP/RTP, and the IVR dialog can be directed from any platform that implements web-hooks (e.g. Node.js, Django)
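To make the webhook style concrete, here is a minimal sketch of a back-end that receives a callback for each dialog turn and replies with the next step. Everything about the payload is an assumption made for illustration: the field names `recognized_text`, `action`, `say`, and `to`, the queue name, and the port are hypothetical and are not the actual Voicegain callback schema. The sketch uses only the Python standard library.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def next_action(event: dict) -> dict:
    """Pick the next dialog step from a callback event (hypothetical schema)."""
    text = (event.get("recognized_text") or "").lower()
    if not text:
        # No recognition result yet: open the dialog with a prompt.
        return {"action": "prompt", "say": "How can I help you today?"}
    if "agent" in text:
        return {"action": "transfer", "to": "agent-queue"}
    if "balance" in text:
        return {"action": "prompt", "say": "Let me look up your balance."}
    return {"action": "prompt", "say": "Sorry, I did not catch that."}


class IvrWebhook(BaseHTTPRequestHandler):
    """Accepts a JSON POST for each dialog turn and answers with JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(next_action(event)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the webhook listener; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), IvrWebhook).serve_forever()
```

Keeping the dialog decision in a plain function such as `next_action` means the same logic can be unit tested without telephony, or moved to a different web framework, without changes.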
c) Enabling IVRs that use chatbot back-end
Many companies have invested significant effort into building their own text-based chatbots rather than using products like Google Dialogflow. What the Voicegain platform provides is an easy way to deploy the existing chatbot logic on a telephony speech channel. This takes advantage of our platform's webhook IVR support and can feed real-time text (including multiple alternatives) to a chatbot platform. We also provide audio output either via TTS or prerecorded clips.
Because IVR has always been our focus, we built our Acoustic Models to support low latency real-time speech-to-text (both continuous large vocabulary and with context-free grammars). We also focused on convenient ways to stream audio into our speech-to-text platform, and to consume the generated transcript.
One of our products is Live Transcribe, which allows for real-time transcription (with just a few seconds of delay) that is then broadcast over websockets and can be consumed by the provided web clients. This opens up the possibility of live speaker transcription, with use cases that may include conferences, lectures, etc., making these events easier to follow for hearing-impaired audience members.
In this post we show in three steps what is needed to run your first transcription using the Voicegain API.
We assume that you have already signed up for a Voicegain account and logged into the portal.
The main reason to create a new Context is to establish a new authentication realm. Access to each Context can be separately controlled, so it is easy to disable access to a certain Context without affecting other Contexts.
Contexts are also used for specifying default ASR settings.
You can create a new Context from the Context Dash
Voicegain APIs use JWT (JSON Web Tokens) to identify and authenticate the account making the request. In order to make API requests you need to generate a JWT which can easily be done from the portal.
Below is the complete input and output from a curl command that submits a Web API request to the Voicegain Synchronous Speech-to-Text API
In this case, the audio to be transcribed was retrieved from a URL. Audio can alternatively also be submitted in-line (within request).
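For readers who prefer Python to curl, a request of this kind could be assembled roughly as follows. Treat this as a sketch under stated assumptions: the endpoint URL and the JSON body layout used here are illustrative and should be checked against the Voicegain API documentation. What the text does establish is that the request is authenticated with a JWT and that the audio can be referenced by URL.

```python
import json
import urllib.request

# Assumed endpoint, for illustration only; verify it in the API documentation.
API_URL = "https://api.voicegain.ai/v1/asr/transcribe"


def build_transcribe_request(jwt: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a synchronous transcription request."""
    # Assumed body layout: the audio is fetched by the service from a URL.
    payload = {"audio": {"source": {"fromUrl": {"url": audio_url}}}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",  # JWT generated in the portal
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` and the JSON response parsed with `json.load`. Building the request separately from sending it keeps the JWT handling easy to inspect and test.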
Note that synchronous transcription has an audio length limit of 60 seconds. Longer audio requires use of the asynchronous transcription API.
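A client can check the audio length up front and pick the API accordingly. The sketch below assumes the audio is a local WAV file and uses Python's standard `wave` module. The helper names are made up for this example, and treating exactly 60 seconds as still within the synchronous limit is an assumption.

```python
import wave

SYNC_LIMIT_SECONDS = 60.0  # synchronous API limit stated above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_api(duration_seconds: float) -> str:
    """Return "sync" for short audio and "async" for anything longer."""
    return "sync" if duration_seconds <= SYNC_LIMIT_SECONDS else "async"
```

For example, `choose_api(wav_duration_seconds("call.wav"))` selects the API for a hypothetical local file named `call.wav`.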
For asynchronous transcription requests it is possible to stream the audio, e.g. via websocket. You can see some of Voicegain API documentation at:
There is no denying that services available in the Cloud have significant benefits and are hence a popular choice. That is why the Voicegain Speech-to-Text Platform is available both in the Cloud and at the Edge. The key benefits of accessing Voicegain as a Cloud service are:
Before we discuss the benefits of Edge Deployment let's define what we mean by it.
Edge Computing for Speech-to-Text services has many advantages:
You may ask - what about the benefits of the Cloud, mentioned upfront? Do I get some of these with the Edge Deployment?
The answer is (qualified) "yes", and specifically:
Countryside Bible Church (CBC) has been using the Voicegain platform for real-time transcription since September 2018 (when our platform was still in alpha).
In August 2018 one of our employees was approached by staff at CBC with a question about software that would allow a deaf person to follow sermons live via transcription. One of the members at CBC is both hearing and vision impaired and cannot easily follow sign language; however, she can read large font on a computer screen from close by.
In August, Voicegain just started alpha tests of the platform, so his response was that indeed he knew such software and it was Voicegain. At that time, our testing was focusing on IVR use cases, so we still needed a few weeks to polish the transcription APIs and develop a web app that could consume the transcript stream (via websocket) and present it as scrolling text in a browser.
To improve recognition, we used about 200 hours of previously transcribed sermons from CBC to adapt our Acoustic DNN Model. Additionally, we created a specific CBC Language Model by adding a corpus of text from several Bible translations, various transcribed sermons, a list of CBC staff names, etc.
As far as the input audio is concerned, initially we were streaming audio using the standard RTP protocol from the ffmpeg tool. We had some issues with the reliability of raw RTP, so later we switched to a custom Java client that sends the audio using a proprietary protocol. The client runs as a daemon on a small Raspberry Pi device.
The CBC audio-visual team has been running real-time transcription using our platform since September 2018, pretty much every Sunday. You can see an example of the transcription in action in the video below
Current plans for the transcription service are to integrate it into the CBC website and to make it available together with streamed video. This will allow the hearing impaired to follow the services at home via streaming. For now, the transcription text will be presented as an embedded web page element under the embedded video.
Because the streamed video is delayed by more than 30 seconds relative to real time, we will be feeding the audio simultaneously to two ASR engines, one optimized for real-time response and one optimized for accuracy. This is easy, because the Voicegain Web API provides methods that allow for attaching two ASR sessions to a single audio stream. Each session can in turn feed its own websocket stream. By accessing the appropriate websocket stream, the web UI can display either the real-time or the delayed transcript.
Because of their Terms of Use, we cannot provide direct results for any of the major ASR engines, but you can download the audio linked below, as well as the corresponding exact Transcripts, and run comparison tests on a recognizer of your choice. Note that the Voicegain ASR ignores most of the duplicated words in the audio, which is why those duplicates have been removed from the transcript.
The audio is Copyright of Countryside Bible Church and transcripts are Copyright of Voicegain.
1. God's Plan for Human History (Part 2)
Tom Pennington | Daniel 2 | 2018-11-04 PM
55 minutes 13 seconds, 7475 words
Audio Transcript VoiceGain Output
Accuracy: 1.08% character error rate
Note: Voicegain output is formatted to match Transcript. Normally it also includes timing information. This specific output was obtained on 4/30/19 from real-time recognizer which has slightly lower accuracy compared to off-line recognizer.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?