This article outlines how the modern Voicegain deep-learning-based Speech-to-Text/ASR can be a simple and affordable alternative for businesses that are looking for a quick and easy replacement for their on-premise Nuance Recognizer. Nuance has announced that it is going to end support for Nuance Recognizer, its grammar-based ASR which uses the MRCP protocol, sometime in 2026 or 2027. So organizations that have a Speech-enabled IVR as their front door to the contact center need to start planning now.
With the rise of Generative AI and highly accurate low latency speech-to-text models, the front door of the call center is poised for major transformation. The infamous and highly frustrating IVR phone menu will be replaced by Conversational AI Voicebots; but this will likely happen over the next 3-5 years. As enterprises start to plan their migration journey from these tree-based IVRs to an Agentic AI future, they would like to do this on their timelines. In other words, they do not want to be forced to do this under the pressure of a deadline because of EOL of their vendor.
In addition, the migration path proposed by Nuance is a multi-tenant cloud offering. While a cloud based ASR/Speech-to-Text engine is likely to make sense for most businesses, there are companies in regulated sectors that are prevented from sending their sensitive audio data to a multi-tenant cloud offering.
In addition to the EOL announcement by Nuance for their on-premise ASR, a major IVR platform vendor like Genesys has also announced that its premise-based offerings - Genesys Engage and Genesys Connect - will also approach EOL at the same time as the Nuance ASR.
So businesses that want a modern Gen AI powered Voice Assistant but want to keep the IVR on-premise in their datacenter or behind their firewall in a VPC will need to start planning very quickly what their strategy is going to be.
At Voicegain, we enable enterprises that are in this situation to remain on-premise or in their VPC with a modern Voicebot platform. This Voicebot platform runs on modern Kubernetes clusters and leverages the latest NVIDIA GPUs.
Rewriting the IVR Application logic to migrate from a tree-based IVR menu to a conversational Voice Assistant is a journey. It would require investments and allocation of resources. Hence a good first step is to simply replace the underlying Nuance ASR (and possibly the IVR platform too). This will guarantee that a company can migrate to a modern Gen-AI Voice Assistant on its timelines.
Voicegain offers a modern, highly accurate deep-learning-based Speech-to-Text engine trained on hundreds of thousands of hours of telephone conversations. It is integrated into our native modern telephony stack. It can also talk over the MRCP protocol with VoiceXML-based IVR platforms, and it supports the traditional Speech grammars (SRGS, JSGF). Voicegain also supports a range of built-in grammars (such as Zipcode and Dates).
As a result, it is a simple "drop-in" replacement to the Nuance Recognizer. There is no need to rewrite the current IVR application. Instead of pointing to the IP address of the Nuance Server, the VoiceXML platform just needs to be reconfigured to point to the IP address of the Voicegain ASR server. This should take no more than a couple of minutes.
In addition to the Voicegain ASR/STT engine, we also offer a Telephony Bot API. This is a callback-style API that includes our native IVR platform and ASR/STT engine and can be used to build Gen AI powered Voicebots. It integrates with leading LLMs - both cloud and open-source premise-based - to drive a natural language conversation with the callers.
If you would like to discuss your IVR migration journey, please email us at sales@voicegain.ai . At Voicegain, we have decades of experience in designing, building and launching conversational IVRs and Voice Assistants.
Here is also a link to more information. Please feel free to schedule a call directly with one of our Co-founders.
We want to share a short video showing live transcription in action at CBC. This one is using our baseline Acoustic Model. No customizations were made, no hints used. This video gives an idea of what latency is achievable with real-time transcription.
Automated real-time transcription is a great solution for accommodating the hearing impaired if no sign-language interpreter is available. It can be used, e.g., at churches to transcribe sermons, at conventions and meetings to transcribe talks, at educational institutions (schools, universities) to live transcribe lessons and lectures, etc.
Voicegain Platform provides a complete stack to support live transcription:
Very high accuracy - above that provided by Google, Amazon, and Microsoft Cloud speech-to-text - can be achieved through Acoustic Model customization.
Voicegain adds grammar-based speech recognition to Twilio Programmable Voice platform via the Twilio Media Stream Feature.
The difference between Voicegain speech recognition and Twilio TwiML <Gather> is:
When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.
Each recognition will involve two main steps described below:
This is done by invoking Voicegain async recognition API: /asr/recognize/async
Below is an example of the payload needed to start a new recognition session:
Some notes about the content of the request:
This request, if successful, will return the websocket url in the audio.stream.websocketUrl field. This value will be used in making a TwiML request.
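As a rough illustration, this first step could be sketched in Python with only the standard library. The endpoint path /asr/recognize/async and the response field audio.stream.websocketUrl come from the description above. The base URL and the bearer-token header are placeholders, and the request payload is passed in as-is, because its exact field names should be taken from the Voicegain API reference rather than from this sketch.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the real Voicegain API base URL.
API_BASE = "https://api.example.com/v1"


def build_request(payload: dict, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the async recognition endpoint."""
    return urllib.request.Request(
        url=API_BASE + "/asr/recognize/async",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth is used here only for illustration.
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


def extract_websocket_url(response_body: dict) -> str:
    """Read audio.stream.websocketUrl from a parsed JSON response."""
    try:
        return response_body["audio"]["stream"]["websocketUrl"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no audio.stream.websocketUrl") from exc


def start_recognition(payload: dict, token: str) -> str:
    """Send the request and return the websocket URL to hand to Twilio."""
    with urllib.request.urlopen(build_request(payload, token)) as resp:
        return extract_websocket_url(json.load(resp))
```

The value returned by `start_recognition` is what the application would then place in the TwiML request sent to Twilio.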
Note, if the grammar is specified to recognize DTMF, the Voicegain recognizer will recognize DTMF signals included in the audio sent from Twilio Platform.
After we have initiated a Voicegain ASR session, we can tell Twilio to open Media Streams connection to Voicegain. This is done by means of the following TwiML request:
Some notes about the content of the TwiML request:
Below is an example response from the recognition. This response is from built-in phone grammar.
Some of the feedback that we received regarding the previously published benchmark data, see here and here, was concerning the fact that the Jason Kincaid data set contained some audio that produced terrible WER across all recognizers and in practice no one would use automated speech recognition on such files. That is true. In our opinion, there are very few use cases where WER worse than 20%, i.e. where on average 1 in every 5 words is recognized incorrectly, is acceptable.
What we have done for this blog post is we have removed from the reported set those benchmark files for which none of the recognizers tested could deliver WER of 20% or less. This criterion resulted in removal of 10 files - 9 from the Jason Kincaid set of 44 and 1 file from the rev.ai set of 20. The files removed fall into 3 categories:
As you can see, Voicegain and Amazon recognizers are very evenly matched, with average WER differing only by 0.02%. The same holds for the Google Enhanced and Microsoft recognizers, with the WER difference being only 0.04%. The WER of Google Standard is about twice that of the other recognizers.
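For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, insertions, and deletions) between the recognizer output and a reference transcript, divided by the number of words in the reference. The sketch below is a minimal version of that standard definition; the lowercasing and whitespace tokenization are simplifying assumptions and not necessarily the normalization used in the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five gives a WER of 20%.
wer = word_error_rate("please transfer me to billing",
                      "please transfer me to building")
```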
[UPDATE - October 31st, 2021: Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced. Our pricing is now 0.95 cents/minute]
[UPDATE: For results reported using slightly different methodology see our new blog post.]
This is a continuation of the blog post from June where we reported the previous speech-to-text accuracy results. We encourage you to read it first, as it sets up a context to better understand the significance of benchmarking for speech-to-text.
Apart from that background intro, the key differences from the previous post are:
Here are the results.
Less than 3 months have passed from the previous test, so it is not surprising to see no improvement on Google and Amazon recognizers.
Voicegain recognizer has now overtaken Amazon by a hair's breadth in average accuracy, although Amazon median accuracy on this data set is slightly above Voicegain.
Microsoft recognizer has improved during this time period - on the 44 benchmark files it is now on average better than Google Enhanced (in the chart we retained ordering from the June test). The single bad outlier in Google Enhanced results does not alone account for the better average WER of Microsoft on this data set.
Google Standard is still very bad and we will likely stop reporting on it in detail in our future comparisons.
The audio from the 20-file rev.ai test is not as challenging as some of the files in the 44-file benchmark set. Consequently the results are on average better but the ranking of the recognizers does not change.
As you can see in this chart, on this data set the Voicegain recognizer is marginally better than Amazon. It has lower WER on 13 out of 20 test files and it beats Amazon in the mean and median values. On this data set Google Enhanced beats Microsoft.
Finally, here are the combined results for all the 64 benchmark files we tested.
On the combined benchmark Voicegain beats Amazon both in average and median WER, although the median advantage is not as big as on the 20 file rev.ai set. [Note that as of 2/10/21 Voicegain WER is now 16.46|14.26]
What we would like to point out is that when comparing Google Enhanced to Microsoft, one wins if we compare the average WER while the other has a better median WER value. This highlights that the results vary a lot depending on what specific audio file is being compared.
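To make the mean-versus-median point concrete, here is a small sketch with made-up per-file WER values. The numbers are purely illustrative and are not taken from the benchmark; they only show how a single bad outlier can flip the ranking between the two statistics.

```python
from statistics import mean, median

# Illustrative per-file WER values in percent. NOT benchmark data.
wer_a = [8.0, 9.0, 10.0, 11.0, 40.0]    # good on most files, one bad outlier
wer_b = [12.0, 13.0, 14.0, 15.0, 16.0]  # steadier, slightly worse on typical files


def summarize(values):
    """Return the two summary statistics used in the comparison."""
    return {"mean": mean(values), "median": median(values)}


summary_a = summarize(wer_a)
summary_b = summarize(wer_b)

# Recognizer B wins on the mean, recognizer A wins on the median.
better_on_mean = "A" if summary_a["mean"] < summary_b["mean"] else "B"
better_on_median = "A" if summary_a["median"] < summary_b["median"] else "B"
```

Reporting both statistics, as done in this post, helps show whether a difference comes from typical files or from a few outliers.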
These results show that choosing the best recognizer for a given application should be done only after thorough testing. Performance of the recognizers varies a lot depending on the audio data and acoustic environment. Moreover, the prices vary significantly. We encourage you to try the Voicegain Speech-to-Text engine for your application. It might be a better fit for your application. Even if the accuracy is a couple of points behind the two top players, you might still want to consider Voicegain because:
Voicegain launched an extension to the Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy-to-use ASR to build Voice Bots/Speech-IVRs.
Update: Voicegain also announced that its large vocabulary transcription (/asr/transcribe API) integrates with Twilio Media Streams. Developers may use this to voice enable a chat bot developed on any bot platform or develop a real-time agent assist application.
Voicegain Twilio Media Streams support gives developers the following features:
TwiML <Stream> requires a websocket url. This url can be obtained by invoking Voicegain /asr/recognize/async API. When invoking this API the grammar to be used in the recognition has to be provided. The websocket URL will be returned in the response.
In addition to the wss url, Custom Parameters within <Connect><Stream> command are used to pass information about the question prompt to be played to the caller by Voicegain. This can be a text or a url to a service that will provide the audio.
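The TwiML for this step could be generated along the following lines. The `<Response>`, `<Connect>`, `<Stream>`, and `<Parameter>` elements are standard TwiML; the parameter name `prompt` and the example URL are hypothetical placeholders, since the exact Custom Parameter names expected by Voicegain are not listed here.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url: str, custom_params: dict) -> str:
    """Build a <Connect><Stream> TwiML document as a string."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    # Each Custom Parameter becomes a <Parameter name="..." value="..."/> child.
    for name, value in custom_params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")


twiml = build_connect_stream_twiml(
    "wss://example.invalid/session/123",        # placeholder for the returned wss url
    {"prompt": "What is your phone number?"},   # hypothetical parameter name
)
```

Building the document with an XML library rather than string concatenation ensures that prompt text containing characters such as `&` or quotes is escaped correctly.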
Once <Connect><Stream> has been invoked, the Voicegain platform takes over. It:
BTW, we also support DTMF input as an alternative to speech input.
[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]
1. On Premise Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports OnPrem/Edge deployment. Voicegain can be deployed as a containerized service on a single node Kubernetes cluster, or onto multi-node high-availability Kubernetes cluster (on your GPU hardware or your VPC).
2. Acoustic model customization: This makes it possible to achieve very high accuracy beyond what is possible with out-of-the-box recognizers. The grammar tuning and regression tool mentioned earlier can be used to collect training data for acoustic model customization.
On our near-term roadmap for Twilio users we have several more features:
You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at support@voicegain.ai.
Twilio, TwiML and Twilio Programmable Voice are registered trademarks of Twilio, Inc.
Businesses of all sizes are looking to develop Voicebots to automate customer service calls or voice based sales interactions. These bots may be voice versions of existing Chatbots, or exclusively voice based bots. While Chatbots automate routine transactions over the web, many users like the ability to use voice (app or phone) when it is convenient.
A voice bot dialog consists of multiple interactions where a single interaction typically involves 3 steps:
For the first step, developers use a Speech-to-Text platform to transcribe the spoken utterance into text. ASR or Automatic Speech Recognition is another term that is used to describe the same type of software.
When it comes to extracting intent from the customer utterance, they typically use an NLU engine. This is understandable because developers would like to re-use the dialog flow or conversation turns programmed in their Chatbot App for their Voicebot.
A second option is to use Speech Grammars which match the spoken utterance and assign meaning (intent) to it. This option is not in vogue these days but Speech Grammars have been successfully used in telephony IVR systems that supported speech interaction using ASR.
This article explores both approaches to building Voicebots.
Most developers today use the NLU approach as a default option for Steps 2 and 3. Popular NLU engines include Google Dialogflow, Microsoft LUIS, Amazon Lex and also, increasingly, an open source framework like RASA.
An NLU Engine helps developers configure different intents that match training phrases, specify input and output contexts that are associated with these intents, and define actions that drive the conversation turns. This method of development is very powerful and expressive. It allows you to build bots that are truly conversational. If you use NLU to build a Chatbot you can generally reuse its application logic for a Voicebot.
But it has a significant drawback. You need to hire highly skilled natural language developers. Designing new intents, handling input and output contexts, entities etc is not easy. Since you require skilled developers, the development of bots using NLU is expensive. It is not just expensive to build but it is costly to maintain too. For example, if you want to add new skills to the bot that are beyond its initial set of capabilities, modifying the contexts is not an easy process.
Net-net the NLU approach is a really good fit if (a) you want to develop a sophisticated bot that can support a truly conversational experience (b) you are able to hire and engage skilled NLP developers and (c) you have adequate budgets to develop such bots.
One approach that was used in the past and seems to have been forgotten these days is the use of Speech Grammars. Grammars have been used extensively to build traditional telephony-based speech IVRs for over 20 years now, but most NLP and web developers are not aware of them.
A Speech Grammar provides either a list of all utterances that can be recognized, or, more commonly, a set of rules that can generate the utterances that can be recognized. Such grammar combines two functions:
The second function is achieved by attaching tags to the rules in the grammars. Tag formats exist that support complex expressions to be evaluated for grammars that have many nested rules. These tags allow the developer to essentially code intent extraction right into the grammar.
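The idea of rules that both limit what can be recognized and attach meaning to it can be illustrated with a toy Python analogue. This is not SRGS or JSGF syntax, and real grammars are applied by the recognizer to audio rather than to text; the rule patterns and intent names below are invented for illustration only.

```python
import re

# Each rule pairs a pattern (what may be said) with a tag function
# (what it means). Both the patterns and the intent names are hypothetical.
RULES = [
    (
        re.compile(r"^(?:i want to |i'd like to )?(?:check|get) my balance$"),
        lambda m: {"intent": "check_balance"},
    ),
    (
        re.compile(
            r"^(?:please )?transfer (?P<amount>\d+) dollars "
            r"to (?P<account>savings|checking)$"
        ),
        lambda m: {
            "intent": "transfer",
            "amount": int(m.group("amount")),
            "account": m.group("account"),
        },
    ),
]


def interpret(utterance: str):
    """Return the tagged meaning of an in-grammar utterance, else None."""
    text = utterance.lower().strip()
    for pattern, tag in RULES:
        match = pattern.match(text)
        if match:
            return tag(match)
    return None  # utterance is outside the grammar


result = interpret("Transfer 50 dollars to savings")
```

Here `result` carries the intent and its slots directly, so the dialog logic in Step 3 needs no separate NLU call.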
Also, Step 3 - which is the dialog/conversation flow management - can be implemented in any backend programming language - Java, Python or Node.js. Developers of voice bots who are on a budget and are looking to build a simple bot with just a few intents should strongly consider grammars as an alternative approach to NLU.
Voicegain is one of the few Speech-to-Text or ASR engines that supports both approaches.
Developers can easily integrate Voicegain's large vocabulary speech-to-text (Transcribe API) with any popular NLU engine. One advantage that we have here is the ability to output multiple hypotheses when using the word-tree output mode. This allows multiple NLU intent matches to be done on the different speech hypotheses with the goal of determining if there is an NLU consensus in spite of differing speech-to-text output. This approach can deliver higher accuracy.
We also provide our Recognize API and RTC Callback APIs; both of these support speech grammars. Developers may code the application flow/dialog of the voicebot in any backend programming language - Java, Python, Node.js. We have extensive support for telephony protocols like SIP/RTP and we support WebRTC.
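A minimal sketch of the consensus idea follows. It assumes the alternative hypotheses have already been extracted from the word-tree output as plain strings, and it stands in a trivial keyword matcher for the NLU engine; the function names, the threshold, and the intent label are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def consensus_intent(
    hypotheses: Iterable[str],
    classify: Callable[[str], Optional[str]],
    min_share: float = 0.5,
) -> Optional[str]:
    """Return the intent matched by more than min_share of the hypotheses."""
    votes = Counter()
    total = 0
    for text in hypotheses:
        total += 1
        intent = classify(text)
        if intent is not None:
            votes[intent] += 1
    if not votes:
        return None
    intent, count = votes.most_common(1)[0]
    return intent if count / total > min_share else None


def toy_classify(text: str) -> Optional[str]:
    """Stand-in for a call to a real NLU engine (keyword match only)."""
    words = set(text.lower().split())
    return "pay_bill" if {"pay", "bill"} & words else None


# The three transcripts differ, yet they agree on the intent.
agreed = consensus_intent(
    ["pay my bill", "pay my pill", "play my bill"], toy_classify
)
```

In a real integration `classify` would wrap the NLU engine's intent-detection call, and a missing consensus could trigger a re-prompt instead of a guess.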
Most other STT engines - including Microsoft, Amazon and Google - do not support grammars. This may have something to do with the fact that they are also trying to promote their NLU engines for chatbot applications.
If you are building a Voicebot and you'd like to have a discussion on which approach suits you, do not hesitate to get in touch with us. You can email us at info@voicegain.ai.
Interested in customizing the ASR or deploying Voicegain on your infrastructure?