This article outlines how Voicegain's modern deep-learning-based Speech-to-Text/ASR can be a simple and affordable alternative for businesses looking for a quick and easy replacement for their on-premise Nuance Recognizer. Nuance has announced that it is going to end support for Nuance Recognizer, its grammar-based ASR that uses the MRCP protocol, sometime in 2026 or 2027. So organizations that have a speech-enabled IVR as the front door to their contact center need to start planning now.
With the rise of Generative AI and highly accurate, low-latency speech-to-text models, the front door of the call center is poised for a major transformation. The infamous and highly frustrating IVR phone menu will be replaced by Conversational AI Voicebots, but this will likely happen over the next 3-5 years. As enterprises start to plan their migration from tree-based IVRs to an Agentic AI future, they want to do so on their own timelines; they do not want to be forced into it under the pressure of a vendor's EOL deadline.
In addition, the migration path proposed by Nuance is a multi-tenant cloud offering. While a cloud based ASR/Speech-to-Text engine is likely to make sense for most businesses, there are companies in regulated sectors that are prevented from sending their sensitive audio data to a multi-tenant cloud offering.
In addition to Nuance's EOL announcement for its on-premise ASR, Genesys, a major IVR platform vendor, has announced that its premise-based offerings - Genesys Engage and Genesys Connect - will reach EOL at around the same time as the Nuance ASR.
So businesses that want a modern Gen AI powered Voice Assistant but need to keep the IVR on-premise in their datacenter, or behind their firewall in a VPC, will need to start planning their strategy very soon.
At Voicegain, we offer enterprises in this situation a modern Voicebot platform that lets them remain on-premise or in their VPC. This Voicebot platform runs on modern Kubernetes clusters and leverages the latest NVIDIA GPUs.
Rewriting the IVR application logic to migrate from a tree-based IVR menu to a conversational Voice Assistant is a journey that requires investment and the allocation of resources. Hence a good first step is to simply replace the underlying Nuance ASR (and possibly the IVR platform too). This ensures that a company can migrate to a modern Gen-AI Voice Assistant on its own timeline.
Voicegain offers a modern, highly accurate deep-learning-based Speech-to-Text engine trained on hundreds of thousands of hours of telephone conversations. It is integrated into our native modern telephony stack. It can also talk over the MRCP protocol with VoiceXML-based IVR platforms, and it supports the traditional speech grammars (SRGS, JJSGF). Voicegain also supports a range of built-in grammars (like Zip code, Dates, etc.).
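For example, a minimal yes/no confirmation grammar in standard W3C SRGS (XML) form - the kind of grammar an existing IVR application already references over MRCP - looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal yes/no confirmation grammar in standard W3C SRGS form -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" mode="voice" root="confirm">
  <rule id="confirm" scope="public">
    <one-of>
      <item>yes</item>
      <item>yes please</item>
      <item>no</item>
      <item>no thanks</item>
    </one-of>
  </rule>
</grammar>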
As a result, it is a simple "drop-in" replacement for the Nuance Recognizer. There is no need to rewrite the current IVR application. Instead of pointing to the IP address of the Nuance server, the VoiceXML platform just needs to be reconfigured to point to the IP address of the Voicegain ASR server. This should take no more than a couple of minutes.
In addition to the Voicegain ASR/STT engine, we also offer a Telephony Bot API. This is a callback-style API that bundles our native IVR platform and ASR/STT engine and can be used to build Gen AI powered Voicebots. It integrates with leading LLMs - both cloud-based and open-source on-premise - to drive a natural language conversation with callers.
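To give a rough sense of the callback pattern, one conversation turn could be handled as sketched below. The request and response fields ("utterance", "prompt", "listen") are simplified placeholders rather than the actual Telephony Bot API schema; see our API documentation for the real request and response format.

# Schematic voicebot callback -- field names are simplified placeholders,
# not the actual Telephony Bot API schema.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_reply(caller_text: str) -> str:
    # Placeholder for a call to an LLM (cloud-based or on-premise)
    return "Thanks. Could you tell me a little more about that?"

@app.route("/voicebot-callback", methods=["POST"])
def voicebot_callback():
    event = request.get_json()
    caller_text = event.get("utterance", "")   # what the caller just said, per the ASR
    reply = generate_reply(caller_text)        # decide the next thing the bot says
    # Tell the platform what to say next and to keep listening for the caller's answer
    return jsonify({"prompt": reply, "listen": True})

if __name__ == "__main__":
    app.run(port=8080)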
If you would like to discuss your IVR migration journey, please email us at sales@voicegain.ai . At Voicegain, we have decades of experience in designing, building and launching conversational IVRs and Voice Assistants.
Here is also a link to more information. Please feel free to schedule a call directly with one of our Co-founders.
We are really excited to announce the launch of Zoom Meeting Assistant for Local Recordings. This is immediately available to all users of Voicegain Transcribe that have a Windows device. The Zoom Meeting Assistant can be installed on computers that have Windows 10 or Windows 11 as the OS.
What are local recordings? Zoom offers two ways to record a meeting: 1) Cloud Recording - Zoom users may save the recording of the meeting on Zoom's cloud. 2) Local Recording - the meeting recording is saved locally on the Zoom user's computer, in the default Zoom folder on the file system. Zoom processes the recording and makes it available in this folder a few minutes after the meeting is complete.
Below is a screenshot of how a Zoom user can initiate a local recording.
There are four big benefits of using Local Recordings
To use Voicegain Zoom Meeting Assistant, there are just two requirements:
1. Users should first sign up for a Voicegain Transcribe account. Voicegain offers a forever-free plan (up to 2 hours of transcription per month), and users can sign up using this link. You can learn more about Voicegain Transcribe here.
2. They should have a computer with Windows 10 or 11 as the OS.
This Windows app can be downloaded from the "Apps" page in Voicegain Transcribe. Once the app is installed, users will be able to access it from the Windows taskbar (system tray). All they need to do is log into Voicegain Transcribe from the Meeting Assistant by entering their Transcribe user ID and password.
Once the Meeting Assistant App is logged into Voicegain Transcribe, it does two things:
1. It continuously scans the Zoom folder for any new local recordings of meetings. As soon as it finds one, it uploads it to Voicegain Transcribe for transcription, summarization and extraction of Key Items (Actions, Issues, Sales Blockers, Questions, Risks, etc.). A simplified sketch of this folder-watching approach appears after this list.
2. It can also join any Zoom Meeting as the user's AI Assistant. This feature works whether the user is the host of the Zoom Meeting or just a participant. By joining the meeting, the Meeting Assistant is able to collect information on all the participants in the meeting.
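In simplified form, the folder-watching loop looks roughly like the sketch below. The upload endpoint and token are placeholders, and the real app also authenticates against Voicegain Transcribe and waits until Zoom has finished writing the recording file.

# Simplified illustration of the folder-watching loop. The upload endpoint and
# token are placeholders, not the real Voicegain Transcribe upload API.
import time
from pathlib import Path
import requests

ZOOM_DIR = Path.home() / "Documents" / "Zoom"   # default Zoom local-recording folder
UPLOAD_URL = "https://example.com/upload"        # placeholder endpoint
API_TOKEN = "YOUR_TOKEN"                         # placeholder credential

seen = set()
while True:
    for recording in ZOOM_DIR.glob("**/*.mp4"):
        if recording not in seen:
            seen.add(recording)
            with open(recording, "rb") as f:
                requests.post(UPLOAD_URL,
                              headers={"Authorization": f"Bearer {API_TOKEN}"},
                              files={"file": f})
    time.sleep(60)   # re-scan the Zoom folder once a minute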
While the current Meeting Assistant App works only for Windows users, Voicegain has native apps for Mac, Android and iPhone as part of its product roadmap.
Send us an email at support@voicegain.ai if you have any questions.
It has been another 6 months since we published our last speech recognition accuracy benchmark. Back then, the results were as follows (from most accurate to the least): Microsoft, then Amazon closely followed by Voicegain, then new Google latest_long and Google Enhanced last.
While the order has remained the same as in the last benchmark, three companies - Amazon, Voicegain and Microsoft - showed significant improvement.
Since the last benchmark, we at Voicegain invested in more training data - mainly lectures - conducted over Zoom and in live settings. Training on this type of data resulted in a further increase in the accuracy of our model. We are currently in the middle of a further round of training, this time focused on call center conversations.
As far as the other recognizers are concerned:
We have repeated the test using a similar methodology as before: we used 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and removed all files where none of the recognizers could achieve a Word Error Rate (WER) lower than 25%.
This time, again, only one file was that difficult: a bad-quality phone interview (Byron Smith Interview 111416 - YouTube) with a WER of 25.48%.
We publish this because we want any third party - whether an ASR vendor, developer or analyst - to be able to reproduce these results.
You can see box-plots with the results above. The chart also reports the average and median Word Error Rate (WER).
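For reference, WER is the word-level edit distance (substitutions, insertions and deletions) between a recognizer's output and the reference transcript, divided by the number of words in the reference. A minimal version of the computation (real benchmark scripts also normalize casing and punctuation before scoring) looks like this:

# Minimal Word Error Rate: word-level edit distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: wer("the cat sat on the mat", "the cat sat on a mat") == 1/6, i.e. about 16.7%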
Only 3 recognizers have improved in the last 6 months.
Detailed data from this benchmark indicates that Amazon is better than Voicegain on audio files with WER below the median and worse on audio files with WER above the median; otherwise, AWS and Voicegain are very closely matched. However, we have also run a client-specific benchmark where it was the other way around - Amazon was slightly better than Voicegain on audio files with WER above the median, while Voicegain was better on files with WER below the median. Net-net, it really depends on the type of audio files, but overall our results indicate that Voicegain is very close to AWS.
Let's look at the number of files on which each recognizer was the best one.
We have now run the same benchmark 5 times, so we can draw charts showing how each of the recognizers has improved over the last 2 years and 3 months. (Note that for Google the latest 2 results are from the latest_long model, while the earlier Google results are from the video enhanced model.)
You can clearly see that Voicegain and Amazon started quite a bit behind Google and Microsoft but have since caught up.
Google seems to have the longest development cycles, with very little improvement from Sept. 2021 until about half a year ago. Microsoft, on the other hand, releases an improved recognizer every 6 months. Our improved releases are even more frequent than that.
As you can see, the field is very close and you get different results on different files (the average and median do not paint the whole picture). As always, we invite you to review our apps, sign-up and test our accuracy with your data.
When selecting speech recognition/ASR software, there are other factors to consider beyond out-of-the-box recognition accuracy.
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
Enterprises are increasingly looking to mine the treasure trove of insights from voice conversations using AI. These conversations take place daily on video meeting platforms like Zoom, Google Meet and Microsoft Teams, and over telephony in the contact center (on CCaaS or on-premise contact center telephony platforms).
Voice AI or Conversational AI refers to converting the audio from these conversations into text using Speech recognition/ASR technology and mining the transcribed text for analytics and insights using NLU. In addition, AI can be used to detect sentiment, energy and emotion in both the audio and the text. The insights from NLU include the extraction of key items from meetings - for example, semantically matching phrases associated with action items, issues, sales blockers, agenda items, etc.
Over the last few years, the conversational AI space has seen many players launch highly successful products and scale their businesses. However, most of the popular Voice AI options available in the market are multi-tenant SaaS offerings, deployed on a large public cloud provider like Amazon, Google or Microsoft. At first glance, this makes sense. Most enterprise software apps that automate workflows in functional areas like Sales and Marketing (CRM), HR, Finance/Accounting or Customer Service are architected as multi-tenant SaaS offerings. The move to the cloud has been a secular trend for business applications, and Voice AI has followed this path.
However, at Voicegain we firmly believe that a different approach is required for a large segment of the market. We propose that an Edge architecture using a single-tenant model is the way to go for Voice AI apps.
By Edge, we mean the following:
1) The AI models for Speech Recognition/Speech-to-Text and NLU run on the customer's single-tenant infrastructure - whether bare-metal in a datacenter or in a dedicated VPC with a cloud provider.
2) The Conversational AI app - which is usually a browser-based application that uses these AI models - is also deployed completely behind the firewall.
We believe the advantages of an Edge/On-Prem architecture for Conversational/Voice AI are driven by the following four factors:
Very often, conversations in meetings and call centers are sensitive from a business perspective. Enterprise customers in many verticals (Financial Services, Healthcare, Defense, etc.) are not comfortable storing the recordings and transcripts of these conversations on the SaaS vendor's cloud infrastructure. Think about highly proprietary information like product strategy, the status of key deals, bugs and vulnerabilities in software, or a sensitive financial discussion before a public company releases its earnings. Many countries also impose strict data residency requirements from a legal/compliance standpoint. This makes the Edge (On-Premises/VPC) architecture very compelling.
Unlike pure workflow-based SaaS applications, Voice AI apps include deep-learning-based AI models - Speech-to-Text and NLU. To extract the right analytics, it is critical that these AI models - especially the acoustic models in the speech-recognition/speech-to-text engine - are trained on client-specific audio data. This is because each customer use case has unique audio characteristics that limit the accuracy of an out-of-the-box multi-tenant model. These unique audio characteristics relate to:
1. Industry jargon – acronyms, technical terms
2. Unique accents
3. Names of brands, products, and people
4. The acoustic environment and other properties of the audio.
However, most AI SaaS vendors today use a single model to serve all their customers. This results in sub-optimal speech recognition/transcription, which in turn results in sub-optimal NLU.
For real-time Voice AI apps - e.g., in the call center - there is an architectural advantage to having the AI models on the same LAN as the audio sources.
For many enterprises, SaaS Conversational AI apps are inexpensive to get started but they get very expensive at scale.
Voicegain offers an Edge deployment where both the core platform and a web app like Voicegain Transcribe can operate completely on our clients' infrastructure. Both can be placed "behind an enterprise firewall".
Most importantly, Voicegain offers a training toolkit and pipeline that lets customers build and train the custom acoustic models that power these Voice AI apps.
If you have any questions or would like to discuss this in more detail, please contact our support team over email (support@voicegain.ai).
As we announced here, Voicegain Transcribe is an AI based Meeting Assistant that you can take with you to all your work meetings. So irrespective of the meeting platform - Zoom, Microsoft Teams, Webex or Google Meet - Voicegain Transcribe has a way to support you.
We now have some exciting news for users who regularly host Zoom meetings. Voicegain Transcribe users who are on Windows now have a free, easy and convenient way to access all their meeting transcripts and notes from their Zoom meetings. Transcribe users can now download a new client app that we have developed - Voicegain Zoom Meeting Assistant for Local Recordings - onto their device.
With this client app, any Local Recording of a Zoom meeting (explained below) will be automatically submitted to Voicegain Transcribe. Voicegain's highly accurate AI models then process the recording to generate the transcript (Speech-to-Text) as well as the minutes of the meeting and the topics discussed (NLU).
As always, you can get started with a free plan that does not expire - so you can get going today without having to set up your payment information.
Zoom provides two options to record meetings on its platform - 1) Local Recording and 2) Cloud Recording.
Zoom Local Recording is a recording of the meeting that is saved on the hard disk of the user's device. There are two distinct benefits of using Zoom Local Recording:
Zoom Cloud Recording is when the recording of the meeting is stored on your Zoom Cloud account on Zoom's servers. Currently Voicegain does not directly integrate with Zoom Cloud Recording (however it is on our roadmap). In the interim, a user may download the Cloud Recording and upload it to Voicegain Transcribe in order to transcribe and analyze recordings saved in the cloud.
Zoom allows you to record individual speaker audio tracks separately as independent audio files. The screenshot above shows how to enable this feature on Zoom.
Voicegain Zoom Meeting Assistant for Local Recordings supports uploading these independent audio files to Voicegain Transcribe so that you can get accurate per-speaker transcripts.
The entire Voicegain platform including the Voicegain Transcribe App and the AI models can be deployed On-Premise (or in VPC) giving an enterprise a fully secure meeting transcription and analytics offering.
If you have any questions, please sign up today and contact our support team using the App.
Since June 2020, Voicegain has published benchmarks on the accuracy of its Speech-to-Text relative to big tech ASRs/Speech-to-Text engines like Amazon, Google, IBM and Microsoft.
The benchmark dataset for this comparison is a 3rd-party dataset published by an independent party; it includes a wide variety of audio data - audiobooks, YouTube videos, podcasts, phone conversations, Zoom meetings and more.
Here is a link to some of the benchmarks that we have published.
1. Link to June 2020 Accuracy Benchmark
2. Link to Sep 2020 Accuracy Benchmark
3. Link to June 2021 Accuracy Benchmark
4. Link to Oct 2021 Accuracy Benchmark
5. Link to June 2022 Accuracy Benchmark
Through this process, we have gained insights into what it takes to deliver high accuracy for a specific use case.
We are now introducing an industry-first relative Speech-to-Text accuracy benchmark for our clients. By "relative", we mean that Voicegain's accuracy (measured by Word Error Rate) is compared with that of the big tech player the client is evaluating us against. Voicegain will provide an SLA that its accuracy vis-à-vis this big tech player will be practically on par.
We follow a 4-step process to calculate the relative accuracy SLA:
In partnership with the client, Voicegain selects a benchmark audio dataset that is representative of the actual data the client will process. Usually this is a randomized selection of client audio. We also recommend that clients retain their own independent benchmark dataset, not shared with Voicegain, to validate our results.
Voicegain partners with industry-leading manual AI labeling companies to generate a 99%-accurate, human-generated transcript of this benchmark dataset. We refer to this as the golden reference.
On this benchmark dataset, Voicegain shall provide scripts that enable clients to run a Word Error Rate (WER) comparison between the Voicegain platform and any one of the industry leading ASR providers that the client is comparing us to.
Currently, Voicegain calculates the following two (2) KPIs (a sketch of how they can be computed from per-file WER numbers follows the list):
a. Median Word Error Rate: the median WER across all the audio files in the benchmark dataset, for both ASRs.
b. Fourth Quartile Word Error Rate: after ordering the audio files in the benchmark dataset by increasing WER with the Big Tech ASR, we compute and compare the average WER of the fourth quartile for both Voicegain and the Big Tech ASR.
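In outline (the exact scripts provided to clients may differ in details such as how quartile boundaries are handled), the two KPIs can be computed like this:

# Sketch of the two relative-accuracy KPIs, given per-file WER values
# (as fractions) for the same benchmark files from both engines.
from statistics import mean, median

def relative_accuracy_kpis(voicegain_wer, bigtech_wer):
    # KPI 1: median WER across all benchmark files, for each ASR
    kpi1 = {"voicegain": median(voicegain_wer), "bigtech": median(bigtech_wer)}
    # KPI 2: order files by the Big Tech ASR's WER (ascending), take the hardest
    # quartile (the 25% of files with the highest WER), and compare average WER
    order = sorted(range(len(bigtech_wer)), key=lambda i: bigtech_wer[i])
    q4 = order[-max(1, len(order) // 4):]
    kpi2 = {"voicegain": mean(voicegain_wer[i] for i in q4),
            "bigtech": mean(bigtech_wer[i] for i in q4)}
    return kpi1, kpi2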
We contractually guarantee that Voicegain's accuracy on the above 2 KPIs, relative to the other ASR, will be within a threshold that is acceptable to the client.
Voicegain measures this accuracy SLA twice in the first year of the contract and once annually from the second year onwards.
If Voicegain does not meet the terms of the relative accuracy SLA, we will train the underlying acoustic model until it does, and we will take on the expenses associated with labeling and training. Voicegain guarantees that it will meet the accuracy SLA within 90 days of the date of measurement.
The Twilio platform supports encrypted call recordings. Here is Twilio documentation on how to set up encryption for recordings on their platform.
Voicegain platform supports direct intake of encrypted recordings from the Twilio platform.
The overall diagram of how all of the components work together is as follows:
Below, we describe how to configure a setup that automatically submits encrypted recordings from Twilio to Voicegain for transcription as soon as those recordings are completed.
Voicegain requires a Private Key in PKCS#8 format to decrypt Twilio recordings. Twilio documentation describes how to generate a Private Key in that format.
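As one illustration (Twilio's documentation remains the authoritative reference), an RSA key pair with the private key in PKCS#8 format can be generated with Python's cryptography package:

# Generate an RSA key pair: the private key is exported in PKCS#8 format
# (to upload to the Voicegain Web Console) and the public key in PEM format
# (to register with Twilio for recording encryption).
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

with open("private_key_pkcs8.pem", "wb") as f:
    f.write(key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption()))

with open("public_key.pem", "wb") as f:
    f.write(key.public_key().public_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PublicFormat.SubjectPublicKeyInfo))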
Once you have the key, you need to upload it via Voicegain Web Console to the Context that you will be using for transcription. This can be done via Settings -> API Security -> Auth Configuration. You need to choose Type: Twilio Encrypted Recording.
We will be handling Twilio recording callbacks using an AWS Lambda function, but you can use an equivalent from a different Cloud platform or you can have your own service that handles https callbacks.
A sample AWS Lambda function in Python is available on Voicegain Github: platform/AWS-lambda-for-encrypted-recordings.py at master · voicegain/platform (github.com)
You will need to modify that function before it can be used.
First you need to enter the following parameters:
The Lambda function receives the callback from Twilio, parses the relevant info from it, and then submits a request to the Voicegain STT API for OFFLINE transcription. If you want, you can modify the body of the request that will be submitted to Voicegain in the Lambda function code. For example, the GitHub sample submits the results of transcription to be viewable in the Web Console (Portal), but you will likely want to change that so the results are delivered via a callback to your HTTPS endpoint (there is a comment indicating where the change would need to be made).
You can also make other changes to the body of the request as needed. For the complete spec of the Voicegain Transcribe API see here.
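Schematically, the change amounts to something like the snippet below; the field names here are simplified placeholders, so use the API spec above for the exact request schema.

# Schematic only -- the field names below are simplified placeholders;
# consult the Voicegain API spec linked above for the exact request schema.
recording_url = "<RecordingUrl parsed from the Twilio recording callback>"  # placeholder

transcribe_request = {
    "sessions": [{
        "asyncMode": "OFF-LINE",
        # deliver results to your own HTTPS endpoint instead of the Web Console
        "callback": {"uri": "https://your-service.example.com/transcripts"},
    }],
    "audio": {
        # the (encrypted) Twilio recording being submitted for transcription
        "source": {"fromUrl": {"url": recording_url}},
    },
}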
Here is simple Python code that can be used to make an outbound Twilio call that will be recorded and then submitted for transcription.
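In outline - the account SID, auth token, phone numbers, TwiML URL and recording callback URL below are placeholders that you would replace with your own values, with the recording callback pointing at the Lambda function configured above:

# Outline of an outbound Twilio call that is recorded; when the recording is
# ready, Twilio calls the Lambda function set up above, which submits the
# recording to Voicegain for transcription. All values are placeholders.
from twilio.rest import Client

client = Client("ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "your_auth_token")

call = client.calls.create(
    to="+15551234567",                     # the number to call
    from_="+15557654321",                  # your Twilio number
    url="https://example.com/voice.xml",   # TwiML that drives the call
    record=True,
    recording_status_callback="https://<your-lambda-endpoint>/recording-done",
    recording_status_callback_event=["completed"],
)
print(call.sid)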
Notice that:
Interested in customizing the ASR or deploying Voicegain on your infrastructure?