Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of Open AI's Whisper Speech recognition/ASR model that runs on Voicegain managed cloud infrastructure and accessible using Voicegain APIs. Developers can use the same well-documented robust APIs and infrastructure that processes over 60 Million minutes of audio every month for leading enterprises like Samsung, Aetna and other innovative startups like Level.AI, Onvisource and DataOrb.
The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developersa that are looking to integrate conversation transcripts with LLMs like GPT 3.5 and 4 (from Open AI) PaLM2 (from Google), Claude (from Anthropic), LLAMA 2 (Open Source from Meta), and their own private LLMs to power generative AI apps. Open AI open-sourced several versions of the Whisper models released. With today's release Voicegain supports Whisper-medium, Whisper-small and Whisper-base. Voicegain now supports transcription in over multiple languages that are supported by Whisper.
Here is a link to our product page
There are four main reasons for developers to use Voicegain Whisper over other offerings:
While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling and offline task and queue management. Today the same APIs are enabling Voicegain to processes over 60 Million minutes a month. We can bring this practical real-world experience of running AI models at scale to our developer community.
Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.
At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what Open AI offers.
Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio - which is common in contact center recording systems. Word-level timestamps is another important feature that our API offers which is needed to map audio to text. There is another feature that we have for the Voicegain models - enhanced diarization models - which is a required feature for contact center and meeting use-cases - will soon be made available on Whisper.
We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 millions minutes of audio every month for our enterprise and startup customers.
OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The architecture of the model is based on encoder-decoder transformers system and has shown significant performance improvement compared to previous models because it has been trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
Learn more about Voicegain Whisper by clicking here. Any developer - whether a one person startup or a large enterprise - can access Voicegain Whisper model by signing up for a free developer account. We offer 15,000 mins of free credits when you sign up today.
There are two ways to test Voicegain Whisper. They are outlined here. If you would like more information or if you have any questions, please drop us an email support@voicegain.ai
Some of the feedback that we received regarding the previously published benchmark data, see here and here, was concerning the fact that the Jason Kincaid data set contained some audio that produced terrible WER across all recognizers and in practice no one would user automated speech recognition on such files. That is true. In our opinion, there are very few use cases where WER worse than 20%, i.e. where on average 1 in every 5 words is recognized incorrectly, is acceptable.
What we have done for this blog post is we have removed from the reported set those benchmark files for which none on the recognizers tested could deliver WER 20% or less. This criterion resulted in removal of 10 files - 9 from the Jason Kincaid set of 44 and 1 file from the rev.ai set of 20. The files removed fall into 3 categories:
As you can see, Voicegain and Amazon recognizers are very evenly matched with average WER differing only by 0.02%, the same holds for Google Enhanced and Microsoft recognizer with the WER difference being only 0.04%. The WER of Google Standard is about twice of the other recognizers.
[UPDATE - October 31st, 2021: Current benchmark results from end October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced. Our pricing is now 0.95 cents/minute]
[UPDATE: For results reported using slightly different methodology see our new blog post.]
This is a continuation of the blog post from June where we reported the previous speech-to-text accuracy results. We encourage you to read it first, as it sets up a context to better understand the significance of benchmarking for speech-to-text.
Apart for that background intro, the key differences from the previous post are:
Here are the results.
Less than 3 months have passed from the previous test, so it is not surprising to see no improvement on Google and Amazon recognizers.
Voicegain recognizer has how overtaken Amazon by a hair breadth in average accuracy, although Amazon median accuracy on this data set is slightly above Voicegain.
Microsrosoft recognizer has improved during this time period - on the 44 benchmark files it is now on average better than Google Enhanced (in the chart we retained ordering from the June test). The single bad outlier in Google Enhanced results does alone not account for the better average WER on the Microsoft on this data set.
Google Standard is still very bad and we will likely stop reporting on it in detail in our future comparisons.
The audio from the 20-file rev.ai test is not as challenging as some of the files in the 44-file benchmark set. Consequently the results are on average better but the ranking of the recognizers does not change.
As you can see in this chart, on this data set the Voicegain recognizer is marginally better than Amazon in. It has lower WER on 13 out of 20 test files and it beats Amazon in the mean and median values. On this data set Google Enhanced beats Microsoft.
Finally, here are the combined results for all the 64 benchmark files we tested.
On the combined benchmark Voicegain beats Amazon both in average and median WER, although the median advantage is not as big as on the 20 file rev.ai set. [Note that as of 2/10/21 Voicegain WER is now 16.46|14.26]
What we would like to point out is that when comparing Google Enhanced to Microsoft, one wins if we compare the average WER while the other has a better median WER value. This highlights that the results vary a lot depending on what specific audio file is being compared.
These results show that choosing the best recognizer for a given application should be done only after thorough testing. Performance of the recognizers varies a lot depending on the audio data and acoustic environment. Moreover, the prices vary significantly. We encourage you to try the Voicegain Speech-to-Text engine for your application. It might be a better fit for your application. Even if the accuracy is a couple of points behind the two top players, you might still want to consider Voicegain because:
Voicegain launched an extension to Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy to use ASR to build Voice Bots /Speech-IVRs.
Update: Voicegain also announced that its large vocabulary transcription (/asr/transcribe API) integrates with Twilio Media Streams. Developers may use this to voice enable a chat bot developed on any bot platform or develop a real-time agent assist application.
Voicegain Twilio Media Streams support gives developers the following features:
TwiML <Stream> requires a websocket url. This url can be obtained by invoking Voicegain /asr/recognize/async API. When invoking this API the grammar to be used in the recognition has to be provided. The websocket URL will be returned in the response.
In addition to the wss url, Custom Parameters within <Connect><Stream> command are used to pass information about the question prompt to be played to the caller by Voicegain. This can be a text or a url to a service that will provide the audio.
Once <Connect><Stream> has been invoked, Voicegain platform takes over- it:
BTW, we also support DTMF input as an alternative to speech input.
[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]
1. On Premise Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports OnPrem/Edge deployment. Voicegain can be deployed as a containerized service on a single node Kubernetes cluster, or onto multi-node high-availability Kubernetes cluster (on your GPU hardware or your VPC).
2. Acoustic model customization: This allows to achieve very high accuracy beyond what is possible with out of the box recognizers. The grammar tuning and regression tool mentioned earlier, can be used to collect training data for acoustic model customization.
On our near-term roadmap for Twilio users we have several more features:
You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at support@voicegain.ai.
Twilio, TwiML and Twilio Programmable Voice are registered trademarks of Twilio, Inc
Businesses of all sizes are looking to develop Voicebots to automate customer service calls or voice based sales interactions. These bots may be voice versions of existing Chatbots, or exclusively voice based bots. While Chatbots automate routine transactions over the web, many users like the ability to use voice (app or phone) when it is convenient.
A voice bot dialog consists of multiple interactions where a single interaction typically involves 3 steps:
For the first step, developers use a Speech-to-Text platform to transcribe the spoken utterance into text. ASR or Automatic Speech Recognition is another term that is used to describe the same type of software.
When it comes to extracting intent from the customer utterance, they typically use an NLU engine. This is understandable because developers would like to re-use the dialog flow or conversation turns programmed in their Chatbot App for their Voicebot.
A second option is to use Speech Grammars which match the spoken utterance and assign meaning (intent) to it. This option is not in vogue these days but Speech Grammars have been successfully used in telephony IVR systems that supported speech interaction using ASR.
This article explores both approaches to building Voicebots.
Most developers today use the NLU approach as a default option for Steps 2 and 3. Popular NLU engines include Google Dialog Flow, Microsoft LUIS, Amazon Lex and also increasingly an open source framework like RASA.
An NLU Engine helps developers configure different intents that match training phrases, specify input and output contexts that are associated with these intents, and define actions that drive the conversation turns. This method of development is very powerful and expressive. It allows you to build bots that are truly conversational. If you use NLU to build a Chatbot you can generally reuse its application logic for a Voicebot.
But it has a significant drawback. You need to hire highly skilled natural language developers. Designing new intents, handling input and output contexts, entities etc is not easy. Since you require skilled developers, the development of bots using NLU is expensive. It is not just expensive to build but it is costly to maintain too. For example, if you want to add new skills to the bot that are beyond its initial set of capabilities, modifying the contexts is not an easy process.
Net-net the NLU approach is a really good fit if (a) you want to develop a sophisticated bot that can support a truly conversational experience (b) you are able to hire and engage skilled NLP developers and (c) you have adequate budgets to develop such bots.
One approach that was used in the past and seems to have been forgotten these days is the use of Speech Grammars. Grammars were used extensively to build traditional telephony based speech IVRs for over 20 years now, but most NLP and web developers are not aware of them.
A Speech Grammar provides either a list of all utterances that can be recognized, or, more commonly, a set of rules that can generate the utterances that can be recognized. Such grammar combines two functions:
The second function is achieved by attaching tags to the rules in the grammars. Tag formats exist that support complex expressions to be evaluated for grammars that have many nested rules. These tags allow the developer to essentially code intent extraction right into the grammar.
Also Step 3 - which is the dialog/conversation flow management - can be implemented in any backend programming language - Java, Python or Node.js. Developers of voice bots that are on a budget and are looking to building a simple bot with just a few intents should strongly consider grammars as an alternative approach to NLU.
Voicegain is one of the few Speech-to-Text or ASR engines that supports both approaches.
Developers can easily integrate Voicegain's large vocabulary speech-to-text (Transcribe API) with any popular NLU engine. One advantage that we have here is the ability to output multiple hypotheses - when using the word-tree output mode. This allows multiple NLU intent matches to be done of the different speech hypotheses with the goal of determining if the there is an NLU consensus in spite of differing speech-to-text output. This approach can deliver higher accuracy.
We also provide our Recognize API and RTC Callback APIs ; both of these support speech grammars. Developers may code the application flow/dialog of the voicebot in any backend programming language - Java, Python, Node.Js. We have extensive support for telephony protocols like SIP/RTP and we support WebRTC.
Most other STT engines - including Microsoft, Amazon and Google - do not support grammars. This may have something to do with the fact that they are also trying to promote their NLU engines for chatbot applications.
If you are building a Voicebot and you'd like to have a discussion on which approach suits you, do not hesitate to get in touch with us. You can email us at info@voicegain.ai.
Many applications of speech-to-text (STT) or speech recognition (ASR) require that the conversion from audio to text happen in realtime. These applications could be voice bots, live captioning of videos, events or talks, transcription of meetings, real time speech analytics of sales calls or agent assistance in a contact center.
An important question for developers looking to integrate real time STT into their apps is the choice of the protocol and/or mechanism to stream real time audio to the STT platform. While some STT vendors offer just one method; at Voicegain we offer multiple choices that developers could select from. In this post, we explore in detail all these methods so that a developer could choose the right one for their specific use case.
Some of the factors that may guide the specific choice are:
At Voicegain we currently offer seven different methods/protocols to support streaming to our STT platform. The first three are TCP based methods and the last four methods are UDP based.
Using WebSockets is a simple and popular option to stream audio to Voicegain for speech recognition. WebSockets have been around for a while and most web programming languages have libraries that support it. This option may be the easiest way to get started. Voicegain API is using binary WebSockets, and we have some simple examples to get you started.
Voicegain also supports streaming over HTTP 1.1 using chunked transfer encoding. This allows you to send raw audio data with unknown size, which is generally the case for streaming audio. Voicegain supports both pull and push scenarios - we can fetch the audio from a URL that you provide or the application can submit the audio to a URL that we provide. To use this method, your programming language should have libraries that support chunked transfer encoding over HTTP, some of the older or simpler HTTP libraries do not support it.
gRPC builds on top of HTTP/2 protocol which was designed to support long-running bi-directional connections. Moreover, gRPC uses Protocol buffers which are a more efficient data serialization format compared to JSON that is commonly used in RESTful HTTP APIs. Both these aspects of gRPC allow audio data to be efficiently sent over the same connection that is also used for sending commands and receiving results.
With gRPC, client side libraries can easily be generated for multiple languages, like Java, C#, C++, Go, Python, Node Js, etc. The generated client code contains stubs for use by gRPC clients to call the methods defined by the service.
Using gRPC, clients can invoke the Voicegain STT APIs like a local object whose methods expose the APIs. This method is a fast, efficient, and low-latency way to stream audio to Voicegain and receive recognition responses. The responses are sent over the same connection back from the server to client - this removes the need for polling or callbacks to get the results when using HTTP.
gRPC is great when used from the back-end code or from Android. It is not a plug and play solution when used from Web Browsers but requires some extra steps.
The first three methods described above are TCP based methods. They work great for audio streaming as long as the connection has no or minimal packet loss. Packet loss causes significant delays and jitter in the TCP connections. This may be fine if audio does not have to be processed truly real-time and can be buffered.
If real-time behavior is important and the network is known to be unreliable, the UDP protocol is a better alternative to TCP for audio streaming. With UDP, packet loss will manifest itself as audio dropouts, but that may be preferable to excessive pauses and jitter in case of TCP.
RTP is a standard protocol for audio streaming over UDP. However, RTP itself is is generally not sufficient and is normally used with accompanying RTP Control Protocol (RTCP). Voicegain has implemented its own variation of RTCP that can be used to control RTP audio streams sent to the recognizer.
Currently, the only way to to stream audio using RTP to Voicegain platform is to use our proprietary Audio Sender Java library. We also provide Audio Sender Daemon that is capable of reading data directly from audio devices and streaming it to Voicegain for real time transcription.
If you are looking to invoke Speech-to-text in a contact center, Voicegain offers Telephony Bot APIs. You can read more about them here. Essentially the Voicegain platform can act as a SIP endpoint and can be invited into a SIP session. We can do two things 1) As part of an IVR or Bot, play prompts and gather caller input 2) As part of a real-time agent assist, we can listen & transcribe the agent-caller interaction.
To elaborate on (1), with these APIs you can invite the Voicegain platform into a SIP session which provides Voicegain Speech-to-Text engine access to the audio. Once the audio stream gets established, you can issue commands to recognize call utterances and receive the recognition response using our web callbacks. You can write the logic of your application using any programming language or an NLU Engine of your choice - all that is needed is being able to handle HTTP requests and send responses.
Voicegain platform in this scenario essentially acts as a 'mouth' and an 'ear' to the entire conversation which happens over SIP/RTP. The application can issue JSON commands over HTTP that play prompts and convert caller speech into text through the entire duration of the call over a single session. You can also record the entire conversation if the call is transferred to a live agent and transcribe into text.
Contact center platform vendors like Cisco, Genesys, Avaya and FreeSWITCH based CCaaS platforms usually support MRCP to connect to Speech Recognition engines. Voicegain supports access over MRCP to both large vocabulary and grammar based speech recognition. We recommend MRCP only for Edge, Private Cloud or On-premise deployments
In Contact Centers, for real-time transcription of the agent caller interaction, Voicegain supports SIPREC. Further information is provided here.
1. Click here for instructions to access our live demo site.
2. If you are building a cool voice app and you are looking to test our APIs, click hereto sign up for a developer account and receive $50 in free credits
3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.
Update Dec 2020: We have renamed RTC Callback APIs to Telephony Bot APIs to better reflect how developers can use these APIs - which is build Voice Bots, IVRs.
If you have wanted to voice enable your Chatbot or build your own Telephony based Voice Bot or a Speech-enabled IVR, Voicegain has built an API that is really cool - Release 1.12.0 of Voicegain Speech-to-Text Platform now includes Telephony Bot APIs (formerly called RTC Callback APIs in the past).
Voicegain Telephony Bot APIs enables any NLU/Bot Framework to easily integrate with PSTN/telephony infrastructure using either (a) SIP INVITE of Voicegain platform from a CPaaS platform of your choice or (b) purchasing a phone number directly from Voicegain portal and pointing it to your Bot. You can then use these callback style APIs to (i) play prompts (ii) recognize speech utterances or DTMF digits (iii) allow for barge-in and several other exciting features. We offer sample code that will help you easily integrate a Bot Framework of your choice to our Telephony Bot APIs.
If you do not have a Bot Framework, thats okay too. You can write the logic in any backend programming language (Python, Java or Node.JS) that can serialize responses in a JSON format and interact with our Callback style APIs. Voicegain also offers a declarative YAML format to define the call flow and you can host this YAML file logic and interact with these APIs. Developers can also code and deploy the application logic in a server-less computing environment like Amazon Lambda.
Many enterprises - in banking, financial services , health care, telecom and retail - are stuck with legacy telephony based IVRs that are approaching obsolescence.
Voicegain's Telephony Bot APIs provide a great future-proof upgrade path for such enterprises. Since these APIs are based on web callbacks, they can interact with any backend programming language. So any backend web developer can design, build and maintain such apps.
With Telephony Bot APIs, integration becomes much simpler for developers.
1) You can SIP INVITE the Voicegain Speech-to-Text/ASR platform to a SIP/RTP session for as long as is needed. We support SIP integration with CPaaS platforms like Twilio, Signalwire and Telnyx. We also support CCaaS platforms like Genesys, Cisco and Avaya.
2) We also support direct phone number ordering and SIP Trunks from the Voicegain Web Console. More integrations will be added soon.
Telephony Bot APIs are based on web callbacks where the actual program/ implementation is on the Client side and the Voicegain Telephony Bot APIs define the Requests and Responses. The meaning of Requests and Responses is reversed w.r.t what you would see in a normal Web API:
Below is an example of a simple phone call interaction which is controlled by Telephony Bot API. The sequence diagram shows 4 callbacks during a toy survey call:
Telephony Bot API supports 4 types of actions:
Each call can be recorded (two channel recording) and then transcribed. The recording and the transcript can be accessed from the portal as well as via the API.
Features coming soon:
Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Donec sagittis sagittis ex, nec consequat sapien fermentum ut. Sed eget varius mauris. Etiam sed mi erat. Duis at porta metus, ac luctus neque.
Read more →Interested in customizing the ASR or deploying Voicegain on your infrastructure?