Our Blog

News, Insights, sample code & more!

ASR
Announcing the launch of Voicegain Whisper ASR/Speech Recognition API for Gen AI developers

Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that runs on Voicegain managed cloud infrastructure and is accessible via Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that process over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna, and for innovative startups like Level.AI, Onvisource and DataOrb.

The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to combine conversation transcripts with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), LLaMA 2 (open-sourced by Meta), or their own private LLMs to power generative AI apps. OpenAI has open-sourced several sizes of the Whisper model; with today's release Voicegain supports Whisper-medium, Whisper-small and Whisper-base. Voicegain now supports transcription in the multiple languages that Whisper supports.

Here is a link to our product page.


There are four main reasons for developers to use Voicegain Whisper over other offerings:

1. Support for Private Cloud/On-Premise deployment (integrate with Private LLMs)

While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-Text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month. We bring this practical, real-world experience of running AI models at scale to our developer community.

Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.

2. Affordable pricing - 40% less expensive than OpenAI

At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what OpenAI charges.

3. Enhanced features for Contact Centers & Meetings.

Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps are another important feature of our API, needed to map audio to text. Enhanced diarization, a feature we already provide for the Voicegain models and a requirement for contact center and meeting use cases, will soon be made available on Whisper as well.

4. Premium Support and uptime SLAs.

We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.

About the OpenAI Whisper Model

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model is based on an encoder-decoder transformer architecture and has shown significant performance improvements over previous models because it was trained on a variety of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

[Figure: OpenAI Whisper encoder-decoder transformer architecture]

Getting Started with Voicegain Whisper

Learn more about Voicegain Whisper by clicking here. Any developer - whether a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.

There are two ways to test Voicegain Whisper; they are outlined here. If you would like more information or have any questions, please drop us an email at support@voicegain.ai.
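
For a quick feel of the API, here is a minimal sketch of submitting a file for asynchronous batch transcription in Python. The request fields and the "whisper:medium" model id shown are illustrative assumptions, not the definitive schema - please consult the API documentation linked above for the exact request format:

import requests  # pip install requests

JWT = "<your-JWT-from-the-Voicegain-console>"
API = "https://api.voicegain.ai/v1"

# Submit an audio file (by URL) for asynchronous batch transcription.
# NOTE: the body fields and model id below are illustrative assumptions.
resp = requests.post(
    f"{API}/asr/transcribe/async",
    headers={"Authorization": f"Bearer {JWT}"},
    json={
        "audio": {"source": {"fromUrl": {"url": "https://example.com/call.wav"}}},
        "settings": {"asr": {"acousticModel": "whisper:medium"}},
    },
)
resp.raise_for_status()
print(resp.json())  # contains the session info to poll for the finished transcript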

Developers
Test Voicegain realtime Speech-to-Text from your browser

You can now test the accuracy of both our real-time and offline speech-to-text by visiting our demo page.

Read out paragraphs of your favorite book, give a speech that inspires, mimic your favorite actor, or just play a podcast or YouTube video!

Health check for the demo

  1. We currently support the Chrome and Edge browsers only.
  2. Please ensure that your CPU utilization is not too high (below 50%) and that your internet bandwidth is reasonable (10 Mbps in both directions).
  3. Ensure that your microphone is not being used by another program like Zoom, Teams, Skype or Webex.

If you are noticing delays in real-time transcription results, they are likely caused by resource issues on your computer.

Real-time Transcription

Simply click on the microphone icon to get started. You can either speak into your microphone or stream audio to it from your browser for a full minute.

You can also play back the audio to make sure that it was indeed streamed to us accurately.

Offline Transcription

Click on the upload recording icon to get started. You can upload a mono or stereo recording - WAV or FLAC - that is up to 15 MB in size. If you need to transcribe a larger file, please sign up for a free account.

Drop us an email (support@voicegain.ai) if you have any comments.

Benchmark
Speech-to-Text Accuracy Benchmark - June 2021

[UPDATE - October 31st, 2021: Current benchmark results from the end of October 2021 are available here. In the most recent benchmark Voicegain performs better than Google Enhanced.]

It has been over 8 months since we published our last speech recognition accuracy benchmark (described here). Back then the results were as follows (from most accurate to least): Microsoft, with Google Enhanced a close second, then Voicegain, with Amazon a close fourth, and then, far behind, Google Standard.

Methodology

We have repeated the test using the same methodology as before: take 44 files from the Jason Kincaid data set and 20 files published by rev.ai, and remove all files on which the best recognizer could not achieve a Word Error Rate (WER) lower than 20%. Last time we removed 10 files; this time, as the recognizers have improved, only 8 files had a WER higher than 20%.

The files removed fall into 3 categories:

  • recordings of meetings - 3 files (3 out of the 7 meeting recordings in the original set),
  • telephone conversations - 3 files (3 out of the 11 phone conversations in the original set),
  • multi-presenter, very animated podcasts - 2 files (many other podcasts in the set did meet the cutoff).
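
As a refresher, the WER numbers throughout this post are the word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal implementation, for readers who want to score recognizer output on their own files:

def wer(reference: str, hypothesis: str) -> float:
    # word-level edit distance (Levenshtein) divided by reference length
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the art of war by sun tzu", "the art of war by son two"))  # 2 errors / 7 words ≈ 0.286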

Some of our customers told us that they had previously used IBM Watson, so we decided to add it to the test as well.

Results

In the new test, as you can see in the results chart above, the order has changed: Amazon has leapfrogged everyone, improving its median WER by over 3% to just 10.02%, and is now in pole position. Microsoft, Google Enhanced and Google Standard performed at approximately the same level as in the previous benchmark. The Voicegain recognizer improved by about 2%. The newly tested IBM Watson is better than Google Standard but lags the other recognizers.

Voicegain is tied with Google Enhanced

The new results put the Voicegain recognizer very close to Google Enhanced:

  1. The average WER of Voicegain is just 0.66% behind Google, while the median WER is just 0.63% behind. To put it in context: Voicegain makes one additional mistake every 155 words compared to Google Enhanced.
  2. Voicegain was actually marginally better than Google Enhanced on the min error, 1st quartile, 3rd quartile, and max error.
  3. Overall, Voicegain was better on 20 files while Google was better on 36 files.

However, the results for any given use case depend on the specific audio: on some audio Voicegain will perform slightly better, and on some Google may perform marginally better. As always, we invite you to review our apps, sign up, and test our accuracy with your own data.

What about Open Source recognizers?

We have looked at both the Mozilla DeepSpeech and Kaldi projects. We ran our complete benchmark on Mozilla DeepSpeech and found that it trails significantly behind the Google Standard recognizer. Out of 64 audio files, Mozilla was better than Google Standard on only 5 files and tied on 1; it was worse on the remaining 58 files. The median WER was 15.63% worse for Mozilla than for Google Standard. Mozilla DeepSpeech's lowest WER, 9.66%, was on the LibriVox recording of "The Art of War" by Sun Tzu. For comparison, Voicegain achieves 3.45% WER on that file.

We have not yet benchmarked Kaldi, but from the research published online it looks like Kaldi trails Google Standard too, at least when used with its standard ASpIRE and LibriSpeech models.

Out-of-the-box accuracy is not everything

When you have to select speech recognition/ASR software, there are other factors beyond out-of-the-box recognition accuracy, for example:

  • Ability to customize the acoustic model - the Voicegain model may be trained on your audio data; we have demonstrated accuracy improvements of 7-10%. In fact, for one of our customers with adequate training data and good-quality audio we were able to achieve a WER of 0.5% (99.5% accuracy).
  • Ease of integration - many Speech-to-Text providers offer limited APIs, especially for developers building applications that require interfacing with telephony or on-premise contact center platforms.
  • Price - Voicegain is 60%-75% less expensive than other Speech-to-Text/ASR software providers while offering almost comparable accuracy. This makes it affordable to transcribe and analyze speech in large volumes.
  • Support for on-premise/Edge deployment - the cloud Speech-to-Text service providers offer limited support for deploying their speech-to-text software in client datacenters or on the private clouds of other providers. In contrast, Voicegain can be installed on any Kubernetes cluster - whether managed by a large cloud provider or by the client.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.


2. If you are building a cool voice app and want to test our APIs, click here to sign up for a developer account and receive $50 in free credits.


3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

Languages
Voicegain offers Automatic Speech Recognition in German

We are pleased to announce the availability of German speech recognition on the Voicegain platform. It is the third language that Voicegain supports, after English and Spanish.

The recognition accuracy of the German model depends on the type of speech audio. In general, we are only a few percent behind the accuracy offered by the Speech-to-Text engines from Amazon or Google. The advantages of our recognizer are the significantly lower price and the ability to train custom acoustic models. Custom models can achieve higher accuracy than Amazon or Google. We encourage you to use our Web Console and/or API to test the real-world performance on your own data.

Of course, the Voicegain platform also offers other advantages, such as support for Edge (on-prem) deployment and an extensive API with many options for out-of-the-box integration with, for example, telephony environments.

Currently, our Speech-to-Text API is fully functional with the German model. Some Speech Analytics API features are not yet available for German, e.g. Named Entity Recognition and Sentiment/Mood Detection.

The German model is initially only available in the version that supports offline transcription. The real-time version of the model will be available in the near future.

To tell the API that you want to use the German acoustic model, you just need to select it in the Context settings. German models have 'de' in their name, e.g. VoiceGain-ol-de:1.

If you would like to use German speech recognition, please send us an email at support@voicegain.ai and we will enable it for your account. If your application requires the real-time model, please let us know as well.

Languages
Voicegain offers German Speech-to-Text

We are pleased to announce the availability of German Speech-to-Text on the Voicegain Platform. It is the third language that Voicegain supports, after English and Spanish.

The recognition accuracy of the German model depends on the type of speech audio. Generally, we are just a few percent behind the accuracy offered by the Speech-to-Text engines of the larger players (Amazon, Google, etc.). The advantages of our recognizer are its affordability, the ability to train customized acoustic models, and the option to deploy in your datacenter or VPC. Custom models can have higher accuracy than Amazon's or Google's. We also offer extensive support for integrating with telephony.

We encourage you to sign up for a developer account and use our Web Console and/or our APIs to test the real-life performance on your own data.

Currently, our Speech-to-Text API supports the German model for offline transcription. A real-time/streaming version of the model will be available in the near future.

To use the German Acoustic Model in Voicegain Web Console, select "de" under Languages in the Speech Recognition settings.
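
If you are calling the Transcribe API directly rather than using the Web Console, the acoustic model is selected in the request settings. The fragment below is an illustrative sketch - the field names are assumptions (see the API docs for the exact schema), and VoiceGain-ol-de:1 is the offline German model named in the announcement above:

{
  "settings": {
    "asr": {
      "acousticModel": "VoiceGain-ol-de:1"
    }
  }
}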

Developers
Access Voicegain ASR from FreeSWITCH using mod_unimrcp

The Voicegain STT platform has supported MRCP (Media Resource Control Protocol) for a long time now. Our ASR can be accessed using MRCP, and we support both grammar-based recognition (e.g., GRXML) and large-vocabulary transcription. MRCP is a communication protocol designed to connect telephony-based IVRs and Voice Bots with speech recognizers (ASR) and speech synthesizers (TTS).

Previously we had tested connecting to Voicegain over MRCP from VXML platforms like Dialogic PowerMedia XMS and Aspect Prophecy, but not from FreeSWITCH, a popular open-source telephony platform, using its MRCP plugin mod_unimrcp.

We are pleased to announce that the Voicegain platform works out-of-the-box with mod_unimrcp, the MRCP plugin for FreeSWITCH. However, getting the mod_unimrcp plugin to work on FreeSWITCH is not trivial. Here are some pointers to help those who would like to use mod_unimrcp with our platform.


Deploying Voicegain unimrcp server

There are currently two options to do this; we plan to add a third option very soon:

  1. For production deployments of speech IVRs and Voice Bots on FreeSWITCH, we recommend an Edge Deployment of the Voicegain platform. This deploys our unimrcp server, which can communicate with a locally deployed FreeSWITCH using MRCP.
  2. To use our Cloud ASR, you will need to download an MRCP IVR Proxy from the Voicegain Web Console. The download is a tar file containing a Docker Compose definition that you can run on your Docker server. It deploys our preconfigured unimrcp server with a proxy that connects to the Voicegain Cloud Speech-to-Text engine.
  3. (Coming soon) We plan to implement a voicegain_asr plugin that can be deployed on a standard unimrcp server. The plugin will talk to our ASR in the cloud using gRPC.

Also, the current TTS options accessible over MRCP are not great; our focus has been on the use of prerecorded prompts for IVRs and Voice Bots. We plan to shortly allow developers to access Google or Amazon TTS.


Configuring FreeSWITCH for mod_unimrcp

mod_unimrcp is not built by default when you build FreeSWITCH from source. To get it built, enable it in build/modules.conf.in by uncommenting this line: #asr_tts/mod_unimrcp


After the build, before starting FreeSWITCH you will need to:

  • Add <load module="mod_unimrcp"/> to autoload_configs/modules.conf.xml (you can put it in the <!-- ASR /TTS --> section, because that is where it logically belongs)
  • Create an mrcp_profile for Voicegain (see below)
  • Modify the content of autoload_configs/unimrcp.conf.xml. If you want to use both ASR and TTS via Voicegain MRCP, point both default-asr-profile and default-tts-profile to the voicegain1-mrcp2 profile you will create in the mrcp_profiles folder.

Here is an example MRCP v2 profile for connecting to Voicegain MRCP:
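
(The profile below is a representative sketch to go in the mrcp_profiles folder, e.g. as voicegain1-mrcp2.xml; the server IP, ports, and codec list are placeholders - use the values for your Voicegain deployment.)

<include>
  <!-- sketch of an MRCPv2 client profile; server-ip/server-port are placeholders -->
  <profile name="voicegain1-mrcp2" version="2">
    <param name="client-ip" value="auto"/>
    <param name="client-port" value="5090"/>
    <param name="server-ip" value="10.0.0.100"/>
    <param name="server-port" value="5060"/>
    <param name="sip-transport" value="udp"/>
    <param name="rtp-ip" value="auto"/>
    <param name="rtp-port-min" value="4000"/>
    <param name="rtp-port-max" value="5000"/>
    <param name="codecs" value="PCMU PCMA L16/96/8000"/>
  </profile>
</include>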

Here are some additional notes about the configuration file:

  • It is important that the port range used by the unimrcp client - <param name="rtp-port-min" value="4000"/> and <param name="rtp-port-max" value="5000"/> - is accessible from outside; otherwise, TTS via MRCP will not work. Also, these ports may not overlap with the UDP ports used by FreeSWITCH.
  • In some setups the "auto" values in <param name="client-ip" value="auto"/> and <param name="rtp-ip" value="auto"/> may not work, and you will have to manually specify the external IP.

How to use mod_unimrcp

Here is an example of how to play a question prompt and invoke the ASR via mod_unimrcp to recognize a spoken phone number:


session:execute("set", "tts_engine=unimrcp:voicegain1-mrcp2");
session:execute("set", "tts_voice=Catherine");
session:execute("play_and_detect_speech", 
"say:What is your phone number detect:unimrcp {start-input-timers=false,define-grammar=true,no-input-timeout=5000}builtin:grammar/phone")

asrResult = session:getVariable("detect_speech_result");


What this example does is:

  • tells FreeSWITCH which tts_engine to use
  • sets the TTS voice - currently ignored
  • plays a question prompt using the specified TTS and launches the recognition
  • retrieves the result of the speech recognition

The result of the recognition is a string in XML format (NLSML). You will need to parse it to get the utterance and any semantic interpretation. The NLSML result also contains a confidence value.
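
For illustration, here is a minimal sketch of extracting those fields from an NLSML string (shown in Python for brevity; in a FreeSWITCH dialplan you would do the equivalent in Lua). The sample result below is representative of the NLSML format, not actual Voicegain output:

import xml.etree.ElementTree as ET

# representative NLSML result, not actual Voicegain output
nlsml = """<result>
  <interpretation confidence="0.87">
    <instance>6505551212</instance>
    <input mode="speech">six five zero five five five one two one two</input>
  </interpretation>
</result>"""

root = ET.fromstring(nlsml)
interp = root.find("interpretation")
print(interp.get("confidence"))     # confidence value, e.g. 0.87
print(interp.findtext("input"))     # the recognized utterance
print(interp.findtext("instance"))  # the semantic interpretation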


The normal "play_and_detect_speech" command holds onto the ASR session until the end of the call. This makes subsequent recognitions more responsive, but you are paying for the MRCP session the whole time. You can instead use the "play_and_detect_speech_close_asr" command to release the ASR session immediately after recognition.


If you have any questions about the use of Voicegain ASR via MRCP please contact us at: support@voicegain.ai


Coming Soon

On our roadmap we have a mod_voicegain plugin for FreeSWITCH, which will bypass the need for mod_unimrcp and the unimrcp server and will talk directly from FreeSWITCH to the Voicegain ASR using gRPC.

Use Cases
Implementing Real-time Agent Assist with Voicegain

As the pandemic forces contact centers to operate with work-from-home agents, managers are increasingly looking to real-time speech analytics to drive improvements in agent efficiency (via reduction in AHT) and effectiveness (improvements in FCR and NPS) and to achieve 100% compliance.

Before the pandemic, contact center managers relied on a combination of in-person supervision and speech analytics of recorded calls to drive improvements in agent efficiency and effectiveness.

However, the pandemic has upended everything. It has forced contact centers to support work-from-home agents in multiple locations. Team leads who "walked the floor" to monitor and assist agents in real time are no longer available. The offline speech analytics process - which is still available remotely - is limited and manual: a call coach or QA analyst coaches an agent using a sample of 1-2% of the calls that have been transcribed and analyzed.

There is now an urgent need to monitor and support agents in real time and provide them with all the tools and support they had while working in the office.

Real-time Agent Assist is the use of Artificial Intelligence - more specifically, Speech Recognition and Natural Language Processing - to help agents in real time during the call in the following ways:

  1. Agents can be presented with knowledge-base articles and next-best actions based on intents extracted from the transcribed text
  2. Using NLU algorithms and the extracted intents, the call can be summarized automatically, saving disposition/wrap time
  3. Supervisors can monitor sentiment in real time

Real-time Agent Assist can reduce AHT by 30 seconds to 1 minute, improve FCR by 3-5% and improve NPS/CSAT.

What does it take to implement Real-time Agent Assist?

Real-time Agent Assist involves transcribing the agent-caller interaction in real time, extracting keywords, insights, and intents from the transcribed text, and making them available in a user-friendly manner to agents as well as to team leads and supervisors.

There are 4 key steps involved:

  1. Audio Capture: The first step is to stream the two channels of audio (i.e., the agent and caller streams) from the contact center platform that the client is using (whether premise-based or cloud-based). Voicegain supports a variety of protocols to stream audio; we have described them here and here. We have integrated with major premise-based contact center platforms like Avaya, Cisco and Genesys, as well as with the Media Stream APIs of programmable CCaaS platforms like Twilio and SignalWire.
  2. Transcription: The next step is to transcribe the audio streams into text. Voicegain offers Transcription APIs to convert the audio into text in real time. We can stream the text in real time (using WebSockets or gRPC) so that it can be easily integrated into any NLU engine (see the consumer sketch after this list).
  3. NLU/Text Analytics: In this step, the NLU engine extracts intents from the transcribed text. These intents are trained in an earlier phase using phrases and sentences. Voicegain integrates with leading NLU engines like RASA, Google Dialogflow, Amazon Lex and Salesforce Einstein.
  4. Integration with Agent Desktop: The last and final step is to integrate the results of the NLU with the Agent Desktop.
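
To give a feel for step 2, here is a minimal sketch of a real-time transcript consumer over a WebSocket. The URL and the JSON message shape are illustrative assumptions - the actual protocol is described in our Transcription API documentation:

import asyncio
import json
import websockets  # pip install websockets

async def consume_transcript(ws_url: str):
    # connect to the transcript stream and print phrases as they arrive
    async with websockets.connect(ws_url) as ws:
        async for message in ws:
            event = json.loads(message)
            # hypothetical message shape: {"utterance": "...", "start": <ms>}
            print(event.get("utterance", ""), flush=True)

# placeholder URL - a session-specific URL is returned when the
# real-time transcription session is started via the API
asyncio.run(consume_transcript("wss://api.voicegain.ai/v1/<session-specific-path>"))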

At Voicegain, we make it really easy to develop real-time agent assist applications. Sign up to test the accuracy of our real-time model.
