Our Blog

News, Insights, sample code & more!

ASR
Announcing the launch of Voicegain Whisper ASR/Speech Recognition API for Gen AI developers

Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that runs on Voicegain managed cloud infrastructure and is accessible through Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that process over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna and innovative startups like Level.AI, Onvisource, and DataOrb.

The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to integrate conversation transcripts with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), Llama 2 (open source from Meta), and their own private LLMs to power generative AI apps. OpenAI has open-sourced several versions of the Whisper models; with today's release Voicegain supports Whisper-medium, Whisper-small, and Whisper-base. Voicegain now supports transcription in the many languages supported by Whisper.

Here is a link to our product page.


There are four main reasons for developers to use Voicegain Whisper over other offerings:

1. Support for Private Cloud/On-Premise deployment (integrate with Private LLMs)

While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based speech-to-text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month. We bring this practical, real-world experience of running AI models at scale to our developer community.

Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.

2. Affordable pricing - 40% less expensive than OpenAI

At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what Open AI offers.

3. Enhanced features for Contact Centers & Meetings

Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps are another important feature of our API, needed to map audio to text. Enhanced diarization models - already available for the Voicegain models and a required feature for contact center and meeting use cases - will soon be made available on Whisper.

4. Premium Support and uptime SLAs

We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.

About the OpenAI Whisper Model

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model uses an encoder-decoder transformer architecture and has shown significant performance improvements over previous models because it was trained on a variety of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

[Figure: OpenAI Whisper model encoder-decoder transformer architecture]

Getting Started with Voicegain Whisper

Learn more about Voicegain Whisper by clicking here. Any developer - whether a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.

There are two ways to test Voicegain Whisper. They are outlined here. If you would like more information or if you have any questions, please drop us an email at support@voicegain.ai.
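For developers who want a concrete starting point, here is a minimal Python sketch of submitting a pre-recorded file for batch transcription with a Whisper model. The base URL, the field nesting, and the model-selection field name are illustrative assumptions; consult the API documentation for the exact schema.

```python
import base64
import requests

# Assumed base URL and JWT auth; both come from your Voicegain developer account.
API_URL = "https://api.voicegain.ai/v1/asr/transcribe/async"
TOKEN = "<your JWT token>"

# Inline the audio as base64 (passing a source URL is another common option).
with open("call-recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "sessions": [{
        "asyncMode": "OFF-LINE",              # batch (pre-recorded) transcription
        "content": {"full": ["transcript"]},  # return a plain-text transcript
    }],
    "audio": {"source": {"inline": {"data": audio_b64}}},
    # Hypothetical field name for choosing the Whisper model size:
    "settings": {"asr": {"acousticModel": "whisper-medium"}},
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
print(resp.json())  # includes a session id that can be polled for the result
```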

Announcements
Voice command applications made easier

New continuous recognition option

In the latest Voicegain release (1.16.0) we have added a new option, continuousRecognition, to our /asr/recognize/async speech-to-text API. When enabled, it modifies the default behavior of grammar-based recognition.


Normally, when the /asr/recognize/async API is used, the recognizer returns once the grammar is matched and the complete timeout expires. That means only a single recognition is possible per /asr/recognize/async request. If a no-match or no-input is detected, the recognition terminates.


However, some use cases demand that the recognizer, for example, ignore all no-matches until a match is found. This is what the continuousRecognition option is for.


With continuousRecognition you have fine control over which of the four events - no-input, no-match, match, and error - will be returned in a callback and which (if any) will terminate recognition. If you do not set any event to terminate recognition, the session can be stopped by closing the audio stream or by returning stop:true from the callback.


What is it good for?

An example might be a use case where a voicemail is being played to a caller and during the playback we want to interpret caller commands like: stop, next, previous, save, delete. If we used normal recognition we would encounter situations where what was said was not understood. Stopping recognition on a no-match would not make much sense because either: (1) re-prompting would mess up the flow of the call, or (2) restarting recognition might introduce a gap that results in missing part of what the caller said.


In a scenario like this it is best to ignore the no-match and continue listening; the caller will notice that there was no response and will naturally repeat the command.


The settings for continuous recognition that would work in this case are listed below, followed by a request sketch:

  • stopOn : match, error
  • noCallbackFor : no-input, no-match - notes: (1) in this case we suggest setting a very long noInputTimeout so that internally no no-input events are generated; (2) the application could also decide to accept no-match callbacks - they could be tracked and, if too numerous, acted upon.
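Here is a sketch of how those settings might look inside the /asr/recognize/async request body. The option names come from this post; the exact JSON nesting is an assumption.

```python
# Illustrative ASR settings for the voicemail-command scenario; nesting assumed.
asr_settings = {
    "noInputTimeout": 3600000,  # very long (1 hour, in ms) so no no-input events fire
    "continuousRecognition": {
        "stopOn": ["match", "error"],               # only a match or an error ends the session
        "noCallbackFor": ["no-input", "no-match"],  # ignore these events silently
    },
}
```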

Continuous recognition is supported in the Voicegain integration for Twilio Media Streams - either TwiML <Stream> or <Connect><Stream> in Twilio Programmable Voice.

It is not yet supported in Voicegain Telephony Bot APIs.

Developers
Python script for testing automated speech recognition (ASR) accuracy

Many of our customers have been asking us for help in benchmarking the Voicegain speech-to-text recognizer (ASR) on their specific audio files. To make this benchmarking easier we have released a Python script that accomplishes just that. With a single command line you can transcribe all audio files from an input directory and compare them against reference transcripts, calculating the WER for each file. You can also do a two-way comparison of the reference against both the Voicegain and the Google Speech-to-Text transcripts.
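For context, the WER the script reports is the word-level edit distance (substitutions, insertions, deletions) between the reference and the hypothesis, divided by the number of reference words. A minimal illustration of the calculation (not the script itself):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn lights on"))  # 0.25 (one deletion, 4 ref words)
```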

The script and the documentation are available at: https://github.com/voicegain/platform/tree/master/utility-scripts/test-transcribe

See our benchmark blog post for an idea of what kind of accuracy to expect from the Voicegain recognizer.


Model Training
Custom ASR with Acoustic Model Training - Two Case Studies

Updated: Feb 28 2022

In this blog post we describe two case studies that illustrate the improvements in speech-to-text (ASR) recognition accuracy that can be expected from training the underlying acoustic models. We trained our acoustic model to better recognize Indian and Irish English.

Case study setup

The Voicegain out-of-the-box Acoustic Model, available as the default on the Voicegain Platform, had been trained to recognize mainly US English, although our training data set did contain some British English audio. The training data did not contain Indian or Irish English, except for perhaps accidental occurrences.

Both case studies were performed in an identical manner:

  • Training data contained about 300 hours of transcribed speech audio.
  • Training was done to get improved accuracy on the new type of data while at the same time retaining the baseline accuracy. An alternative would have been to aim for maximum improvement on the new data at the expense of accuracy on the baseline model.
  • Training was stopped after significant improvement was achieved. It could have been continued to achieve further improvement, although that might have been marginal.
  • Benchmarks presented here were done on data that was not included in the training set.

Case Study 1: Indian English

Here are the parameters of this study.

  • We had 250 hours of audio containing male and female speakers, each speaker reading about 50 minutes worth of speech audio.
  • We separated out 6 speakers for the benchmark, selecting 3 male and 3 female samples. Samples were selected to contain easy, medium, and difficult test cases.

Here are the results of the benchmark before and after training. For comparison, we also include results from Google Enhanced Speech-to-Text.

Some observations:

  • All 6 test speakers show significant improvement over the original accuracy.
  • After training, the accuracy for 5 speakers is better than Google Enhanced Speech-to-Text. The one remaining speaker improved a lot (from 62% to 76%) but the accuracy was still not as good as Google's. We examined the audio and it turns out that it was not recorded properly: the speaker was speaking very quietly and the microphone gain was set very high, which resulted in the audio containing a lot of strange artifacts, such as tongue clicking. The speaker also read the text in a very unnatural, "mechanical" way. Kudos to Google for doing so well on such a bad recording.
  • On average, custom-trained Voicegain speech-to-text was better by about 2% on our Indian English benchmark compared to the Google Enhanced recognizer.

Case Study 2: Irish English

Here are the parameters of this study.

  • We collected about 350 hours of transcribed speech audio from one speaker from Northern Ireland.
  • For the benchmark we retained some audio from that speaker that was not used for training, plus we found audio from 5 other speakers with various types of Irish English accents.

Here are the results of the benchmark before and after training. We also include results from Google Enhanced Speech-to-Text.


Some observations:

  • The speaker that was used for training is labeled here as 'Legge'. We see a huge improvement after training, from 76.2% to 88.5%, which is significantly above Google Enhanced at 83.9%.
  • The other speaker with over 10% improvement is 'Lucas', who has a very similar accent to 'Legge'.
  • We looked in detail at the audio of the speaker labeled 'Cairns', who had the least improvement and for whom Google was better than our custom-trained recognizer. The audio has significantly lower quality than the other samples, plus it contains noticeable echo. Its audio characteristics are quite different from those of the training data used.
  • On average, custom-trained Voicegain speech-to-text was better by about 1% on our Irish English benchmark compared to the Google Enhanced recognizer.

Further Observations

  • The amount of training data, at 250-350 hours, was not large given that acoustic models for speech recognition are normally trained on tens of thousands of hours of audio.
  • The large improvement on the 'Legge' speaker suggests that if the goal is to improve recognition of a very specific type of speech or speaker, the training set could be smaller - maybe 50 to 100 hours - and still achieve significant improvement.
  • A bigger training set may be needed - 500 hours or more - in cases where the variability of the speech and other audio characteristics is large.

UPDATE Feb 2022

We have published 2 additional studies showing the benefits of Acoustic Model training:

Interested in Voicegain? Take us for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

CPaaS
Large vocabulary transcription for Twilio developers

In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting from release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

  • lower cost per speech-to-text capture
  • higher accuracy for customers who choose Acoustic Model customization
  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to using Voicegain for grammar-based recognition; these are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:
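For illustration, here is a sketch of such a payload in Python (field names follow the notes below; the exact nesting is an assumption - consult the API reference for the precise schema):

```python
import requests

# Illustrative request body for /asr/transcribe/async; nesting assumed.
payload = {
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "content": {"full": ["transcript"]},  # plain-text transcript in the callback
        "callback": {"uri": "https://example.com/transcript-callback"},  # hypothetical
    }],
    "audio": {
        "source": {"stream": {"protocol": "TWIML"}},  # Twilio Media Streams
        "format": "PCMU",  # u-law
        "rate": 8000,      # 8 kHz
    },
    "settings": {
        "asr": {
            "startInputTimers": False,  # timers start when the prompt finishes
            "noInputTimeout": 5000,     # ms
            "completeTimeout": 2000,    # ms
        }
    },
}

resp = requests.post("https://api.voicegain.ai/v1/asr/transcribe/async",
                     json=payload, headers={"Authorization": "Bearer <JWT>"})
websocket_url = resp.json()["audio"]["stream"]["websocketUrl"]  # used in the TwiML request
```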


Some notes about the content of the request:

  • we are requesting the callback to return the transcript in plain-text form - other options are possible, like words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
  • startInputTimers tells the ASR to delay starting its timers - they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8kHz
  • the asr settings include the two timeouts used in transcription - the no-input and complete timeouts

This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.

Note that in transcribe mode DTMF detection is currently not possible. Please let us know if this is critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open Media Streams connection to Voicegain. This is done by means of the following TwiML request:
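For illustration, here is a minimal sketch of such a TwiML document, rendered from Python to match this post's other examples. Only the core <Connect><Stream> element is shown; the prompt and bargeIn options described in the notes below are omitted, since their exact attribute names are not shown here.

```python
# Minimal TwiML that connects the call audio to the Voicegain websocket.
websocket_url = "wss://..."  # the audio.stream.websocketUrl value from Voicegain

twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="{websocket_url}" />
  </Connect>
</Response>"""
print(twiml)  # return this as the body of your Twilio voice webhook response
```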



Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from the Voicegain /asr/transcribe/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: (1) a recording retrieved from a URL, (2) a TTS prompt (several voices are available), (3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking

Returned Transcription Response

Below is an example response from the transcription in the case where "content" : {"full" : ["transcript"]}.
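For illustration, a minimal sketch of a callback handler that would receive this response; the field path used to pull out the transcript is an assumption.

```python
from flask import Flask, request

app = Flask(__name__)

# Endpoint registered as the callback URI in the /asr/transcribe/async request.
@app.route("/transcript-callback", methods=["POST"])
def transcript_callback():
    body = request.get_json()
    # With "content": {"full": ["transcript"]} the callback carries plain text;
    # the exact field path below is assumed for illustration.
    transcript = body.get("result", {}).get("transcript", "")
    print(transcript)
    return "", 204
```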



Use Cases
Live Transcription Example

We want to share a short video showing live transcription in action at CBC. This one uses our baseline Acoustic Model; no customizations were made and no hints were used. The video gives an idea of the latency achievable with real-time transcription.


Automated real-time transcription is a great solution for accommodating the hearing impaired when no sign-language interpreter is available. It can be used, for example, at churches to transcribe sermons, at conventions and meetings to transcribe talks, and at educational institutions (schools, universities) to live-transcribe lessons and lectures.

Voicegain Platform provides a complete stack to support live transcription:

  • Utility for audio capture at source
  • Cloud based or On-Prem transcription engine and API
  • Web portal for controlling multiple simultaneous live transcriptions
  • Web-based viewer app to enable following the transcription on any device with a web browser. This app can also be embedded into any web page.

Very high accuracy - above that provided by Google, Amazon, and Microsoft Cloud speech-to-text - can be achieved through Acoustic Model customization.

CPaaS
How to use Voicegain with Twilio Media Streams

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

The differences between Voicegain speech recognition and Twilio TwiML <Gather> are:

  1. Voicegain supports grammars with semantic tags (GRXML or JSGF) while <Gather> is a large vocabulary recognizer that just returns text, and
  2. Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).

When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.

Each recognition will involve two main steps described below:

Initiating Speech Recognition with Voicegain

This is done by invoking Voicegain async recognition API: /asr/recognize/async

Below is an example of the payload needed to start a new recognition session:
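For illustration, here is a sketch of such a payload in Python (field names follow the notes below; the exact nesting is an assumption):

```python
import requests

# Illustrative request body for /asr/recognize/async; nesting assumed.
payload = {
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "callback": {"uri": "https://example.com/recognition-callback"},  # hypothetical
    }],
    "audio": {
        "source": {"stream": {"protocol": "TWIML"}},  # Twilio Media Streams
        "format": "PCMU",  # u-law
        "rate": 8000,      # 8 kHz
    },
    "settings": {
        "asr": {
            "startInputTimers": False,  # timers start when the prompt finishes
            "noInputTimeout": 5000,     # ms
            "completeTimeout": 2000,    # ms
            "incompleteTimeout": 3000,  # ms
            # GRXML grammar loaded from an external URL (grammar location assumed):
            "grammar": {"type": "external", "uri": "https://example.com/menu.grxml"},
        }
    },
}

resp = requests.post("https://api.voicegain.ai/v1/asr/recognize/async",
                     json=payload, headers={"Authorization": "Bearer <JWT>"})
websocket_url = resp.json()["audio"]["stream"]["websocketUrl"]  # used in the TwiML request
```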

Some notes about the content of the request:

  • startInputTimers tells the ASR to delay starting its timers - they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8kHz
  • the asr settings include the three standard timeouts used in grammar-based recognition - the no-input, complete, and incomplete timeouts
  • the grammar is set to a GRXML grammar loaded from an external URL

This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.

Note that if the grammar is specified to recognize DTMF, the Voicegain recognizer will detect DTMF signals included in the audio sent from the Twilio platform.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open Media Streams connection to Voicegain. This is done by means of the following TwiML request:


Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from the Voicegain /asr/recognize/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: (1) a recording retrieved from a URL, (2) a TTS prompt (several voices are available), (3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking

Returned Recognition Response

Below is an example response from the recognition. This response is from a built-in phone grammar.

