Our Blog

News, Insights, sample code & more!

ASR
Announcing the launch of Voicegain Whisper ASR/Speech Recognition API for Gen AI developers

Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that runs on Voicegain managed cloud infrastructure and is accessible through Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that process over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna and innovative startups like Level.AI, Onvisource, and DataOrb.

The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to integrate conversation transcripts with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), Llama 2 (open source from Meta), and their own private LLMs to power generative AI apps. OpenAI has open-sourced several versions of the Whisper models; with today's release Voicegain supports Whisper-medium, Whisper-small, and Whisper-base. Voicegain now supports transcription in the many languages supported by Whisper.

Here is a link to our product page.


There are four main reasons for developers to use Voicegain Whisper over other offerings:

1. Support for Private Cloud/On-Premise deployment (integrate with Private LLMs)

While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based speech-to-text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month. We bring this practical, real-world experience of running AI models at scale to our developer community.

Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.

2. Affordable pricing - 40% less expensive than OpenAI

At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what Open AI offers.

3. Enhanced features for Contact Centers & Meetings

Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps are another important feature of our API, needed to map audio to text. Enhanced diarization models - already available for the Voicegain models and a required feature for contact center and meeting use cases - will soon be made available on Whisper.

4. Premium Support and uptime SLAs

We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.

About the OpenAI Whisper Model

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model uses an encoder-decoder transformer architecture and has shown significant performance improvements over previous models because it was trained on a variety of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

[Figure: OpenAI Whisper model encoder-decoder transformer architecture]

Getting Started with Voicegain Whisper

Learn more about Voicegain Whisper by clicking here. Any developer - whether a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.

There are two ways to test Voicegain Whisper. They are outlined here. If you would like more information or if you have any questions, please drop us an email at support@voicegain.ai.
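For developers who want a concrete starting point, here is a minimal Python sketch of submitting a pre-recorded file for batch transcription with a Whisper model. The base URL, the field nesting, and the model-selection field name are illustrative assumptions; consult the API documentation for the exact schema.

```python
import base64
import requests

# Assumed base URL and JWT auth; both come from your Voicegain developer account.
API_URL = "https://api.voicegain.ai/v1/asr/transcribe/async"
TOKEN = "<your JWT token>"

# Inline the audio as base64 (passing a source URL is another common option).
with open("call-recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "sessions": [{
        "asyncMode": "OFF-LINE",              # batch (pre-recorded) transcription
        "content": {"full": ["transcript"]},  # return a plain-text transcript
    }],
    "audio": {"source": {"inline": {"data": audio_b64}}},
    # Hypothetical field name for choosing the Whisper model size:
    "settings": {"asr": {"acousticModel": "whisper-medium"}},
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
print(resp.json())  # includes a session id that can be polled for the result
```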

Announcements
Voice command applications made easier

New continuous recognition option

In the latest Voicegain release (1.16.0) we have added a new option, continuousRecognition, to our /asr/recognize/async speech-to-text API. When enabled, it modifies the default behavior of grammar-based recognition.


Normally, when the /asr/recognize/async API is used, the recognizer returns once the grammar is matched and the complete timeout expires. That means only a single recognition is possible per /asr/recognize/async request. If a no-match or no-input is detected, the recognition terminates.


However, some use cases demand that the recognizer, for example, ignore all no-matches until a match is found. This is what the continuousRecognition option is for.


With continuousRecognition you have fine control over which of the four events - no-input, no-match, match, and error - will be returned in a callback and which (if any) will terminate recognition. If you do not set any event to terminate recognition, the session can be stopped by closing the audio stream or by returning stop:true from the callback.


What is it good for?

An example might be a use case where a voicemail is being played to a caller and during the playback we want to interpret caller commands like: stop, next, previous, save, delete. If we used normal recognition we would encounter situations where what was said was not understood. Stopping recognition on a no-match would not make much sense because either: (1) re-prompting would mess up the flow of the call, or (2) restarting recognition might introduce a gap that results in missing part of what the caller said.


In a scenario like this it is best to ignore the no-match and continue listening; the caller will notice that there was no response and will naturally repeat the command.


The settings for continuous recognition that would work in this case are listed below, followed by a request sketch:

  • stopOn : match, error
  • noCallbackFor : no-input, no-match - notes: (1) in this case we suggest setting a very long noInputTimeout so that internally no no-input events are generated; (2) the application could also decide to accept no-match callbacks - they could be tracked and, if too numerous, acted upon.
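Here is a sketch of how those settings might look inside the /asr/recognize/async request body. The option names come from this post; the exact JSON nesting is an assumption.

```python
# Illustrative ASR settings for the voicemail-command scenario; nesting assumed.
asr_settings = {
    "noInputTimeout": 3600000,  # very long (1 hour, in ms) so no no-input events fire
    "continuousRecognition": {
        "stopOn": ["match", "error"],               # only a match or an error ends the session
        "noCallbackFor": ["no-input", "no-match"],  # ignore these events silently
    },
}
```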

Continuous recognition is supported in the Voicegain integration for Twilio Media Streams - either TwiML <Stream> or <Connect><Stream> in Twilio Programmable Voice.

It is not yet supported in Voicegain Telephony Bot APIs.

Developers
Python script for testing automated speech recognition (ASR) accuracy

Many of our customers have been asking us for help in benchmarking the Voicegain speech-to-text recognizer (ASR) on their specific audio files. To make this benchmarking easier we have released a Python script that accomplishes just that. With a single command line you can transcribe all audio files from an input directory and compare them against reference transcripts, calculating the WER for each file. You can also do a two-way comparison of the reference against both the Voicegain and the Google Speech-to-Text transcripts.
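For context, the WER the script reports is the word-level edit distance (substitutions, insertions, deletions) between the reference and the hypothesis, divided by the number of reference words. A minimal illustration of the calculation (not the script itself):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn lights on"))  # 0.25 (one deletion, 4 ref words)
```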

The script and the documentation are available at: https://github.com/voicegain/platform/tree/master/utility-scripts/test-transcribe

See our benchmark blog post for an idea of what kind of accuracy to expect from the Voicegain recognizer.


Model Training
Custom ASR with Acoustic Model Training - Two Case Studies

Updated: Feb 28 2022

In this blog post we describe two case studies that illustrate the improvements in speech-to-text (ASR) recognition accuracy that can be expected from training the underlying acoustic models. We trained our acoustic model to better recognize Indian and Irish English.

Case study setup

The Voicegain out-of-the-box Acoustic Model, available as the default on the Voicegain Platform, had been trained to recognize mainly US English, although our training data set did contain some British English audio. The training data did not contain Indian or Irish English, except for perhaps accidental occurrences.

Both case studies were performed in an identical manner:

  • Training data contained about 300 hours of transcribed speech audio.
  • Training was done to get improved accuracy on the new type of data while at the same time retaining the baseline accuracy. An alternative would have been to aim for maximum improvement on the new data at the expense of accuracy on the baseline model.
  • Training was stopped after significant improvement was achieved. It could have been continued to achieve further improvement, although that might have been marginal.
  • Benchmarks presented here were done on data that was not included in the training set.

Case Study 1: Indian English

Here are the parameters of this study.

  • We had 250 hours of audio containing male and female speakers, each speaker reading about 50 minutes worth of speech audio.
  • We separated out 6 speakers for the benchmark, selecting 3 male and 3 female samples. Samples were selected to contain easy, medium, and difficult test cases.

Here are the results of the benchmark before and after training. For comparison, we also include results from Google Enhanced Speech-to-Text.

Some observations:

  • All 6 test speakers show significant improvement over the original accuracy.
  • After training, the accuracy for 5 speakers is better than Google Enhanced Speech-to-Text. The one remaining speaker improved a lot (from 62% to 76%) but the accuracy was still not as good as Google's. We examined the audio and it turns out that it was not recorded properly: the speaker was speaking very quietly and the microphone gain was set very high, which resulted in the audio containing a lot of strange artifacts, such as tongue clicking. The speaker also read the text in a very unnatural, "mechanical" way. Kudos to Google for doing so well on such a bad recording.
  • On average, custom-trained Voicegain speech-to-text was better by about 2% on our Indian English benchmark compared to the Google Enhanced recognizer.

Case Study 2: Irish English

Here are the parameters of this study.

  • We collected about 350 hours of transcribed speech audio from one speaker from Northern Ireland.
  • For the benchmark we retained some audio from that speaker that was not used for training, plus we found audio from 5 other speakers with various types of Irish English accents.

Here are the results of the benchmark before and after training. We also include results from Google Enhanced Speech-to-Text.


Some observations:

  • The speaker that was used for training is labeled here as 'Legge'. We see a huge improvement after training, from 76.2% to 88.5%, which is significantly above Google Enhanced at 83.9%.
  • The other speaker with over 10% improvement is 'Lucas', who has a very similar accent to 'Legge'.
  • We looked in detail at the audio of the speaker labeled 'Cairns', who had the least improvement and for whom Google was better than our custom-trained recognizer. The audio has significantly lower quality than the other samples, plus it contains noticeable echo. Its audio characteristics are quite different from those of the training data used.
  • On average, custom-trained Voicegain speech-to-text was better by about 1% on our Irish English benchmark compared to the Google Enhanced recognizer.

Further Observations

  • The amount of training data, at 250-350 hours, was not large given that acoustic models for speech recognition are normally trained on tens of thousands of hours of audio.
  • The large improvement on the 'Legge' speaker suggests that if the goal is to improve recognition of a very specific type of speech or speaker, the training set could be smaller - maybe 50 to 100 hours - and still achieve significant improvement.
  • A bigger training set may be needed - 500 hours or more - in cases where the variability of the speech and other audio characteristics is large.

UPDATE Feb 2022

We have published 2 additional studies showing the benefits of Acoustic Model training:

Interested in Voicegain? Take us for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

CPaaS
Large vocabulary transcription for Twilio developers

In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting from release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

  • lower cost per speech-to-text capture
  • higher accuracy for customers who choose Acoustic Model customization
  • access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> involves steps similar to using Voicegain for grammar-based recognition; these are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:
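For illustration, here is a sketch of such a payload in Python (field names follow the notes below; the exact nesting is an assumption - consult the API reference for the precise schema):

```python
import requests

# Illustrative request body for /asr/transcribe/async; nesting assumed.
payload = {
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "content": {"full": ["transcript"]},  # plain-text transcript in the callback
        "callback": {"uri": "https://example.com/transcript-callback"},  # hypothetical
    }],
    "audio": {
        "source": {"stream": {"protocol": "TWIML"}},  # Twilio Media Streams
        "format": "PCMU",  # u-law
        "rate": 8000,      # 8 kHz
    },
    "settings": {
        "asr": {
            "startInputTimers": False,  # timers start when the prompt finishes
            "noInputTimeout": 5000,     # ms
            "completeTimeout": 2000,    # ms
        }
    },
}

resp = requests.post("https://api.voicegain.ai/v1/asr/transcribe/async",
                     json=payload, headers={"Authorization": "Bearer <JWT>"})
websocket_url = resp.json()["audio"]["stream"]["websocketUrl"]  # used in the TwiML request
```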


Some notes about the content of the request:

  • we are requesting the callback to return the transcript in plain-text form - other options are possible, like words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
  • startInputTimers tells the ASR to delay starting its timers - they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8kHz
  • the asr settings include the two timeouts used in transcription - the no-input and complete timeouts

This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.

Note that in transcribe mode DTMF detection is currently not possible. Please let us know if this is critical to your use case.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open Media Streams connection to Voicegain. This is done by means of the following TwiML request:
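For illustration, here is a minimal sketch of such a TwiML document, rendered from Python to match this post's other examples. Only the core <Connect><Stream> element is shown; the prompt and bargeIn options described in the notes below are omitted, since their exact attribute names are not shown here.

```python
# Minimal TwiML that connects the call audio to the Voicegain websocket.
websocket_url = "wss://..."  # the audio.stream.websocketUrl value from Voicegain

twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="{websocket_url}" />
  </Connect>
</Response>"""
print(twiml)  # return this as the body of your Twilio voice webhook response
```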



Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from the Voicegain /asr/transcribe/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: (1) a recording retrieved from a URL, (2) a TTS prompt (several voices are available), (3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking

Returned Transcription Response

Below is an example response from the transcription in the case where "content" : {"full" : ["transcript"]}.
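For illustration, a minimal sketch of a callback handler that would receive this response; the field path used to pull out the transcript is an assumption.

```python
from flask import Flask, request

app = Flask(__name__)

# Endpoint registered as the callback URI in the /asr/transcribe/async request.
@app.route("/transcript-callback", methods=["POST"])
def transcript_callback():
    body = request.get_json()
    # With "content": {"full": ["transcript"]} the callback carries plain text;
    # the exact field path below is assumed for illustration.
    transcript = body.get("result", {}).get("transcript", "")
    print(transcript)
    return "", 204
```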



Use Cases
Live Transcription Example

We want to share a short video showing live transcription in action at CBC. This one uses our baseline Acoustic Model; no customizations were made and no hints were used. The video gives an idea of the latency achievable with real-time transcription.


Automated real-time transcription is a great solution for accommodating the hearing impaired when no sign-language interpreter is available. It can be used, for example, at churches to transcribe sermons, at conventions and meetings to transcribe talks, and at educational institutions (schools, universities) to live-transcribe lessons and lectures.

Voicegain Platform provides a complete stack to support live transcription:

  • Utility for audio capture at source
  • Cloud based or On-Prem transcription engine and API
  • Web portal for controlling multiple simultaneous live transcriptions
  • Web-based viewer app to enable following the transcription on any device with a web browser. This app can also be embedded into any web page.

Very high accuracy - above that provided by Google, Amazon, and Microsoft Cloud speech-to-text - can be achieved through Acoustic Model customization.

CPaaS
How to use Voicegain with Twilio Media Streams

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

The differences between Voicegain speech recognition and Twilio TwiML <Gather> are:

  1. Voicegain supports grammars with semantic tags (GRXML or JSGF) while <Gather> is a large vocabulary recognizer that just returns text, and
  2. Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).

When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.

Each recognition will involve two main steps described below:

Initiating Speech Recognition with Voicegain

This is done by invoking Voicegain async recognition API: /asr/recognize/async

Below is an example of the payload needed to start a new recognition session:
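For illustration, here is a sketch of such a payload in Python (field names follow the notes below; the exact nesting is an assumption):

```python
import requests

# Illustrative request body for /asr/recognize/async; nesting assumed.
payload = {
    "sessions": [{
        "asyncMode": "REAL-TIME",
        "callback": {"uri": "https://example.com/recognition-callback"},  # hypothetical
    }],
    "audio": {
        "source": {"stream": {"protocol": "TWIML"}},  # Twilio Media Streams
        "format": "PCMU",  # u-law
        "rate": 8000,      # 8 kHz
    },
    "settings": {
        "asr": {
            "startInputTimers": False,  # timers start when the prompt finishes
            "noInputTimeout": 5000,     # ms
            "completeTimeout": 2000,    # ms
            "incompleteTimeout": 3000,  # ms
            # GRXML grammar loaded from an external URL (grammar location assumed):
            "grammar": {"type": "external", "uri": "https://example.com/menu.grxml"},
        }
    },
}

resp = requests.post("https://api.voicegain.ai/v1/asr/recognize/async",
                     json=payload, headers={"Authorization": "Bearer <JWT>"})
websocket_url = resp.json()["audio"]["stream"]["websocketUrl"]  # used in the TwiML request
```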

Some notes about the content of the request:

  • startInputTimers tells the ASR to delay starting its timers - they will be started later, when the question prompt finishes playing
  • TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8kHz
  • the asr settings include the three standard timeouts used in grammar-based recognition - the no-input, complete, and incomplete timeouts
  • the grammar is set to a GRXML grammar loaded from an external URL

This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in making the TwiML request.

Note that if the grammar is specified to recognize DTMF, the Voicegain recognizer will detect DTMF signals included in the audio sent from the Twilio platform.

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open Media Streams connection to Voicegain. This is done by means of the following TwiML request:


Some notes about the content of the TwiML request:

  • the websocket URL is the one returned from the Voicegain /asr/recognize/async request
  • more than one question prompt is supported - they will be played one after another
  • three types of prompts are supported: (1) a recording retrieved from a URL, (2) a TTS prompt (several voices are available), (3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
  • bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking

Returned Recognition Response

Below is an example response from the recognition. This response is from a built-in phone grammar.

