Our Blog

News, Insights, sample code & more!

ASR
Voicegain MRCP ASR - Quick, Affordable and Simple replacement for the Nuance Recognizer which is rapidly approaching EOL

This article outlines how the modern Voicegain deep-learning based Speech-to-Text/ASR can be a simple and affordable alternative for businesses that are looking for a quick and easy replacement for their on-premise Nuance Recognizer. Nuance has announced that it is going to end support for Nuance Recognizer, its grammar-based ASR which uses the MRCP protocol, sometime in 2026 or 2027. Organizations that have a Speech-enabled IVR as their front door to the contact center therefore need to start planning now.

The future belongs to Generative AI powered Voice Agents

With the rise of Generative AI and highly accurate low latency speech-to-text models, the front door of the call center is poised for major transformation. The infamous and highly frustrating IVR phone menu will be replaced by Conversational AI Voicebots; but this will likely happen over the next 3-5 years. As enterprises start to plan their migration journey from these tree-based IVRs to an Agentic AI future, they would like to do this on their timelines. In other words, they do not want to be forced to do this under the pressure of a deadline because of EOL of their vendor.

Staying On-Premise or in a VPC for both the IVR platform and the ASR

In addition, the migration path proposed by Nuance is a multi-tenant cloud offering. While a cloud based ASR/Speech-to-Text engine is likely to make sense for most businesses, there are companies in regulated sectors that are prevented from sending their sensitive audio data to a multi-tenant cloud offering. 

In addition to the EOL announcement by Nuance for their on-premise ASR, Genesys, a major IVR platform vendor, has announced that its premise-based offerings - Genesys Engage and Genesys Connect - will approach EOL at the same time as the Nuance ASR.

So businesses that want a modern Gen AI powered Voice Assistant but want to keep the IVR on-premise in their datacenter or behind their firewall in a VPC will need to start planning very quickly what their strategy is going to be.

At Voicegain, we offer enterprises in this situation a modern Voicebot platform that lets them remain on-premise or in their VPC. This Voicebot platform runs on modern Kubernetes clusters and leverages the latest NVIDIA GPUs.

Replacing Nuance Recognizer with Voicegain is quick and easy!

Rewriting the IVR application logic to migrate from a tree-based IVR menu to a conversational Voice Assistant is a journey that requires investment and allocation of resources. Hence a good first step is to simply replace the underlying Nuance ASR (and possibly the IVR platform too). This lets a company migrate to a modern Gen-AI Voice Assistant on its own timeline.

Voicegain offers a modern, highly accurate deep-learning-based Speech-to-Text engine trained on hundreds of thousands of hours of telephone conversations. It is integrated into our native modern telephony stack. It can also talk over the MRCP protocol with VoiceXML based IVR platforms, and it supports the traditional speech grammars (SRGS, JSGF). Voicegain also supports a range of built-in grammars (like Zipcode, Dates, etc.).

As a result, it is a simple "drop-in" replacement for the Nuance Recognizer. There is no need to rewrite the current IVR application. Instead of pointing to the IP address of the Nuance Server, the VoiceXML platform just needs to be reconfigured to point to the IP address of the Voicegain ASR server. This should take no more than a couple of minutes.

Voicegain Telephony Bot API - a Callback API for Telephony-based AI Voice Assistant

In addition to the Voicegain ASR/STT engine, we also offer a Telephony Bot API. This is a callback style API that includes our native IVR platform and ASR/STT engine and can be used to build Gen AI powered Voicebots. It integrates with leading LLMs - both cloud and open-source premise based - to drive a natural language conversation with the callers.

Talk to us about your IVR migration journey!

If you would like to discuss your IVR migration journey, please email us at sales@voicegain.ai. At Voicegain, we have decades of experience in designing, building and launching conversational IVRs and Voice Assistants.

Here is also a link to more information. Please feel free to schedule a call directly with one of our Co-founders.

Streaming
Streaming audio to Voicegain for real-time Speech-to-Text/ASR

Many applications of speech-to-text (STT) or speech recognition (ASR) require that the conversion from audio to text happen in realtime. These applications could be voice bots, live captioning of videos, events or talks, transcription of meetings, real time speech analytics of sales calls or agent assistance in a contact center.

An important question for developers looking to integrate real time STT into their apps is the choice of the protocol and/or mechanism to stream real time audio to the STT platform. While some STT vendors offer just one method, at Voicegain we offer multiple choices that developers can select from. In this post, we explore all these methods in detail so that a developer can choose the right one for their specific use case.

Some of the factors that may guide the specific choice are:

  • Your existing programming language and implementation platform - are there client libraries available in the programming language/ dev platform (whether Java, Javascript, Python, Go, etc) that the app is built on?
  • How the audio stream is made available to the app - your application may already be receiving the audio stream in a particular manner and format.
  • The type of application and its requirements for latency and network resiliency
  • Related to above - the quality of the network between the app and the STT platform.

At Voicegain we currently offer seven different methods/protocols to support streaming to our STT platform. The first three are TCP based methods and the last four are UDP based.

  • TCP based methods are generally a good idea if the quality of network is very robust
  • UDP based methods might be a better choice if the application supports telephony

The Choices

1. WebSockets

Using WebSockets is a simple and popular option to stream audio to Voicegain for speech recognition. WebSockets have been around for a while and most web programming languages have libraries that support them. This option may be the easiest way to get started. The Voicegain API uses binary WebSockets, and we have some simple examples to get you started.
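As an illustration of the client side of this approach, here is a minimal sketch of how raw audio could be cut into small binary messages before being handed to a WebSocket client. It assumes 16-bit mono PCM at 8 kHz and 20 ms frames, and `send_binary` is a hypothetical stand-in for whichever WebSocket library's binary send method your application uses; the Voicegain endpoint URL and session setup are not covered in this post and are left out.

```python
from typing import Callable, Iterator


def pcm_frames(audio: bytes, sample_rate: int = 8000,
               bytes_per_sample: int = 2, frame_ms: int = 20) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration frames (the last may be shorter)."""
    frame_size = sample_rate * bytes_per_sample * frame_ms // 1000
    for start in range(0, len(audio), frame_size):
        yield audio[start:start + frame_size]


def stream_audio(audio: bytes, send_binary: Callable[[bytes], None]) -> int:
    """Send each frame as one binary message; return the number of frames sent."""
    count = 0
    for frame in pcm_frames(audio):
        send_binary(frame)  # hypothetical: the binary send of your WebSocket client
        count += 1
    return count


# Example with a stand-in sender that simply collects the messages.
sent = []
one_second = bytes(16000)  # 1 s of silence: 8000 samples * 2 bytes each
frames_sent = stream_audio(one_second, sent.append)
```

In a live application the frames would come from a microphone or telephony source and be sent at the pace they are captured, rather than from a buffer that is already complete.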

2. HTTP 1.1 with Chunked transfer encoding

Voicegain also supports streaming over HTTP 1.1 using chunked transfer encoding. This allows you to send raw audio data of unknown size, which is generally the case for streaming audio. Voicegain supports both pull and push scenarios - we can fetch the audio from a URL that you provide or the application can submit the audio to a URL that we provide. To use this method, your programming language should have libraries that support chunked transfer encoding over HTTP; some of the older or simpler HTTP libraries do not support it.
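To show what chunked transfer encoding looks like on the wire, here is a small sketch of the HTTP/1.1 chunk framing itself: each chunk is prefixed with its size in hexadecimal, and a zero-size chunk ends the body. Most HTTP libraries that support chunked uploads do this for you, so the function `encode_chunked` is purely illustrative and is not part of any Voicegain SDK.

```python
from typing import Iterable, Iterator


def encode_chunked(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Frame byte chunks according to HTTP/1.1 chunked transfer encoding."""
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would terminate the body early
        # Chunk size in hex, CRLF, the data itself, CRLF.
        yield f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    # The terminating zero-size chunk, with no trailer fields.
    yield b"0\r\n\r\n"


body = b"".join(encode_chunked([b"abc", b"", b"0123456789abcdef"]))
```

Because the total length never has to be declared up front, the sender can keep emitting chunks for as long as audio keeps arriving.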

3. gRPC

gRPC builds on top of HTTP/2 protocol which was designed to support long-running bi-directional connections. Moreover, gRPC uses Protocol buffers which are a more efficient data serialization format compared to JSON that is commonly used in RESTful HTTP APIs. Both these aspects of gRPC allow audio data to be efficiently sent over the same connection that is also used for sending commands and receiving results.

With gRPC, client side libraries can easily be generated for multiple languages, like Java, C#, C++, Go, Python, Node.js, etc. The generated client code contains stubs for use by gRPC clients to call the methods defined by the service.

Using gRPC, clients can invoke the Voicegain STT APIs like a local object whose methods expose the APIs.  This method is a fast, efficient, and low-latency way to stream audio to Voicegain and receive recognition responses. The responses are sent over the same connection back from the server to client - this removes the need for polling or callbacks to get the results when using HTTP.

gRPC is great when used from the back-end code or from Android. It is not a plug and play solution when used from Web Browsers but requires some extra steps.

UDP Based Methods

The first three methods described above are TCP based methods. They work great for audio streaming as long as the connection has no or minimal packet loss. Packet loss causes significant delays and jitter in the TCP connections. This may be fine if audio does not have to be processed truly real-time and can be buffered.  

If real-time behavior is important and the network is known to be unreliable, the UDP protocol is a better alternative to TCP for audio streaming. With UDP, packet loss will manifest itself as audio dropouts, but that may be preferable to excessive pauses and jitter in case of TCP.
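The trade-off described above can be sketched in a few lines. The hypothetical `reassemble` function below rebuilds an audio stream from numbered packets and substitutes silence for any packet that never arrived, which is what an audio dropout amounts to. It assumes 20 ms frames of 8 kHz G.711 mu-law audio (160 bytes per packet, with 0xFF as the silence value); a real receiver would also deal with reordering and sequence number wrap-around.

```python
from typing import Dict


def reassemble(received: Dict[int, bytes], first_seq: int, last_seq: int,
               frame_size: int = 160, silence: int = 0xFF) -> bytes:
    """Concatenate frames in sequence order, filling lost packets with silence."""
    out = bytearray()
    for seq in range(first_seq, last_seq + 1):
        # A missing sequence number becomes a short dropout instead of a stall.
        out += received.get(seq, bytes([silence]) * frame_size)
    return bytes(out)


# Packet 2 was lost in transit; playback continues without waiting for it.
packets = {1: b"\x01" * 160, 3: b"\x03" * 160}
audio = reassemble(packets, first_seq=1, last_seq=3)
```

With TCP the same loss would instead hold back packet 3 until packet 2 had been retransmitted, which is where the pauses and jitter come from.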

4. RTP protocol with Voicegain extensions

RTP is a standard protocol for audio streaming over UDP. However, RTP by itself is generally not sufficient and is normally used with the accompanying RTP Control Protocol (RTCP). Voicegain has implemented its own variation of RTCP that can be used to control RTP audio streams sent to the recognizer.

Currently, the only way to stream audio using RTP to the Voicegain platform is to use our proprietary Audio Sender Java library. We also provide an Audio Sender Daemon that is capable of reading data directly from audio devices and streaming it to Voicegain for real time transcription.
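For readers unfamiliar with RTP, the sketch below builds and parses the standard 12-byte RTP fixed header using only the Python standard library. It covers plain RTP only; the Voicegain control extensions mentioned above are not described in this post, so they are not represented here. The function names are illustrative.

```python
import struct

RTP_VERSION = 2
PCMU_PAYLOAD_TYPE = 0  # static payload type for G.711 mu-law


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = PCMU_PAYLOAD_TYPE,
                     marker: bool = False) -> bytes:
    """Prefix an audio payload with the 12-byte RTP fixed header."""
    byte0 = RTP_VERSION << 6  # no padding, no extension, zero CSRC entries
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed header fields of an RTP packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# One 20 ms frame of 8 kHz mu-law audio is 160 bytes.
packet = build_rtp_packet(b"\xff" * 160, seq=1, timestamp=160, ssrc=0x1234)
fields = parse_rtp_header(packet)
```

The sequence number lets the receiver detect lost or reordered packets, and the timestamp lets it place each frame correctly in time.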

5. SIP/RTP

If you are looking to invoke Speech-to-Text in a contact center, Voicegain offers Telephony Bot APIs. You can read more about them here. Essentially the Voicegain platform can act as a SIP endpoint and can be invited into a SIP session. We can do two things: 1) as part of an IVR or Bot, play prompts and gather caller input; 2) as part of real-time agent assist, listen to and transcribe the agent-caller interaction.

To elaborate on (1), with these APIs you can invite the Voicegain platform into a SIP session, which gives the Voicegain Speech-to-Text engine access to the audio. Once the audio stream is established, you can issue commands to recognize caller utterances and receive the recognition responses using our web callbacks. You can write the logic of your application using any programming language or an NLU Engine of your choice - all that is needed is the ability to handle HTTP requests and send responses.

Voicegain platform in this scenario essentially acts as a 'mouth' and an 'ear' to the entire conversation which happens over SIP/RTP. The application can issue JSON commands over HTTP that play prompts and convert caller speech into text through the entire duration of the call over a single session. You can also record the entire conversation if the call is transferred to a live agent and transcribe into text.

6. MRCP

Contact center platform vendors like Cisco, Genesys, Avaya and FreeSWITCH based CCaaS platforms usually support MRCP to connect to Speech Recognition engines. Voicegain supports access over MRCP to both large vocabulary and grammar based speech recognition. We recommend MRCP only for Edge, Private Cloud or On-premise deployments.

7. SIPREC

In Contact Centers, for real-time transcription of the agent caller interaction, Voicegain supports SIPREC. Further information is provided here.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

Voice Bot
Voicegain releases Telephony Bot APIs for IVRs and Voice Bots

Update Dec 2020: We have renamed RTC Callback APIs to Telephony Bot APIs to better reflect how developers can use these APIs - which is to build Voice Bots and IVRs.


If you have wanted to voice enable your Chatbot or build your own Telephony based Voice Bot or a Speech-enabled IVR, Voicegain has built an API for exactly that - Release 1.12.0 of the Voicegain Speech-to-Text Platform now includes Telephony Bot APIs (formerly called RTC Callback APIs).

Voicegain Telephony Bot APIs enable any NLU/Bot Framework to easily integrate with PSTN/telephony infrastructure by either (a) using a SIP INVITE to bring the Voicegain platform in from a CPaaS platform of your choice or (b) purchasing a phone number directly from the Voicegain portal and pointing it to your Bot. You can then use these callback style APIs to (i) play prompts, (ii) recognize speech utterances or DTMF digits, and (iii) allow for barge-in, along with several other features. We offer sample code that will help you easily integrate a Bot Framework of your choice with our Telephony Bot APIs.


If you do not have a Bot Framework, that's okay too. You can write the logic in any backend programming language (Python, Java or Node.js) that can serialize responses in JSON format and interact with our callback style APIs. Voicegain also offers a declarative YAML format for defining the call flow; you can host this YAML file and have it drive the interaction with these APIs. Developers can also code and deploy the application logic in a serverless computing environment like AWS Lambda.


Many enterprises - in banking, financial services, health care, telecom and retail - are stuck with legacy telephony based IVRs that are approaching obsolescence.

Voicegain's Telephony Bot APIs provide a great future-proof upgrade path for such enterprises. Since these APIs are based on web callbacks, they can interact with any backend programming language. So any backend web developer can design, build and maintain such apps.


Why should you use Telephony Bot APIs?

With Telephony Bot APIs, integration becomes much simpler for developers.

1) You can SIP INVITE the Voicegain Speech-to-Text/ASR platform to a SIP/RTP session for as long as is needed. We support SIP integration with CPaaS platforms like Twilio, Signalwire and Telnyx. We also support CCaaS platforms like Genesys, Cisco and Avaya.

2) We also support direct phone number ordering and SIP Trunks from the Voicegain Web Console. More integrations will be added soon.

Telephony Bot APIs are based on web callbacks where the actual program/implementation is on the Client side and the Voicegain Telephony Bot APIs define the Requests and Responses. The meaning of Requests and Responses is reversed with respect to what you would see in a normal Web API:

  • Responses provide the commands, while
  • Requests provide the outcome of those commands.

Illustrated example of Telephony Bot API in action

Below is an example of a simple phone call interaction which is controlled by Telephony Bot API. The sequence diagram shows 4 callbacks during a toy survey call:

  • Req 1: Phone Call arrived
  • Resp 1: Say: "Welcome"
  • Req 2: Done saying "Welcome"
  • Resp 2: Ask: "Are you happy", bind reply to happy var
  • Req 3: Caller's answer was "yes", happy=YES
  • Resp 3: Disconnect
  • Req 4: Disconnected
  • Resp 4: We are done
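The callback sequence above can be sketched as a small handler on the client side. Note that every field name here (`event`, `sid`, `action`, `say`, `ask`, `bind`, `vars`) is hypothetical and chosen only to mirror the bullet list; the actual request and response schema is defined in the Voicegain API documentation. The point is the inverted flow: each incoming request reports an outcome, and each response carries the next command.

```python
from typing import Dict

# Survey answers collected per call, keyed by a hypothetical session id.
survey_results: Dict[str, str] = {}


def handle_callback(request: dict) -> dict:
    """Map an incoming request (an outcome) to a response (the next command)."""
    event = request.get("event")
    if event == "call_arrived":        # Req 1
        return {"action": "output", "say": "Welcome"}
    if event == "output_done":         # Req 2
        return {"action": "input", "ask": "Are you happy", "bind": "happy"}
    if event == "input_done":          # Req 3
        survey_results[request["sid"]] = request["vars"]["happy"]
        return {"action": "disconnect"}
    return {}                          # Req 4: nothing left to do


# Replay the four requests of the toy survey call.
call_requests = [
    {"sid": "call-1", "event": "call_arrived"},
    {"sid": "call-1", "event": "output_done"},
    {"sid": "call-1", "event": "input_done", "vars": {"happy": "YES"}},
    {"sid": "call-1", "event": "disconnected"},
]
responses = [handle_callback(r) for r in call_requests]
```

In a deployed application `handle_callback` would sit behind an HTTP endpoint, and the per-call state would live in a store that survives between requests.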


Currently supported actions

Telephony Bot API supports 4 types of actions:

  • output: say something - TTS with a choice of 8 different voices is supported
  • input: ask question - both speech input and DTMF are supported. For speech input you can use GRXML, JSGF or built-in grammars
  • transfer: transfer a call to a phone destination
  • disconnect: end the call

Wait, there is more

Each call can be recorded (two channel recording) and then transcribed. The recording and the transcript can be accessed from the portal as well as via the API.

Roadmap

Features coming soon:

  • record Callback action - you can use it to implement voicemail or record other types of messages
  • transfer to a sip destination
  • input - allow choice of large vocabulary speech-to-text in addition to grammars - use the captured text in your NLU
  • answer call at a sip address - instead of a phone number
  • WebRTC support
  • outbound dialing

Developers
Python SDK Available

As of August 5th, 2020, programming in Python against the Voicegain Speech-to-Text (STT) API got even easier with the release of the official voicegain-speech package to the Python Package Index (PyPI) repository.


The SDK package is available at: https://pypi.org/project/voicegain-speech/

The SDK source code is available at: https://github.com/voicegain/python-sdk


This package wraps Voicegain Speech-to-Text Web API. A preview of the API spec can be found at: https://www.voicegain.ai/api

Full API spec documentation is available at: https://console.voicegain.ai/api-documentation


The core APIs are for Speech-to-Text, either transcription or recognition (further described below). Other available APIs include:

  • RTC Callback APIs which in addition to speech-to-text allow for control of RTC session (e.g., a telephone call).
  • Websocket APIs for managing broadcast websockets used in real-time transcription.
  • Language Model creation and manipulation APIs.
  • Data upload APIs that help in certain STT use scenarios.
  • Training Set APIs - for use in preparing data for acoustic model training.
  • GREG APIs - for working with ASR and Grammar tuning tool - GREG.

Transcribe API

/asr/transcribe

The Transcribe API allows you to submit audio and receive the transcribed text word-for-word from the STT engine. This API uses our Large Vocabulary language model and supports long form audio in async mode.

The API can be used, for example, to transcribe audio data such as podcasts, voicemails and call recordings. In real-time streaming mode it can be used, for example, for building voice-bots (your application will have to provide the NLU capabilities to determine intent from the transcribed text).

The result of transcription can be returned in four formats:

  • Transcript - Contains the complete text of transcription
  • Words - Intermediate results will contain new words, with timing and confidences, since the previous intermediate result. The final result will contain complete transcription.
  • Word-Tree - Contains a tree of all feasible alternatives. Use this when integrating with NL postprocessing to determine the final utterance and its meaning.
  • Captions - Intermediate results will be suitable to use as captions (this feature is in beta).

Recognize API

/asr/recognize

This API should be used if you want to constrain STT recognition results to the speech-grammar that is submitted along with the audio (grammars are used in place of the large vocabulary language model).

While having to provide grammars is an extra step (compared to Transcribe API), they can simplify the development of applications since the semantic meaning can be extracted along with the text.

Another advantage of using grammars is that they can ignore words in the utterance that are outside of grammar - still delivering recognition although with lower confidence.

Voicegain supports grammars in the JSGF and GRXML formats - both grammar standards used by enterprises in IVRs since the early 2000s. The Recognize API only supports short form audio - no more than 60 seconds.
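Since the Recognize API is limited to short form audio, an application may want to check a recording's length before submitting it. The following is a minimal sketch using only the Python standard library and assuming the audio is a WAV file held in memory; the helper names are illustrative and are not part of the voicegain-speech package.

```python
import io
import wave

MAX_RECOGNIZE_SECONDS = 60  # short form audio limit stated above


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_recognize_limit(wav_bytes: bytes) -> bool:
    """True if the audio is short enough for grammar-based recognition."""
    return wav_duration_seconds(wav_bytes) <= MAX_RECOGNIZE_SECONDS


def make_silent_wav(seconds: int, rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV of silence, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(2 * rate * seconds))
    return buf.getvalue()


short_ok = fits_recognize_limit(make_silent_wav(5))
long_ok = fits_recognize_limit(make_silent_wav(61))
```

Longer recordings are the territory of the Transcribe API, which supports long form audio in async mode.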


Developers
CORS Support Added in 1.9.0

We have recently added support for CORS (Cross Origin Resource Sharing) in our APIs. This was in response to our customers asking for it so that they can build Speech-to-Text web applications with minimal effort. By making web API requests to the Voicegain Speech API directly from their web clients, the application can be simpler and more efficient.

Examples of simple applications that our customers are implementing this way are: microphone input capture and transcription (e.g. to capture and transcribe meeting notes), or offline-audio file transcription.

Users have full control, via security settings, over which Origins should be allowed to make the CORS requests.
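To illustrate what an Origin allow-list means in practice, here is a generic sketch of the server-side check that CORS relies on. It is not Voicegain's implementation; the function name and the example origins are made up. A browser only lets a cross-origin response through to the page when the `Access-Control-Allow-Origin` header matches the page's origin.

```python
from typing import Dict, Iterable, Optional


def cors_headers(origin: Optional[str],
                 allowed_origins: Iterable[str]) -> Dict[str, str]:
    """Return CORS response headers for an allowed origin, else none at all."""
    if origin is not None and origin in set(allowed_origins):
        # Echo the specific origin and tell caches the response varies by it.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    return {}  # without the header, the browser blocks the cross-origin read


allowed = ["https://app.example.com"]
granted = cors_headers("https://app.example.com", allowed)
denied = cors_headers("https://evil.example.net", allowed)
```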

Model Training
Competitive Advantage of Custom Acoustic Models

There is no doubt that there is a lot of value in the datasets that are used to train AI models. That is one of the reasons why Google offers their Speech-to-Text service at two price points, one with 'data logging' and one without, see table below.



However at Voicegain, our speech-to-text platform does not capture or use any customer data (while still being able to offer low ASR pricing).

Moreover, the Voicegain platform enables our customers to use their data to train their own dedicated and custom Acoustic Models. As a result, our customers benefit in two ways:

  • The accuracy of these custom acoustic models is several percent higher than that of our base models.
  • Custom models are licensed exclusively to the clients and are not shared with anyone (neither Voicegain, nor any other Voicegain customers), so this higher accuracy translates directly into competitive advantage.

By retaining ownership of the data and the custom acoustic models, our customers benefit from higher ASR accuracy in general, and higher accuracy than their potential competitors in particular.

Insights
How AI powered Speech can boost Contact Center BPO topline?

Senior leadership teams at most global contact center outsourcers are constantly under pressure. They need to have a laser like focus on key metrics, SLAs and people to manage their businesses. They are increasingly managing a global distributed business that is both labor intensive and technology intensive. And they have to do all of this with increasingly tight margins.

Despite being measured on metrics like CSAT and NPS, a lot of the value that an outsourcer delivers to its clients is often hard to quantify. And too often the price realized by the outsourcer does not capture the value and quality an outsourcer provides.

Two Ideas to pivot into high value SaaS offerings

In this article I would like to propose two new innovative ideas that can help Contact Center BPOs pivot into new SaaS (Software-as-a-Service) revenues.

  1. CX Speech Insights Service: Develop a new branded realtime CX insights service based on speech analytics powered by deep learning.
  2. CX Speech Automation Service: Build new voice self-service applications that can automate some of the common customer care scenarios.

Both these offerings can be offered to the clients using a Software-as-a-Service (SaaS) based business model in conjunction with the traditional agent side of the business.


Both these SaaS offerings leverage some of the key strengths that BPOs have: deep domain expertise, in-depth understanding of customer issues, and technology infrastructure that leverages both.

1. CX Speech Insights Service

Contact centers have a treasure trove of audio data. Every day associates are handling thousands of calls across a wide variety of topics. While outsourcers use legacy speech analytics vendors, the traditional use has been to analyze a sample of calls to assist in the Quality Assurance function. Net-net, it is viewed as a cost center both for the outsourcers and their clients.

However there is a massive untapped opportunity to mine and extract insights from such audio data for uses well beyond quality assurance. Such insights may be relevant to stakeholders in Product and Marketing teams of the clients. This can open up new non-traditional product and marketing budgets for BPOs.

2. CX Speech Automation Service

Outsourcers have an in-depth understanding of the topics that customers are currently calling about. They have unique and current insights into which categories of calls are actually driving volumes. With the right tools, methodologies and personnel, outsourcers can build and offer innovative speech self-service applications that automate parts of calls. With the right technologies, outsourcers can move seamlessly between agent assisted calls and automated self-service interactions.

The Foundation: Deep Neural Networks & custom acoustic models

The foundation for these SaaS offerings is a modern Deep Neural Network (DNN) based Speech-to-Text platform.

The older speech-to-text technologies were based on traditional statistical models (called HMMs and GMMs). They were limited in their ability to train on specific industry jargon and accents. A DNN based platform, by contrast, has the following advantages:

  1. A DNN based platform can be easily trained to recognize unique words/jargon, accents and noisy backgrounds. Training the models increases the quality of recognition and makes it accurate enough to deliver real value to client stakeholders.
  2. An industry or customer specific acoustic model has the potential to create intellectual property for the BPO.
  3. A DNN platform can be used equally well both in the up front automation part and in the analytics and notification service. There are benefits from using the same platform for both offerings.

For more info, please contact us at info@voicegain.ai.

