Blog | Speech-to-Text Platform

Contact Center

Voicegain Acquires TrampolineAI to deliver End-to-End Contact Center AI for Healthcare Payers

Arun Santhebennur

•

min read

•

January 7, 2026

New unified platform combines AI voice agent automation with Real-time agent assistance and Auto QA, enabling healthcare payers to reduce average handle time (AHT) and improve first contact resolution (FCR) in their call centers.

IRVING, Texas and SAN FRANCISCO, Jan. 7, 2026 /PRNewswire-PRWeb/ -- Voicegain, a leader in AI Voice Agents and Infrastructure, today announced the acquisition of TrampolineAI, a venture-backed healthcare payer-focused Contact Center AI company whose products supports thousands of member interactions. The acquisition unifies Voicegain's AI Voice Agent automation with Trampoline's real-time agent assistance and Auto QA capabilities, enabling healthcare payers to optimize their entire contact center operation—from fully automated interactions to AI-enhanced human agent support.

Healthcare payer contact centers face mounting pressure to reduce costs while improving member experience. The reasons vary from CMS pressure, Medicaid redeterminations, Medicare AEP volume and staffing shortages. The challenge lies in balancing automation for routine inquiries with personalized support for complex interactions. The combined Voicegain and TrampolineAI platform addresses this challenge by providing a comprehensive solution that spans the full spectrum of contact center needs—automating high-volume routine calls while empowering human agents with real-time intelligence for interactions that require specialized attention.

"We're seeing strong demand from healthcare payers for a production-ready Voice AI platform. TrampolineAI brings deep payer contact center expertise and deployments at scale, accelerating our mission at Voicegain." — Arun Santhebennur

Over the past two years, Voicegain has scaled Casey, an AI Voice Agent purpose-built for health plans, TPAs, utilization management, and other healthcare payer businesses. Casey answers and triages member and provider calls in health insurance payer call centers. After performing HIPAA validation, Casey automates routine caller intents related to claims, eligibility, coverage/benefits, and prior authorization. For calls requiring live assistance, Casey transfers the interaction context via screen pop to human agents.

TrampolineAI has developed a payer-focused Generative AI suite of contact center products—Assist, Analyze, and Auto QA—designed to enhance human agent efficiency and effectiveness. The platform analyzes conversations between members and agents in real-time, leveraging real-time transcription and Gen AI models. It provides real-time answers by scanning plan documents such as Summary of Benefits and Coverage (SBCs) and Summary Plan Descriptions (SPDs), fills agent checklists automatically, and generates payer-optimized interaction summaries. Since its founding, TrampolineAI has established deployments with leading TPAs and health plans, processing hundreds of thousands of member interactions.

"Our mission at Voicegain is to enable businesses to deploy private, mission-critical Voice AI at scale," said Arun Santhebennur, Co-founder and CEO of Voicegain. "As we enter 2026, we are seeing strong demand from healthcare payers for a comprehensive, production-ready Voice AI platform. The TrampolineAI team brings deep expertise in healthcare payer operations and contact center technology, and their solutions are already deployed at scale across multiple payer environments."

Through this acquisition, Voicegain expands the Casey platform with purpose-built capabilities for payer contact centers, including AI-assisted agent workflows, real-time sentiment analysis, and automated quality monitoring. TrampolineAI customers gain access to Voicegain's AI Voice Agents, enterprise-grade Voice AI infrastructure including real-time and batch transcription, and large-scale deployment capabilities, while continuing to receive uninterrupted service.

"We founded TrampolineAI to address the significant administrative cost challenges healthcare payers face by deploying Generative Voice AI in production environments at scale," said Mike Bourke, Founder and CEO of TrampolineAI. "Joining Voicegain allows us to accelerate that mission with their enterprise-grade infrastructure, engineering capabilities, and established customer base in the healthcare payer market. Together, we can deliver a truly comprehensive solution that serves the full range of contact center needs."

A TPA deploying TrampolineAI noted the platform's immediate impact, stating that the data and insights surfaced by the application were fantastic, allowing the organization to see trends and issues immediately across all incoming calls.

The combined platform positions Voicegain to deliver a complete contact center solution spanning IVA call automation, real-time transcription and agent assist, Medicare and Medicaid compliant automated QA, and next-generation analytics with native LLM analysis capabilities. Integration work is already in progress, and customers will begin seeing benefits of the combined platform in Q1 2026.

Following the acquisition, TrampolineAI founding team members Mike Bourke and Jason Fama have joined Voicegain's Advisory Board, where they will provide strategic guidance on product development and AI innovation for healthcare payer applications.

The terms of the acquisition were not disclosed.

About Voicegain

Voicegain offers healthcare payer-focused AI Voice Agents and a private Voice AI platform that enables enterprises to build, deploy, and scale voice-driven applications. Voicegain Casey is designed specifically for healthcare payers, supporting automated and assisted customer service interactions with enterprise-grade security, scalability, and compliance. For more information, visit voicegain.ai.

About TrampolineAI

TrampolineAI was a venture-backed voice AI company focused on healthcare payer solutions. The company applies Generative Voice AI to contact centers to improve operational efficiency, member experience, and compliance through real-time agent assist, sentiment analysis, and automated quality assurance technologies. For more information, visit trampolineai.com.

Media Contact:

Arun Santhebennur

Co-founder & CEO, Voicegain

press@voicegain.ai

Media Contact

Arun Santhebennur, Voicegain, 1 9725180863 701, arun@voicegain.ai, https://www.voicegain.ai

SOURCE Voicegain

‍

Developers

Voicegain Speech-to-Text integrates with Twilio Media Streams

Jacek Jarmulak

•

min read

•

September 7, 2020

Voicegain launched an extension to Voicegain /asr/recognize API that supports Twilio Media Streams via TwiML <Connect><Stream>. With this launch, developers using Twilio's Programmable Voice get an accurate, affordable, and easy to use ASR to build Voice Bots /Speech-IVRs.

Update: Voicegain also announced that its large vocabulary transcription (/asr/transcribe API) integrates with Twilio Media Streams. Developers may use this to voice enable a chat bot developed on any bot platform or develop a real-time agent assist application.

Key Features of Twilio Media Streams support

Voicegain Twilio Media Streams support gives developers the following features:

Grammar Support for bots & IVRs: Developers can now write voice bots or ivrs that use grammars. Use of grammars can improve recognition accuracy and simplify bot development by constraining the speech-to-text engine. Also many traditional VoiceXML IVRs are built using grammars. Until now Twilio TwiML did not support use of speech grammars as the <Gather> command supports only text capture. This made it hard to build simple bots or migrate existing VoiceXML IVR applications to the Twilio platform. Mapping of text to semantic meaning had to be done separately, plus large vocabulary recognizer was more likely to return spurious recognitions. Voicegain solves these problems by supporting both GRXML and JSGF speech grammars at the core speech-to-text (ASR) engine level. This delivers higher accuracy compared to an ASR that uses a large vocabulary language model to recognize text and then applies grammars to the recognized text.
90% Savings on ASR Licensing costs: A big advantage for developers of the Twilio Programmable Voice platform has been its affordable pricing. However, that was not necessarily true for existing ASR options like <Gather> that is priced at 8 cents/minute (with a 15 second minimum). With Voicegain the ASR/STT price is 1.25 cents/ minute measured at 1 second increments. If you include the billing increment, developers get 90% cost savings.
Better Timeout Support: Voicegain supports configurable timeouts for no-input, complete timeout and incomplete timeout. Because the grammar is integrated with the recognizer, Voicegain ASR is able to deliver accurate complete timeout response which is not possible with <Gather> command for which the only way to tell if the caller stopped speaking is a large enough pause.
Simplify dynamic prompt playback. -- In order to make use of <Connect><Stream> as easy as possible, we support passing prompts when invoking <Stream>. Prompts can be provided either as text or as URLs. If provided as text then Voicegain will either use TTS or perform dynamic concatenation of prerecorded prompts. A prompt manager for such prerecorded prompts is provided as part of Voicegain Web Portal. Configurable barge-in is supported for the prompts.
Fine-tune and test grammars. -- Voicegain Web Portal includes a tool for reviewing and fine tuning grammars. The tool also supports regression tests. With this functionality you will never have to deploy grammars without knowing how well they are going to perform after changes.

How Twilio Media Streams works with Voicegain

‍

‍

TwiML <Stream> requires a websocket url. This url can be obtained by invoking Voicegain /asr/recognize/async API. When invoking this API the grammar to be used in the recognition has to be provided. The websocket URL will be returned in the response.

‍

In addition to the wss url, Custom Parameters within <Connect><Stream> command are used to pass information about the question prompt to be played to the caller by Voicegain. This can be a text or a url to a service that will provide the audio.

Once <Connect><Stream> has been invoked, Voicegain platform takes over- it:

Plays the prompt via the back channel of <Stream>
As soon as caller starts speaking, the prompt playback is stopped (if it it was still playing) exactly like in <Gather>
Spoken words are are recognized using grammar. Recognition result is then provided as a callback from Voicegain Platform. In case of no-input or no-match an appropriate callback will also be made.
<Stream> connection is stopped and the TwiML application will continue with a next command.

BTW, we also support DTMF input as an alternative to speech input.

[UPDATE: you can see more details of how to use Voicegain with Twilio Media Streams in this new Blog post.]

Other features of the Voicegain Platform

1. On Premise Edge Support: While Voicegain APIs are available as a cloud PaaS service, Voicegain also supports OnPrem/Edge deployment. Voicegain can be deployed as a containerized service on a single node Kubernetes cluster, or onto multi-node high-availability Kubernetes cluster (on your GPU hardware or your VPC).

2. Acoustic model customization: This allows to achieve very high accuracy beyond what is possible with out of the box recognizers. The grammar tuning and regression tool mentioned earlier, can be used to collect training data for acoustic model customization.

More Features Coming

On our near-term roadmap for Twilio users we have several more features:

Advanced Answering Machine Detection (AMD) -- will be invoked using <Connect><Stream> and will provide very accurate answering machine detection using speech recognition.
Large vocabulary language model to just capture the spoken words (no grammars are used) and integrate with any NLU Engine of your choice. We think it will be attractive because of the lower cost compared to <Gather>.
Real-time agent assist - we are combining our real-time speech recognition with speech analytics to deliver an API that will allow for building real-time agent assist and monitoring applications.

You can sign up to try our platform. We are offering 600 minutes of free monthly use of the platform. If you have questions about integration with Twilio, send us a note at support@voicegain.ai.

Twilio, TwiML and Twilio Programmable Voice are registered trademarks of Twilio, Inc

‍

Voice Bot

Building Voice Bots: Should you always use an NLU engine?

Arun Santhebennur

•

min read

•

August 31, 2020

Businesses of all sizes are looking to develop Voicebots to automate customer service calls or voice based sales interactions. These bots may be voice versions of existing Chatbots, or exclusively voice based bots. While Chatbots automate routine transactions over the web, many users like the ability to use voice (app or phone) when it is convenient.

A voice bot dialog consists of multiple interactions where a single interaction typically involves 3 steps:

A caller/customer's spoken utterance is converted into text
Intent is extracted from the transcribed text
Next step of the conversation is determined based on the intent extracted and the current state/context of the conversation.

For the first step, developers use a Speech-to-Text platform to transcribe the spoken utterance into text. ASR or Automatic Speech Recognition is another term that is used to describe the same type of software.

When it comes to extracting intent from the customer utterance, they typically use an NLU engine. This is understandable because developers would like to re-use the dialog flow or conversation turns programmed in their Chatbot App for their Voicebot.

A second option is to use Speech Grammars which match the spoken utterance and assign meaning (intent) to it. This option is not in vogue these days but Speech Grammars have been successfully used in telephony IVR systems that supported speech interaction using ASR.

This article explores both approaches to building Voicebots.

The NLU approach

Most developers today use the NLU approach as a default option for Steps 2 and 3. Popular NLU engines include Google Dialog Flow, Microsoft LUIS, Amazon Lex and also increasingly an open source framework like RASA.

An NLU Engine helps developers configure different intents that match training phrases, specify input and output contexts that are associated with these intents, and define actions that drive the conversation turns. This method of development is very powerful and expressive. It allows you to build bots that are truly conversational. If you use NLU to build a Chatbot you can generally reuse its application logic for a Voicebot.

But it has a significant drawback. You need to hire highly skilled natural language developers. Designing new intents, handling input and output contexts, entities etc is not easy. Since you require skilled developers, the development of bots using NLU is expensive. It is not just expensive to build but it is costly to maintain too. For example, if you want to add new skills to the bot that are beyond its initial set of capabilities, modifying the contexts is not an easy process.

Net-net the NLU approach is a really good fit if (a) you want to develop a sophisticated bot that can support a truly conversational experience (b) you are able to hire and engage skilled NLP developers and (c) you have adequate budgets to develop such bots.

The Speech Grammar approach

One approach that was used in the past and seems to have been forgotten these days is the use of Speech Grammars. Grammars were used extensively to build traditional telephony based speech IVRs for over 20 years now, but most NLP and web developers are not aware of them.

A Speech Grammar provides either a list of all utterances that can be recognized, or, more commonly, a set of rules that can generate the utterances that can be recognized. Such grammar combines two functions:

it provides a language model that guides the speech-to-text engine in evaluating the hypotheses, and
it can attach semantic meaning to the recognized text utterances.

The second function is achieved by attaching tags to the rules in the grammars. Tag formats exist that support complex expressions to be evaluated for grammars that have many nested rules. These tags allow the developer to essentially code intent extraction right into the grammar.

Also Step 3 - which is the dialog/conversation flow management - can be implemented in any backend programming language - Java, Python or Node.js. Developers of voice bots that are on a budget and are looking to building a simple bot with just a few intents should strongly consider grammars as an alternative approach to NLU.

NLU and Speech Grammar compared

Advantages of NLU

NLU can be applied to text that has been written as well as text coming from speech-to-text engine. This allows in principle for the same application logic to be used for both a Chatbot and a Voicebot. Speech Grammars are not good at ignoring input text that does not match the grammar rules. This makes Speech Grammars not directly applicable to Chatbots, though ways have been devised to allow Speech Grammar to do "fuzzy matching".
A well trained NLU can capture correct intents in more complex situations than a Speech Grammar. Note, however, that some of the NLU techniques could be used to automatically generate grammars with tags that could be a close match for NLU performance.

Advantages of Grammars

NLU intent recognition may suffer if the speech-to-text conversion was not 100% correct. We have seen reports of combined Speech-to-Text+NLU accuracy being very low (down to just 70%) in some use cases. Speech Grammars, on the other hand, are used as a language model while evaluating speech hypotheses. This allows the recognizer to still deliver correct intents even when the spoken phrase does not match the grammar exactly - the recognition result will have lower confidence but will still be usable.
Speech grammars are simple to build and use. Also, there is no need to integrate NLU system with Speech-to-Text system. All the work can be performed by the Speech-to-Text engine

Our Recommendation

Voicegain is one of the few Speech-to-Text or ASR engines that supports both approaches.

Developers can easily integrate Voicegain's large vocabulary speech-to-text (Transcribe API) with any popular NLU engine. One advantage that we have here is the ability to output multiple hypotheses - when using the word-tree output mode. This allows multiple NLU intent matches to be done of the different speech hypotheses with the goal of determining if the there is an NLU consensus in spite of differing speech-to-text output. This approach can deliver higher accuracy.

We also provide our Recognize API and RTC Callback APIs ; both of these support speech grammars. Developers may code the application flow/dialog of the voicebot in any backend programming language - Java, Python, Node.Js. We have extensive support for telephony protocols like SIP/RTP and we support WebRTC.

Most other STT engines - including Microsoft, Amazon and Google - do not support grammars. This may have something to do with the fact that they are also trying to promote their NLU engines for chatbot applications.

If you are building a Voicebot and you'd like to have a discussion on which approach suits you, do not hesitate to get in touch with us. You can email us at info@voicegain.ai.

‍

Streaming

Streaming audio to Voicegain for real-time Speech-to-Text/ASR

Arun Santhebennur

•

min read

•

August 23, 2020

Many applications of speech-to-text (STT) or speech recognition (ASR) require that the conversion from audio to text happen in realtime. These applications could be voice bots, live captioning of videos, events or talks, transcription of meetings, real time speech analytics of sales calls or agent assistance in a contact center.

An important question for developers looking to integrate real time STT into their apps is the choice of the protocol and/or mechanism to stream real time audio to the STT platform. While some STT vendors offer just one method; at Voicegain we offer multiple choices that developers could select from. In this post, we explore in detail all these methods so that a developer could choose the right one for their specific use case.

Some of the factors that may guide the specific choice are:

Your existing programming language and implementation platform - are there client libraries available in the programming language/ dev platform (whether Java, Javascript, Python, Go, etc) that the app is built on?
How audio stream is made available to the app - you application may already be receiving the audio stream in a particular manner and format.
The type of application and its requirements for latency and network resiliency
Related to above - the quality of the network between the app and the STT platform.

At Voicegain we currently offer seven different methods/protocols to support streaming to our STT platform. The first three are TCP based methods and the last four methods are UDP based.

TCP based methods are generally a good idea if the quality of network is very robust
UDP based methods might be a better choice if the application supports telephony

The Choices

1. WebSockets

Using WebSockets is a simple and popular option to stream audio to Voicegain for speech recognition. WebSockets have been around for a while and most web programming languages have libraries that support it. This option may be the easiest way to get started. Voicegain API is using binary WebSockets, and we have some simple examples to get you started.

2. HTTP 1.1 with Chunked transfer encoding

Voicegain also supports streaming over HTTP 1.1 using chunked transfer encoding. This allows you to send raw audio data with unknown size, which is generally the case for streaming audio. Voicegain supports both pull and push scenarios - we can fetch the audio from a URL that you provide or the application can submit the audio to a URL that we provide. To use this method, your programming language should have libraries that support chunked transfer encoding over HTTP, some of the older or simpler HTTP libraries do not support it.

3. gRPC

gRPC builds on top of HTTP/2 protocol which was designed to support long-running bi-directional connections. Moreover, gRPC uses Protocol buffers which are a more efficient data serialization format compared to JSON that is commonly used in RESTful HTTP APIs. Both these aspects of gRPC allow audio data to be efficiently sent over the same connection that is also used for sending commands and receiving results.

With gRPC, client side libraries can easily be generated for multiple languages, like Java, C#, C++, Go, Python, Node Js, etc. The generated client code contains stubs for use by gRPC clients to call the methods defined by the service.

Using gRPC, clients can invoke the Voicegain STT APIs like a local object whose methods expose the APIs. This method is a fast, efficient, and low-latency way to stream audio to Voicegain and receive recognition responses. The responses are sent over the same connection back from the server to client - this removes the need for polling or callbacks to get the results when using HTTP.

gRPC is great when used from the back-end code or from Android. It is not a plug and play solution when used from Web Browsers but requires some extra steps.

UDP Based Methods

The first three methods described above are TCP based methods. They work great for audio streaming as long as the connection has no or minimal packet loss. Packet loss causes significant delays and jitter in the TCP connections. This may be fine if audio does not have to be processed truly real-time and can be buffered.

If real-time behavior is important and the network is known to be unreliable, the UDP protocol is a better alternative to TCP for audio streaming. With UDP, packet loss will manifest itself as audio dropouts, but that may be preferable to excessive pauses and jitter in case of TCP.

4. RTP protocol with Voicegain extensions

RTP is a standard protocol for audio streaming over UDP. However, RTP itself is is generally not sufficient and is normally used with accompanying RTP Control Protocol (RTCP). Voicegain has implemented its own variation of RTCP that can be used to control RTP audio streams sent to the recognizer.

Currently, the only way to to stream audio using RTP to Voicegain platform is to use our proprietary Audio Sender Java library. We also provide Audio Sender Daemon that is capable of reading data directly from audio devices and streaming it to Voicegain for real time transcription.

5. SIP/RTP

If you are looking to invoke Speech-to-text in a contact center, Voicegain offers Telephony Bot APIs. You can read more about them here. Essentially the Voicegain platform can act as a SIP endpoint and can be invited into a SIP session. We can do two things 1) As part of an IVR or Bot, play prompts and gather caller input 2) As part of a real-time agent assist, we can listen & transcribe the agent-caller interaction.

To elaborate on (1), with these APIs you can invite the Voicegain platform into a SIP session which provides Voicegain Speech-to-Text engine access to the audio. Once the audio stream gets established, you can issue commands to recognize call utterances and receive the recognition response using our web callbacks. You can write the logic of your application using any programming language or an NLU Engine of your choice - all that is needed is being able to handle HTTP requests and send responses.

Voicegain platform in this scenario essentially acts as a 'mouth' and an 'ear' to the entire conversation which happens over SIP/RTP. The application can issue JSON commands over HTTP that play prompts and convert caller speech into text through the entire duration of the call over a single session. You can also record the entire conversation if the call is transferred to a live agent and transcribe into text.

6. MRCP

Contact center platform vendors like Cisco, Genesys, Avaya and FreeSWITCH based CCaaS platforms usually support MRCP to connect to Speech Recognition engines. Voicegain supports access over MRCP to both large vocabulary and grammar based speech recognition. We recommend MRCP only for Edge, Private Cloud or On-premise deployments

7. SIPREC

In Contact Centers, for real-time transcription of the agent caller interaction, Voicegain supports SIPREC. Further information is provided here.

Take Voicegain for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click hereto sign up for a developer account and receive $50 in free credits

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

‍

Voice Bot

Voicegain releases Telephony Bot APIs for IVRs and Voice Bots

Jacek Jarmulak

•

min read

•

August 6, 2020

Update Dec 2020: We have renamed RTC Callback APIs to Telephony Bot APIs to better reflect how developers can use these APIs - which is build Voice Bots, IVRs.

If you have wanted to voice enable your Chatbot or build your own Telephony based Voice Bot or a Speech-enabled IVR, Voicegain has built an API that is really cool - Release 1.12.0 of Voicegain Speech-to-Text Platform now includes Telephony Bot APIs (formerly called RTC Callback APIs in the past).

Voicegain Telephony Bot APIs enables any NLU/Bot Framework to easily integrate with PSTN/telephony infrastructure using either (a) SIP INVITE of Voicegain platform from a CPaaS platform of your choice or (b) purchasing a phone number directly from Voicegain portal and pointing it to your Bot. You can then use these callback style APIs to (i) play prompts (ii) recognize speech utterances or DTMF digits (iii) allow for barge-in and several other exciting features. We offer sample code that will help you easily integrate a Bot Framework of your choice to our Telephony Bot APIs.

If you do not have a Bot Framework, thats okay too. You can write the logic in any backend programming language (Python, Java or Node.JS) that can serialize responses in a JSON format and interact with our Callback style APIs. Voicegain also offers a declarative YAML format to define the call flow and you can host this YAML file logic and interact with these APIs. Developers can also code and deploy the application logic in a server-less computing environment like Amazon Lambda.

Many enterprises - in banking, financial services , health care, telecom and retail - are stuck with legacy telephony based IVRs that are approaching obsolescence.

Voicegain's Telephony Bot APIs provide a great future-proof upgrade path for such enterprises. Since these APIs are based on web callbacks, they can interact with any backend programming language. So any backend web developer can design, build and maintain such apps.

Why should you use Telephony Bot APIs?

With Telephony Bot APIs, integration becomes much simpler for developers.

1) You can SIP INVITE the Voicegain Speech-to-Text/ASR platform to a SIP/RTP session for as long as is needed. We support SIP integration with CPaaS platforms like Twilio, Signalwire and Telnyx. We also support CCaaS platforms like Genesys, Cisco and Avaya.

2) We also support direct phone number ordering and SIP Trunks from the Voicegain Web Console. More integrations will be added soon.

Telephony Bot APIs are based on web callbacks where the actual program/ implementation is on the Client side and the Voicegain Telephony Bot APIs define the Requests and Responses. The meaning of Requests and Responses is reversed w.r.t what you would see in a normal Web API:

Responses provide the commands, while
Requests provide the outcome of those commands.

Illustrated example of Telephony Bot API in action

Below is an example of a simple phone call interaction which is controlled by Telephony Bot API. The sequence diagram shows 4 callbacks during a toy survey call:

Req 1: Phone Call arrived
Resp 1: Say: "Welcome"
Req 2: Done saying "Welcome"
Resp 2: Ask: "Are you happy", bind reply to happy var
Req 3: Caller's answer was "yes", happy=YES
Resp 3: Disconnect
Req 4: Disconnected
Resp 4: We are done

‍

Currently supported actions

Telephony Bot API supports 4 types of actions:

output: say something - TTS with a choice of 8 different voices is supported
input: ask question - both speech input and DTMF are supported. For speech input you can use GRXML, JSGF or built-in grammars
transfer: transfer a call to a phone destination
disconnect: end the call

Wait, there is more

Each call can be recorded (two channel recording) and then transcribed. The recording and the transcript can be accessed from the portal as well as via the API.

Roadmap

Features coming soon:

record Callback action - you can use it to implement voicemail or record other types of messages
transfer to a sip destination
input - allow choice of large vocabulary speech-to-text in addition to grammars - use the captured text in your NLU
answer call at a sip address - instead of a phone number
WebRTC support
outbound dialing

‍

Developers

Python SDK Available

Jacek Jarmulak

•

min read

•

August 6, 2020

As of August 5th, 2020, programming in Python against Voicegain Speech-to-Text (STT) API got even easier with the release of official voicegain-speech package to Python Package Index (PyPI) repository.

The SDK package is available at: https://pypi.org/project/voicegain-speech/

The SDK source code is available at: https://github.com/voicegain/python-sdk

This package wraps Voicegain Speech-to-Text Web API. A preview of the API spec can be found at: https://www.voicegain.ai/api

Full API spec documentation is available at: https://console.voicegain.ai/api-documentation

The core APIs are for Speech-to-Text, either transcription or recognition (further described below).Other available APIs include:

RTC Callback APIs which in addition to speech-to-text allow for control of RTC session (e.g., a telephone call).
Websocket APIs for managing broadcast websockets used in real-time transcription.
Language Model creation and manipulation APIs.
Data upload APIs that help in certain STT use scenarios.
Training Set APIs - for use in preparing data for acoustic model training.
GREG APIs - for working with ASR and Grammar tuning tool - GREG.

Transcribe API

/asr/transcribeThe Transcribe API allows you to submit audio and receive the transcribed text word-for-word from the STT engine. This API uses our Large Vocabulary language model and supports long form audio in async mode.

The API can, e.g., be used to transcribe audio data - whether it is podcasts, voicemails, call recordings, etc. In real-time streaming mode it can, e.g., be used for building voice-bots (your the application will have to provide NLU capabilities to determine intent from the transcribed text).

The result of transcription can be returned in four formats:

Transcript - Contains the complete text of transcription
Words - Intermediate results will contain new words, with timing and confidences, since the previous intermediate result. The final result will contain complete transcription.
Word-Tree - Contains a tree of all feasible alternatives. Use this when integrating with NL postprocessing to determine the final utterance and its meaning.
Captions - Intermediate results will be suitable to use as captions (this feature is in beta).

Recognize API

/asr/recognizeThis API should be used if you want to constrain STT recognition results to the speech-grammar that is submitted along with the audio (grammars are used in place of the large vocabulary language model).

While having to provide grammars is an extra step (compared to Transcribe API), they can simplify the development of applications since the semantic meaning can be extracted along with the text.

Another advantage of using grammars is that they can ignore words in the utterance that are outside of grammar - still delivering recognition although with lower confidence.

Voicegain supports grammars in the JSGF and GRXML formats – both grammar standards used by enterprises in IVRs since early 2000s.The recognize API only supports short form audio - no more than 60 seconds.

‍

Developers

CORS Support Added in 1.9.0

Jacek Jarmulak

•

min read

•

July 12, 2020

We have recently added support for CORS (Cross Origin Resource Sharing) in our APIs. This was in response to our customers asking for it in order to enable them building Speech-to-Text web applications with minimal effort. By making web API requests to Voicegain Speech API directly from their web clients the application can be simpler and more efficient.

Examples of simple applications that our customers are implementing this way are: microphone input capture and transcription (e.g. to capture and transcribe meeting notes), or offline-audio file transcription.

Users have full control, via security settings, over which Origins should be allowed to make the CORS requests.