Our Blog

News, Insights, sample code & more!

Contact Center
Voicegain Acquires TrampolineAI to deliver End-to-End Contact Center AI for Healthcare Payers


New unified platform combines AI voice agent automation with real-time agent assistance and Auto QA, enabling healthcare payers to reduce average handle time (AHT) and improve first contact resolution (FCR) in their call centers.

IRVING, Texas and SAN FRANCISCO, Jan. 7, 2026 /PRNewswire-PRWeb/ -- Voicegain, a leader in AI Voice Agents and Infrastructure, today announced the acquisition of TrampolineAI, a venture-backed healthcare payer-focused Contact Center AI company whose products support thousands of member interactions. The acquisition unifies Voicegain's AI Voice Agent automation with Trampoline's real-time agent assistance and Auto QA capabilities, enabling healthcare payers to optimize their entire contact center operation—from fully automated interactions to AI-enhanced human agent support.

Healthcare payer contact centers face mounting pressure to reduce costs while improving member experience. The pressures range from CMS requirements and Medicaid redeterminations to Medicare AEP call volume and staffing shortages. The challenge lies in balancing automation for routine inquiries with personalized support for complex interactions. The combined Voicegain and TrampolineAI platform addresses this challenge by providing a comprehensive solution that spans the full spectrum of contact center needs—automating high-volume routine calls while empowering human agents with real-time intelligence for interactions that require specialized attention.

"We're seeing strong demand from healthcare payers for a production-ready Voice AI platform. TrampolineAI brings deep payer contact center expertise and deployments at scale, accelerating our mission at Voicegain." — Arun Santhebennur

Over the past two years, Voicegain has scaled Casey, an AI Voice Agent purpose-built for health plans, TPAs, utilization management, and other healthcare payer businesses. Casey answers and triages member and provider calls in health insurance payer call centers. After performing HIPAA validation, Casey automates routine caller intents related to claims, eligibility, coverage/benefits, and prior authorization. For calls requiring live assistance, Casey transfers the interaction context via screen pop to human agents.

TrampolineAI has developed a payer-focused Generative AI suite of contact center products—Assist, Analyze, and Auto QA—designed to enhance human agent efficiency and effectiveness. The platform analyzes conversations between members and agents in real-time, leveraging real-time transcription and Gen AI models. It provides real-time answers by scanning plan documents such as Summary of Benefits and Coverage (SBCs) and Summary Plan Descriptions (SPDs), fills agent checklists automatically, and generates payer-optimized interaction summaries. Since its founding, TrampolineAI has established deployments with leading TPAs and health plans, processing hundreds of thousands of member interactions.

"Our mission at Voicegain is to enable businesses to deploy private, mission-critical Voice AI at scale," said Arun Santhebennur, Co-founder and CEO of Voicegain. "As we enter 2026, we are seeing strong demand from healthcare payers for a comprehensive, production-ready Voice AI platform. The TrampolineAI team brings deep expertise in healthcare payer operations and contact center technology, and their solutions are already deployed at scale across multiple payer environments."

Through this acquisition, Voicegain expands the Casey platform with purpose-built capabilities for payer contact centers, including AI-assisted agent workflows, real-time sentiment analysis, and automated quality monitoring. TrampolineAI customers gain access to Voicegain's AI Voice Agents, enterprise-grade Voice AI infrastructure including real-time and batch transcription, and large-scale deployment capabilities, while continuing to receive uninterrupted service.

"We founded TrampolineAI to address the significant administrative cost challenges healthcare payers face by deploying Generative Voice AI in production environments at scale," said Mike Bourke, Founder and CEO of TrampolineAI. "Joining Voicegain allows us to accelerate that mission with their enterprise-grade infrastructure, engineering capabilities, and established customer base in the healthcare payer market. Together, we can deliver a truly comprehensive solution that serves the full range of contact center needs."

A TPA deploying TrampolineAI noted the platform's immediate impact, stating that the data and insights surfaced by the application were fantastic, allowing the organization to see trends and issues immediately across all incoming calls.

The combined platform positions Voicegain to deliver a complete contact center solution spanning IVA call automation, real-time transcription and agent assist, Medicare and Medicaid compliant automated QA, and next-generation analytics with native LLM analysis capabilities. Integration work is already in progress, and customers will begin seeing benefits of the combined platform in Q1 2026.

Following the acquisition, TrampolineAI founding team members Mike Bourke and Jason Fama have joined Voicegain's Advisory Board, where they will provide strategic guidance on product development and AI innovation for healthcare payer applications.

The terms of the acquisition were not disclosed.

About Voicegain

Voicegain offers healthcare payer-focused AI Voice Agents and a private Voice AI platform that enables enterprises to build, deploy, and scale voice-driven applications. Voicegain Casey is designed specifically for healthcare payers, supporting automated and assisted customer service interactions with enterprise-grade security, scalability, and compliance. For more information, visit voicegain.ai.

About TrampolineAI

TrampolineAI is a venture-backed voice AI company focused on healthcare payer solutions. The company applies Generative Voice AI to contact centers to improve operational efficiency, member experience, and compliance through real-time agent assist, sentiment analysis, and automated quality assurance technologies. For more information, visit trampolineai.com.

Media Contact:

Arun Santhebennur
Co-founder & CEO, Voicegain
press@voicegain.ai | arun@voicegain.ai
1 9725180863 701
https://www.voicegain.ai

SOURCE Voicegain

Read more → 
Developers
Streaming Audio data from Contact Center platforms to enable Generative AI Voice apps like Realtime Agent-Assist and AI Co-Pilot

This article outlines options for how developers and builders of real-time Gen AI voice applications in contact centers should design and architect access to streaming audio data from IP-based Contact Center systems. These can be premise-based contact center platforms like Avaya, Cisco, and Genesys, or CCaaS platforms like Five9, Genesys Cloud, NICE CXone and Aircall.

Use Cases for Realtime Generative AI Voice in the Contact Center

One of the main use cases for Realtime Generative Voice AI in a contact center is Realtime Agent Assist (RTAA) or a generative AI Co-Pilot. The first step for any such realtime application is to stream audio from the Contact Center platform to a streaming Speech-to-Text model and get a speaker-separated transcript. This transcript can in turn be integrated with an LLM for real-time sentiment analysis, QA automation, agent assist, summarization and other real-time AI use cases in the contact center.

Voicegain's in-house Kappa model is one such streaming speech-to-text model. The real-time transcript is made available by Voicegain over websockets.
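As a rough illustration, a client consuming such a websocket transcript feed might look like the sketch below (the URL and message field names here are assumptions for illustration, not the documented message schema):

    # Sketch: consuming incremental real-time transcript messages over a
    # websocket. The URL and message fields are illustrative assumptions.
    import asyncio
    import json
    import websockets  # pip install websockets

    async def consume_transcript(ws_url):
        async with websockets.connect(ws_url) as ws:
            async for message in ws:
                event = json.loads(message)
                # Hypothetical fields: speaker channel and incremental text
                print(event.get("channel"), event.get("text"))

    asyncio.run(consume_transcript("wss://api.voicegain.ai/EXAMPLE-SESSION"))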

Architecture Options to get Real-time Audio data

Overall, there are three main approaches to getting access to real-time audio streams:

  • Voicegain SIP Media Stream B2BUA (For On-Premise Systems)
  • SIPREC from the SBC (currently in Beta)
  • Programmable Integration (leveraging APIs provided by CCaaS platforms)

The details of each of these approaches are described below.

SIP Media Stream B2BUA

Most on-premise contact center platforms, like Avaya, Genesys and Cisco, do not provide programmatic access to the media streams. Instead, they all offer the ability to transfer a call to a SIP destination/URI. This in turn can be handled by the Voicegain SIP Media Stream B2BUA; in other words, the Voicegain SIP Media Stream B2BUA can accept a call from such a SIP INVITE.

More details of the SIP Media Stream B2BUA can be found here.

SIPREC from Session Border Controller (currently in Beta)

Most enterprise premise-based Contact Center platforms include a network element called the Session Border Controller (SBC). An SBC can be thought of as a SIP-aware firewall that is architected "in front" of a premise-based IP Contact Center. SBCs support forking of audio streams using a protocol called SIPREC, and this has been used over the years by active/compliance call recording vendors like NICE and Verint.

With SIPREC, an SBC essentially provides a mirror or fork of the real-time RTP stream from the telephone call. This can be sent to Voicegain's SIPREC Server (currently in beta).

Voicegain's beta SIPREC interface has been tested with the following platforms:

  • Avaya Enterprise SBC
  • Ribbon/Sonus SBC
  • Broadsoft SIPREC sipua
  • Cisco Unified Border Element (CUBE)
  • Metaswitch SIPREC sipua - the minimum Metaswitch version that supports SIPREC is 9.0.10
  • Oracle SBC SIPREC - Selective Call Recording SIPREC (oracle.com)
  • Twilio TwiML <Siprec>

Voicegain can capture relevant call metadata in addition to obtaining the audio (the metadata capture functionality may differ in capabilities depending on the client platform).

The Voicegain platform can be configured to automatically launch transcription and speech analytics as soon as a new SIPREC session gets established.

SIPREC support is available both in the Cloud and the Edge (OnPrem) deployments of the Voicegain Platform.

SIPREC is an Enterprise feature of the Voicegain platform and is not included in the base package. Please contact support@voicegain.ai or submit a Zendesk ticket for more information about SIPREC or if you would like to use it with your existing Voicegain account.

Programmable Integration with CCaaS real-time audio streaming APIs

Some CCaaS platforms, in particular the modern ones, provide APIs for programmatic access to the real-time audio stream. In many of them, such a capability was added specifically to simplify integration with Cloud Speech-to-Text services.

Examples of such CCaaS platforms are:

  • Five9 VoiceStream
  • Genesys Audiohook
  • Avaya DMCC (part of Avaya Aura® Application Enablement (AE) Services), which can open RTP streams with the content of the call
  • Cisco Extended Media Forking (XMF), provided by Cisco Unified Communications Gateway Services

The Voicegain Platform integrates with these APIs and supports multiple protocols that allow for flexible programmable integration:

  • websockets - sending binary audio data over websocket is supported, as shown in the sketch after this list. In addition to binary data, the message protocols used by Twilio and SignalWire for audio streaming over websockets are also supported. (If required, we can easily add support for additional message protocols.)
  • gRPC - binary audio data may also be sent using the gRPC protocol. Note that this capability is currently in beta.
  • plain RTP - Voicegain also supports plain RTP. The IP/port/encoding negotiation, however, has to be done using our HTTP API. We do not support RTCP or RTSP. The HTTP API is very simple, and some of our customers have already integrated this type of plain RTP streaming using XMF within the Cisco UC environment.

All these protocols support uLaw, aLaw, and linear 16-bit encodings at either 8 or 16 kHz sample rates.
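For the websocket option, a minimal sender might look like the sketch below (the endpoint URL, pacing, and chunk size are illustrative assumptions; raw binary frames are sent as described above):

    # Sketch: streaming raw binary audio (16-bit linear PCM, 8 kHz, mono)
    # over a websocket in real-time-sized chunks. The URL is illustrative.
    import asyncio
    import websockets

    async def stream_audio(ws_url, pcm_path, chunk_ms=100):
        bytes_per_chunk = int(8000 * 2 * chunk_ms / 1000)  # rate * 2 bytes
        async with websockets.connect(ws_url) as ws:
            with open(pcm_path, "rb") as f:
                while chunk := f.read(bytes_per_chunk):
                    await ws.send(chunk)                  # one binary frame
                    await asyncio.sleep(chunk_ms / 1000)  # pace to real time

    asyncio.run(stream_audio("wss://api.voicegain.ai/EXAMPLE-AUDIO-IN", "call.raw"))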

Contact us to discuss or brainstorm!

If you are building a voice Gen AI application and would like to discuss getting access to real-time audio data, please contact us at support@voicegain.ai.

Read more → 
Speech Analytics
PII Text and Audio Redaction now available in Speech Analytics API

Our latest release (1.24.0) expands the Voicegain Speech Analytics and Transcription APIs with the ability to redact sensitive data both in the transcript and in the audio. This allows our customers to comply with standards like HIPAA, GDPR, CCPA, PCI and PIPEDA.

Any of the following types of Named Entities can be redacted in the transcript text and/or the audio file.

  • ADDRESS - Postal address.
  • CARDINAL - Numerals that do not fall under another type.
  • CC - Credit card number.
  • DATE - Absolute or relative dates or periods.
  • EMAIL - (coming soon) Email address.
  • EVENT - Named hurricanes, battles, wars, sports events, etc.
  • FAC - Buildings, airports, highways, bridges, etc.
  • GPE - Countries, cities, states.
  • MONEY - Monetary values, including unit.
  • NORP - Nationalities or religious or political groups.
  • ORDINAL - "first", "second", etc.
  • ORG - Companies, agencies, institutions, etc.
  • PERCENT - Percentage, including "%".
  • PERSON - People, including fictional.
  • PHONE - (coming soon) Phone number.
  • QUANTITY - Measurements, as of weight or distance.
  • SSN - Social Security number.
  • TIME - Times smaller than a day.
  • ZIP - (coming soon) Zip code (if not part of an address).

In the audio, redacted entities are replaced with silence; in the transcript, they are replaced with a string specified when making the API request.
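As a sketch of what this looks like in practice (the endpoint and field names below are illustrative assumptions; the API reference is authoritative), a request enabling redaction might resemble:

    # Sketch: enabling PII redaction in a speech analytics request.
    # Endpoint and field names are illustrative assumptions.
    import requests

    payload = {
        "audio": {"source": {"fromUrl": "https://example.com/call.wav"}},
        "settings": {
            "redaction": {
                "transcript": {"entities": ["CC", "SSN", "ADDRESS"],
                               "replacement": "[REDACTED]"},
                "audio": {"entities": ["CC", "SSN"]},  # replaced with silence
            }
        },
    }
    resp = requests.post("https://api.voicegain.ai/v1/sa", json=payload,
                         headers={"Authorization": "Bearer <JWT>"})
    print(resp.json())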

This feature is supported both in Cloud and on the Edge (on-prem).

Two typical use cases are:

  • Enable redaction as part of normal processing of, e.g., call center calls.
  • Bulk-process previously unredacted audio in storage to achieve compliance. Combined with the low per-minute price of Voicegain APIs, this allows our customers to cost-effectively process large quantities of audio data.


Read more → 
Languages
Voicegain offers Spanish Speech-to-Text

Last week we announced that Spanish Speech-to-Text capability would be available from Voicegain in March. We are pleased to announce today that we completed training of the Spanish Neural Network Model earlier than expected, and Spanish Speech-to-Text was released last Saturday (2/20) as part of our Release 1.24.0.

We completed work on the Spanish model from start to finish in under three weeks - we started working on it on February 3rd. Such fast progress was possible because of our extensive experience with customization of Neural Network Models for speech recognition, and because we have developed advanced tools and proven techniques that make speech-to-text model development and training fast.

The recognition accuracy of the model depends on the type of speech audio. For most benchmark files, our Spanish model's accuracy is just a few percent behind that of the Google or Amazon recognizers. The advantages of our recognizer are its significantly lower price plus the ability to train customized acoustic models; custom models can achieve accuracy higher than that of Amazon or Google. We encourage you to use our Web Console and/or API to test the real-life performance on your own data. Note that we are focusing this speech-to-text model on Latin American Spanish.

Of course, the Voicegain platform offers other advantages too, like support for Edge (on-prem) deployments and an extensive API with many options for out-of-the-box integration into, e.g., telephony environments.

Currently, the Speech-to-Text API is fully functional with the Spanish Model. Some of the Speech Analytics API functions, e.g. Named Entity Recognition and Sentiment/Mood detection, are not yet available for Spanish.

Initially, the Spanish Model is available only in the version that supports off-line transcription. A real-time version of the model will be available in the near future.

To tell the API that you want to use the Spanish Acoustic Model, all you need to do is choose it in the Context settings. Spanish models have 'es' in the name, e.g. VoiceGain-ol-es:1.
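For example, a settings fragment selecting the Spanish model might look like the sketch below (the exact setting name is an illustrative assumption; what matters is choosing a model with 'es' in its name):

    # Sketch: selecting the Spanish off-line acoustic model by name.
    # The "acousticModel" setting name is an illustrative assumption.
    settings = {
        "asr": {
            "acousticModel": "VoiceGain-ol-es:1"  # 'es' marks Spanish models
        }
    }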

Read more → 
Telephony
Unique feature: RTP streaming support

The Voicegain speech-to-text platform has supported RTP streaming from the very beginning. One of our first applications, several years ago, was live transcription with the ffmpeg utility used to capture audio from a device and stream it to the Voicegain platform over RTP. Over time we added more robust protocols and RTP was rarely used. Recently, however, in one of our deployments we came across a use case where RTP streaming allowed our customer to do the integration in a very straightforward way within a call-center telephony stack.

The Voicegain platform does support more advanced streaming protocols for call-center use, like SIPREC or SIP/RTP (SIP INVITE). In this particular case, however, we were able to stream from a Cisco CUBE directly to Voicegain using plain RTP. Upon receiving an incoming call, a script is triggered which uses HTTP to establish a new Voicegain transcription session. The session response returns the ip:port parameters of the RTP receiver specific to that session, and these are passed to the CUBE to establish a direct RTP connection.

RTP used like this provides no authentication or security, which makes it generally unsuitable for use over the Internet. In this particular use case, however, our customer benefits from the fact that the entire Voicegain stack can be deployed on-prem. Because it is on the same isolated network as the CUBE, there are no issues with security or packet loss.

An example

You can visit our github to see a Python code example which shows how to establish the speech-to-text session, how to point the RTP sender to the receiver endpoint, and how to receive real-time transcription results via a websocket.

The command to establish the session is as simple as this:
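(A minimal sketch; field names are illustrative assumptions - see the github example for the exact request.)

    # Sketch: establish a real-time transcription session with plain RTP
    # audio input and websocket result delivery. Field names are
    # illustrative assumptions; see the github example for exact syntax.
    import requests

    payload = {
        "audio": {
            "stream": {"protocol": "RTP", "format": "L16", "rate": 8000}
        },
        "websocket": {}  # section configuring websocket result delivery
    }
    resp = requests.post("https://api.voicegain.ai/v1/asr/transcribe/async",
                         json=payload,
                         headers={"Authorization": "Bearer <JWT>"})
    session = resp.json()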


The audio section defines the RTP streaming part, and the websocket section defines how the results will be sent back over a websocket.

The response looks like this:
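(Again a sketch, with field names inferred from the description below:)

    {
      "sessionId": "0-example",
      "audio":     { "stream": { "ip": "10.0.0.5", "port": 49170 } },
      "websocket": { "url": "wss://api.voicegain.ai/..." }
    }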

In the github example, the stream.ip and stream.port are passed to ffmpeg, which is used as the RTP streaming client. The example further illustrates how to process the messages with incremental transcription results sent in real time over the websocket.

Read more → 
Speech Analytics
Voicegain Speech Analytics API Generally Available

Voicegain has released its Speech Analytics (SA) API, which supports a variety of analytics tasks performed on audio or on the transcript of that audio. The features supported by the Voicegain SA API were chosen to support our main target use case: processing call center calls.


Things that Speech Analytics can do now (from release 1.22.0)

The current release supports offline Speech Analytics. The data that can be obtained through Speech Analytics API is listed below.

Note: here we do not include data that can also be obtained from our Transcribe API, like the transcript, decibel values, audiozones, etc. These will, however, be accessible from the Speech Analytics API response.

Per channel analytics:

  • gender - likely gender of the speaker based on the voice characteristics. Currently either "male" or "female".
  • emotion - both totals over the entire call and a list of values computed at multiple places in the transcript. Each item contains: (1) sentiment - from -1.0 (mad/angry) to +1.0 (happy/satisfied); (2) mood - a map with estimated values (range 0.0 to 1.0) for the following moods: "neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"; (3) location - start and end in msec and the index of the word
  • Named Entities recognized in the call. This is a list with the entity type and the location in the call. Supported NER values are: CARDINAL - numerals that do not fall under another type; DATE - absolute or relative dates or periods; EVENT - named hurricanes, battles, wars, sports events, etc.; FAC - buildings, airports, highways, bridges, etc.; GPE - countries, cities, states; MONEY - monetary values, including unit; NORP - nationalities or religious or political groups; ORDINAL - "first", "second", etc.; ORG - companies, agencies, institutions, etc.; PERCENT - percentage, including "%"; PERSON - people, including fictional; QUANTITY - measurements, as of weight or distance; TIME - times smaller than a day.
  • keywords - list of keywords or keyword groups recognized in the call. Keywords to be recognized can easily be configured from examples.
  • profanity - this is essentially a predefined keyword group
  • talk metrics - things like maximum and average talk streak, talk rate, energy
  • overtalk metrics - overtalk happens when a speaker starts speaking while the other speaker is already speaking.

Global analytics:

  • silence metrics - defined as time when neither channel is speaking. Note: only the Agent is assumed to be in control of the speaking time. This is a simplification, but it is difficult to determine if any silence was caused by the caller and was unavoidable.
  • word cloud frequencies - smart word cloud data with stop words removed and word variations collapsed before computing frequencies
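To make the shape of this data concrete, a per-channel fragment of a response might look roughly like the sketch below (field names are assumptions based on the lists above; the API specification is authoritative):

    {
      "channel": "agent",
      "gender": "female",
      "emotion": [
        { "sentiment": -0.4,
          "mood": { "angry": 0.61, "neutral": 0.22, "calm": 0.05 },
          "location": { "start": 12300, "end": 14150, "wordIndex": 37 } }
      ],
      "entities": [ { "type": "DATE", "start": 20400, "end": 21050 } ],
      "keywords": [ { "group": "cancellation", "count": 2 } ],
      "overtalk": { "count": 3, "totalMsec": 4200 }
    }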

Speech Analytics features coming soon

Real-time Speech Analytics will be available in the near future. Soon we also plan to release Score Card support for Speech Analytics.

Per channel analytics coming soon:

  • Two additional named entities: CC - Credit Card, SSN - Social Security number
  • age - estimated age of the speaker based on the voice characteristics. Three possible values: "young-adult", "senior", "unknown"
  • phrases - list of phrases or phrase groups recognized in the call. These are identified using NLU algorithms - essentially the same as those used for identifying NLU intents. Phrases to be recognized can be configured from examples.
  • pitch statistics will be added to talk metrics

Additionally, we will soon support PII redaction of any named entity in either the transcript or the audio.

Supported audio types

Speech Analytics API supports the following types of audio input:

  • 2-channel (stereo) audio, as typically found in call centers, where the Caller's voice is recorded in one channel and the Agent's voice in the other. Some metrics, e.g. overtalk, can only be computed if the input audio is of this type.
  • 1-channel audio with two speakers - for this audio type, diarization is performed to separate the two speakers. The per-channel analytics are performed after diarization. Overtalk metrics are not available for this use case.

You can see the API specification here.

Read more → 
ASR
Combining grammar-based and large vocabulary speech recognition

In this blog post we present a unique feature of the Voicegain speech-to-text platform that efficiently combines the use of grammars with the use of large vocabulary models to provide developers with the ability to achieve high recognition accuracy in a very efficient and convenient way.

Two Types of Speech Recognition

Speech recognition (ASR) systems can generally be divided into two types:


Large Vocabulary Continuous Speech Recognition

This type of recognizer is generally used for transcription, where the vocabulary is very broad and the length of the speech audio is unlimited (except for practical, e.g. resource-related, limits). Typical components and processing steps of such a system are illustrated below:

The working of such a system is as follows: (a) The audio signal is processed into features. (b) The features are fed into an acoustic model processor, which converts data from the acoustic realm to the text/linguistic realm or some other intermediate realm (e.g. audio embeddings). The output values may be phonemes, letters, word pieces, audio embeddings, etc., presented as vectors of probabilities. (c) These vectors are then passed to a search/optimization component, which uses the language model to decide which hypotheses formed from the output of the previous stage are most likely to be the correct textual interpretation of the input speech audio.


The Language Models used may take a variety of forms. Two of the many possible manifestations are: (a) ARPA language models, which are n-gram based, and (b) neural network language models, where a neural network (e.g. an RNN) is trained to represent a language model. Some language models can also incorporate a decoder part if the acoustic model output is encoded (e.g. if it is represented by acoustic embeddings).


Because the vocabulary of this type of recognizer is large, it is prone to misrecognitions. This is particularly the case for short utterances that do not provide much context for the language model to sufficiently constrain the hypotheses. An example would be misrecognizing "card" as "car" if that is the only word said and the speaker has a specific accent.


Cloud speech-to-text offerings from the big cloud providers - Google, Amazon, and Microsoft - are all examples of Large Vocabulary ASRs.


Grammar-Based Speech Recognition

In such a system, the Voice Bot/IVR developer uses a context-free grammar to define the set of possible utterances that can be recognized. Grammars are typically defined using the SRGS (Speech Recognition Grammar Specification) standard - either ABNF or GRXML. Other grammar formats in use are JSGF (JSpeech Grammar Format) and GSL (Nuance's Grammar Specification Language).


Components and processing steps of a typical speech recognition system that uses such grammars are illustrated below:

In this system, the evaluation of the output from the acoustic model processing is done by a search/optimizer that uses the rules contained in the grammar to decide which hypotheses are acceptable. Only utterances that can be generated from the grammar may be output.


If an utterance outside the grammar is spoken and presented to the recognizer, it may still be recognized, but with low confidence. If the confidence is below a set threshold, a NOMATCH will be returned.


The obvious disadvantage of such a recognizer is that it will not recognize utterances outside the scope of the grammar; such utterances are called Out-of-Grammar utterances. A big advantage, however, is that it is less prone to misrecognition when an utterance that is spoken has been anticipated and is included in the grammar.


An additional advantage of a grammar-based recognizer is that most grammars allow for the insertion of semantic tags, which let the grammar not only define an utterance but also the semantic interpretation of that utterance.
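For example, a simple JSGF rule with semantic tags (an illustrative sketch) can return the tag in braces as the interpretation, regardless of which phrasing was spoken:

    #JSGF V1.0;
    grammar account;
    public <account> = ( credit card | card ) {card}
                     | ( checking | checking account ) {checking};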


Examples of such grammar-based speech recognition systems would be speech-to-text offerings like Nuance ASR or LumenVox ASR.


Combining grammar-based and large vocabulary recognition


Clearly, both types of speech recognition systems have advantages and disadvantages. It hence seems natural that a combination of the two could have the advantages of both while avoiding some of the disadvantages.


Approach using a combination of existing ASRs


A simple approach would be to combine two different speech recognition systems. One would need to create two speech recognition sessions and split the incoming audio stream so that each session is fed a copy of incoming audio. Those two sessions would process the audio separately and would output separate results that would then need to be combined. This is illustrated below:


Disadvantages of using two ASR sessions


The setup as presented above has several disadvantages:

  1. It introduces complexity in the streaming of the audio to the recognizer. An additional proxy-like component needs to be added that splits the audio stream and feeds it to two separate ASR systems.
  2. Combining the results also requires a new separate component. This is not necessarily trivial, because the two disconnected ASR systems do their end-pointing differently, meaning that the results will arrive at different times.
  3. Extra compute resources will be needed to support running two separate ASR systems instead of just one.
  4. Another disadvantage is having to pay double the license fees, as each ASR will require a separate session license.


Voicegain approach


The Voicegain platform provides a speech recognition system that combines both types of speech recognition to benefit from the advantages of both. Our system is illustrated in the figure below:

In this system, the processing up to the output of the acoustic model is essentially identical to the processing done in the systems depicted in the first two figures of this post. After that step, however, Voicegain includes a novel Search/Optimization module that uses both the grammar and the large vocabulary language model to generate the final recognition results. The end-pointing is performed in a way that is similar to a grammar-based recognizer, as that seems to make the most sense given the use case (but this can be modified). The final recognition result comprises the n-best results from the grammar-based recognition, if the grammar did MATCH, and one or more hypotheses from the large vocabulary recognition.


The application developer may decide how to use the recognition result. For example, the confidence values may be used to determine whether the grammar-based result or the large vocabulary result should be used at a given point in the application.


With Voicegain's release 1.22.0, this feature is Generally Available as part of our Recognize API.


An example request using our /asr/recognize/async API looks like this:
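(A sketch with illustrative field names; the exact request syntax is documented in the API reference.)

    {
      "audio": { "stream": { "protocol": "WEBSOCKET" } },
      "settings": {
        "asr": {
          "grammar": [
            { "type": "JSGF",
              "jsgf": "#JSGF V1.0; grammar yn; public <yn> = yes {YES} | no {NO};" },
            { "type": "BUILT-IN", "name": "transcribe" }
          ]
        }
      }
    }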


As you can see, there is just one definition for the incoming audio stream. The grammar section of settings.asr contains two grammar definitions:

  • one is a standard JSGF grammar with literal tag format semantics,
  • the other is actually not a grammar but a command to turn on large vocabulary transcription for this session: {type: BUILT-IN, name: transcribe}

MRCP Use Case

In addition to being available in our STT API and Telephone Bot API, the ability to support both grammar-based and large vocabulary recognition at the same time is available via the MRCP interface. For example, from VXML you can pass both a GRXML grammar and the builtin:speech/transcribe grammar, and you will receive both the GRXML result and the large vocabulary result.
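A sketch of how this might look in a VXML field (illustrative only; the grammar file name and prompt are assumptions):

    <field name="request">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <grammar src="builtin:speech/transcribe"/>
      <prompt>How can I help you?</prompt>
    </field>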

If you are building an Intelligent Voice Assistant, Voice Bot, Speech IVR application, or any other application that could benefit from this feature, please contact us via email at info@voicegain.ai to engage in a more in-depth discussion.


Read more → 