Blog | Speech-to-Text Platform

Announcement

Voicegain Appoints Tracy Puleo as Vice President of Sales to Accelerate Voice AI growth in Healthcare Call Centers

Arun Santhebennur

June 10, 2026

DALLAS, June 9, 2026 /PRNewswire-PRWeb/ -- Voicegain, a leading provider of AI-powered voice solutions for healthcare payers and contact centers, today announced the appointment of Tracy Puleo as Vice President of Sales.

In this role, Tracy will lead Voicegain's sales strategy, revenue growth initiatives, and customer acquisition efforts as the company rapidly scales its presence among health plans (Commercial, Medicaid, and MA), third-party administrators (TPAs), and healthcare organizations seeking to transform member and provider experiences through generative Voice AI.

Tracy will lead sales for Voicegain Casey, a healthcare payer-focused software suite of three products that span the entire caller journey: (1) Conversational AI Voice Agents, (2) Real-time Agent Assist (AI Co-Pilot), and (3) AI-Powered QA and Coaching Automation and Voice-of-Customer analytics. With the Voicegain Casey suite, healthcare organizations can elevate the member experience while lowering the operating costs of call centers.

Voicegain has rapidly emerged as a trusted AI partner for healthcare payer organizations, with Voicegain Casey being used by over a dozen companies including Alliance Health, Samaritan Health, UnitedAg and Cottingham & Butler. Casey enables these organizations to augment their call center staff with real-time AI-powered guidance and to automate routine member and provider inquiries such as claims, eligibility and prior authorization status. Casey also analyzes 100% of voice customer interactions, generates an automated QA score, and extracts caller sentiment, CSAT and other Gen AI-powered insights.

Voicegain Casey is built on the Voicegain platform, a leading privacy-first Voice AI platform that transcribes over 3 billion minutes of audio for leading enterprises and mid-market companies. It is HIPAA, PCI and SOC-2 compliant and supports PII redaction, speaker diarization and 99 languages.

"Tracy is a proven healthcare sales leader with a strong track record of building relationships, delivering results, and helping healthcare organizations solve complex challenges," said Arun Santhebennur, Co-Founder and CEO of Voicegain. "Health plans face significant pressures to maintain and improve their HEDIS/STAR ratings and lower their administrative costs. They are looking for practical and proven AI solutions that improve member experience, increase operational efficiency, and drive measurable outcomes. Tracy's expertise and leadership will be instrumental in helping us accelerate our growth and expand our impact across the healthcare industry."

Tracy brings extensive experience in healthcare technology, payer engagement, customer experience, and enterprise sales and has led Sales for organizations like Zipari, Vimly Benefits, and ClickBoarding. Throughout her career, she has successfully partnered with health plans and healthcare organizations to implement innovative software solutions that increase member satisfaction, enhance operational performance, and support organizational growth.

"I am excited to join Voicegain at such a pivotal time," said Tracy Puleo. "Healthcare organizations are under tremendous pressure to improve member experiences while controlling costs and increasing efficiency. Voicegain Casey addresses these challenges in a meaningful way, and I look forward to working with our customers and partners to help them realize the full value of AI-driven engagement."

As Vice President of Sales, Tracy will focus on expanding Voicegain's customer base with healthcare payers, strengthening strategic partnerships, and helping organizations leverage AI to improve outcomes for members, providers, and contact center teams.

About Voicegain

Voicegain is a healthcare-focused Voice AI company that offers AI Voice Agents, Real-time Agent Assist, Voice-of-Customer analytics and automated quality assurance solutions. These products are designed to improve contact center efficiency and performance and to elevate the member experience.

Media Contact

Arun Santhebennur

Co-founder & CEO, Voicegain

Email: Arun@voicegain.ai

Website: https://www.voicegain.ai

Media Contact

Arun Santhebennur, Voicegain, 1 9725180863 701, arun@voicegain.ai, https://www.voicegain.ai/conversational-ivr

SOURCE Voicegain

Developers

Python script for testing automated speech recognition (ASR) accuracy

Jacek Jarmulak

November 2, 2020

Many of our customers have been asking us for help in benchmarking the Voicegain speech-to-text recognizer (ASR) on their specific audio files. To make this benchmarking easier we have released a Python script that accomplishes just that. With a single command line you can transcribe all audio files from the input directory and compare them against reference transcripts, calculating the word error rate (WER) for each file. You can also run a 2-way comparison, scoring both the Voicegain transcript and the Google Speech-to-Text transcript against the reference.

The script and the documentation are available at: https://github.com/voicegain/platform/tree/master/utility-scripts/test-transcribe

See our benchmark blog post to give you an idea of what kind of accuracy to expect from the Voicegain recognizer.
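
For readers who want to see what a per-file WER number means, here is a minimal sketch of the standard word-level calculation. It is not the code from the script linked above. The function name and the simple lowercase-and-split normalization are assumptions made for illustration, and the actual script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative only: normalization here is just lowercasing and whitespace
    splitting, which is an assumption and not necessarily what the script does.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Accuracy figures such as those quoted in our benchmark posts are usually reported as 100% minus WER, so a WER of 0.2 corresponds to 80% accuracy.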

‍

Model Training

Custom ASR with Acoustic Model Training - Two Case Studies

Jacek Jarmulak

September 27, 2020

Updated: Feb 28 2022

In this blog post we describe two case studies to illustrate improvements in speech-to-text or ASR recognition accuracy that can be expected from training of the underlying acoustic models. We trained our acoustic model to recognize Indian and Irish English better.

Case study setup

The Voicegain out-of-the-box Acoustic Model which is available as default on the Voicegain Platform had been trained to recognize mainly US English although our training data set did contain some British English audio. The training data did not contain Indian and Irish English, except for maybe accidental occurrences.

Both case studies were performed in an identical manner:

Training data contained about 300 hours of transcribed speech audio.
Training was done to get improved accuracy on the new type of data but at the same time to also retain the baseline accuracy. An alternative would have been to aim for maximum improvement on the new data at expense of accuracy of the baseline model.
Training was stopped after significant improvement was achieved. It could have been continued to achieve further improvement, although that might have been marginal.
Benchmarks presented here were done on data that was not included in the training set.

Case Study 1: Indian English

Here are the parameters of this study.

We had 250 hours of audio containing male and female speakers, each speaker reading about 50 minutes worth of speech audio.
We separated 6 speakers for the benchmark, selecting 3 male and 3 female samples. Samples were selected to contain easy, medium, and difficult test cases.

Here are the results of the benchmark before and after training. For comparison, we also include results from Google Enhanced Speech-to-Text.

Some observations:

All 6 test speakers show significant improvement over the original accuracy.
After training, the accuracy for 5 speakers is better than Google Enhanced Speech-to-Text. The one remaining speaker improved a lot (from 62% to 76%) but the accuracy was still not as good as Google's. We examined the audio and found that it was not recorded properly. The speaker was speaking very quietly and the microphone gain was set very high - this resulted in the audio containing a lot of strange artifacts, such as tongue clicking. The speaker also read the text in a very unnatural "mechanical" way. Kudos to Google for doing so well on such a bad recording.
On average, custom-trained Voicegain speech-to-text was better by about 2% on our Indian English benchmark compared to the Google Enhanced recognizer.

Case Study 2: Irish English

Here are the parameters of this study.

We collected about 350 hours of transcribed speech audio from one speaker from Northern Ireland.
For the benchmark we retained some audio from that speaker that was not used for training plus we found audio from 5 other speakers with various types of Irish English accents.

Here are the results of the benchmark before and after training. We also include results from Google Enhanced Speech-to-Text.

‍

‍

Some observations:

The speaker that was used for training is labeled here as 'Legge'. We see a huge improvement after training, from 76.2% to 88.5%, which is significantly above Google Enhanced at 83.9%.
The other speaker with over 10% improvement is 'Lucas', who has a very similar accent to 'Legge'.
We looked in detail at the audio of the speaker labeled 'Cairns', who had the least improvement and for whom Google was better than our custom-trained recognizer. The audio has significantly lower quality than the other samples and it contains noticeable echo. Its audio characteristics are quite different from those of the training data.
On average, custom-trained Voicegain speech-to-text was better by about 1% on our Irish English benchmark compared to the Google Enhanced recognizer.

Further Observations

The amount of data used in training, at 250-350 hours, was not large given that Acoustic Models for speech recognition are normally trained on tens of thousands of hours of audio.
The large improvement on the 'Legge' speaker suggests that if the goal is to improve recognition on a very specific type of speech or speaker, the training set could be smaller, maybe 50 to 100 hours, and still achieve significant improvement.
A bigger training set - 500 hours or more - may be needed in cases where the variability of speech and other audio characteristics is large.

UPDATE Feb 2022

We have published 2 additional studies showing the benefits of Acoustic Model training:

Interested in Voicegain? Take us for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.

CPaaS

Large vocabulary transcription for Twilio developers

Jacek Jarmulak

September 27, 2020

In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

Starting from release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.

The reasons we think it will be attractive to Twilio users are:

lower cost per speech-to-text capture
higher accuracy for customers who choose Acoustic Model customization
access to all speech-to-text hypotheses in word-tree output mode

Using Voicegain as an alternative to <Gather> will have similar steps to using Voicegain for grammar-based recognition - these are listed below.

Initiating Speech Transcription with Voicegain

This is done by invoking Voicegain async transcribe API: /asr/transcribe/async

Below is an example of the payload needed to start a new transcription session:

‍

Some notes about the content of the request:

we are requesting the callback to return transcript in text form - other options are possible like words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
startInputTimers tells ASR to delay start of timers - they will be started later when the question prompt finishes playing
TWIML is set as the streaming protocol with the format set to PCMU (u-law) and sample rate of 8kHz
asr settings include the two timeouts used in transcription - the no-input and complete timeouts.

This request, if successful, will return the websocket url in the audio.stream.websocketUrl field. This value will be used in making a TwiML request.

Note, in the transcribe mode DTMF detection is currently not possible. Please let us know if this is something that would be critical to your use case.
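
To make the notes above more concrete, here is a rough Python sketch of how such a request body might be assembled and how the websocket URL could be read from the response. Only startInputTimers, the TWIML protocol, PCMU at 8 kHz, the two timeouts, the "content": {"full": ["transcript"]} setting and the audio.stream.websocketUrl response field come from this post. Every other field name, the nesting, and the timeout values are assumptions for illustration, so check the API documentation before relying on them.

```python
def build_transcribe_request(callback_url: str) -> dict:
    """Assemble an illustrative body for POST /asr/transcribe/async.

    Field names and nesting are ASSUMPTIONS except where marked
    "from the post"; the timeout values are placeholders.
    """
    return {
        "sessions": [{                                  # assumed wrapper
            "asyncMode": "REAL-TIME",                   # assumed field and value
            "callback": {"uri": callback_url},          # assumed field name
            "content": {"full": ["transcript"]},        # from the post
        }],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},  # protocol from the post, nesting assumed
            "format": "PCMU",                           # u-law, from the post
            "rate": 8000,                               # 8 kHz, from the post
        },
        "settings": {
            "asr": {
                "startInputTimers": False,              # name from the post, placement assumed
                "noInputTimeout": 5000,                 # placeholder value (ms assumed)
                "completeTimeout": 2000,                # placeholder value (ms assumed)
            }
        },
    }


def extract_websocket_url(response: dict) -> str:
    """Read audio.stream.websocketUrl from the parsed JSON response."""
    try:
        return response["audio"]["stream"]["websocketUrl"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no audio.stream.websocketUrl") from exc
```

Keeping the URL extraction in its own function makes it easy to fail loudly when a request was not successful, before any TwiML is generated.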

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

‍

Some notes about the content of the TwiML request:

the websocket URL is the one returned from Voicegain /asr/transcribe/async request
more than one question prompt is supported - they will be played one after another
three types of prompts are supported: 01) recording retrieved from a URL, 02) TTS prompt (several voices are available), 03) 'clip:' prompt generated using Voicegain Prompt Manager which supports dynamic concatenation of prerecorded prompts
bargeIn is enabled - prompt playback will stop as soon as caller starts speaking
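
As an illustration of the points above, the following sketch builds a <Connect><Stream> document with Python's standard xml.etree.ElementTree module. The <Response>, <Connect>, <Stream> and <Parameter> elements are standard TwiML. The parameter names used here for the prompts ("prompt01", "prompt02", ...) and for barge-in are assumptions made for this example, as is the function name.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url: str, prompts: list, barge_in: bool = True) -> str:
    """Return TwiML that connects the call audio to the given websocket URL.

    The custom parameter names ("promptNN", "bargeIn") are hypothetical;
    use the names given in the Voicegain documentation.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The URL is the one returned in audio.stream.websocketUrl.
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})

    # Several prompts may be given; they are played one after another.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter",
                      {"name": f"prompt{index:02d}", "value": prompt})

    # With barge-in enabled, playback stops as soon as the caller speaks.
    ET.SubElement(stream, "Parameter",
                  {"name": "bargeIn", "value": str(barge_in).lower()})
    return ET.tostring(response, encoding="unicode")
```

Building the document with an XML library rather than string concatenation means special characters in the websocket URL or prompt values, such as &, are escaped correctly.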

Returned Transcription Response

Below is an example response from the transcription in the case where "content": {"full": ["transcript"]}.

‍

Use Cases

Live Transcription Example

Jacek Jarmulak

September 27, 2020

We want to share a short video showing live transcription in action at CBC. This one is using our baseline Acoustic Model. No customizations were made, no hints used. This video gives an idea of what latency is achievable with real-time transcription.

‍

Automated real-time transcription is a great solution for accommodating hearing-impaired people when no sign-language interpreter is available. It can be used, e.g., at churches to transcribe sermons, at conventions and meetings to transcribe talks, at educational institutions (schools, universities) to live transcribe lessons and lectures, etc.

‍

Voicegain Platform provides a complete stack to support live transcription:

Utility for audio capture at source
Cloud based or On-Prem transcription engine and API
Web portal for controlling multiple simultaneous live transcriptions
Web-based viewer app to enable following the transcription on any device with web browser. This app can also be embedded into any web page.

Very high accuracy - above that provided by Google, Amazon, and Microsoft Cloud speech-to-text - can be achieved through Acoustic Model customization.

CPaaS

How to use Voicegain with Twilio Media Streams

Jacek Jarmulak

September 15, 2020

Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.

The differences between Voicegain speech recognition and Twilio TwiML <Gather> are:

Voicegain supports grammars with semantic tags (GRXML or JSGF) while <Gather> is a large vocabulary recognizer that just returns text, and
Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).

When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.

Each recognition will involve two main steps described below:

Initiating Speech Recognition with Voicegain

This is done by invoking Voicegain async recognition API: /asr/recognize/async

Below is an example of the payload needed to start a new recognition session:

‍

Some notes about the content of the request:

startInputTimers tells ASR to delay start of timers - they will be started later when the question prompt finishes playing
TWIML is set as the streaming protocol with the format set to PCMU (u-law) and sample rate of 8kHz
asr settings include the three standard timeouts used in grammar based recognition - no-input, complete, and incomplete timeouts
grammar is set to GRXML grammar loaded from an external URL

This request, if successful, will return the websocket url in the audio.stream.websocketUrl field. This value will be used in making a TwiML request.

Note, if the grammar is specified to recognize DTMF, the Voicegain recognizer will recognize DTMF signals included in the audio sent from Twilio Platform.
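
The grammar-based request differs from a plain transcription request mainly in the grammar reference and the third (incomplete) timeout. The sketch below shows one way such a body might look in Python. Only startInputTimers, TWIML, PCMU at 8 kHz, the three timeout types and the GRXML-from-URL grammar come from this post. All other field names, the nesting, and the timeout values are assumptions for illustration.

```python
def build_recognize_request(callback_url: str, grammar_url: str) -> dict:
    """Assemble an illustrative body for POST /asr/recognize/async.

    Field names and nesting are ASSUMPTIONS unless marked "from the post";
    timeout values are placeholders.
    """
    return {
        "sessions": [{                                  # assumed wrapper
            "asyncMode": "REAL-TIME",                   # assumed field and value
            "callback": {"uri": callback_url},          # assumed field name
        }],
        "audio": {
            "source": {"stream": {"protocol": "TWIML"}},  # protocol from the post
            "format": "PCMU",                           # u-law, from the post
            "rate": 8000,                               # 8 kHz, from the post
        },
        "settings": {
            "asr": {
                "startInputTimers": False,              # name from the post, placement assumed
                # GRXML grammar loaded from an external URL (field names assumed)
                "grammars": [{"type": "GRXML", "fromUrl": {"url": grammar_url}}],
                "noInputTimeout": 5000,                 # placeholder values
                "completeTimeout": 2000,
                "incompleteTimeout": 3000,
            }
        },
    }
```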

TwiML <Connect><Stream> request

After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:

‍

Some notes about the content of the TwiML request:

the websocket URL is the one returned from Voicegain /asr/recognize/async request
more than one question prompt is supported - they will be played one after another
three types of prompts are supported: 01) recording retrieved from a URL, 02) TTS prompt (several voices are available), 03) 'clip:' prompt generated using Voicegain Prompt Manager which supports dynamic concatenation of prerecorded prompts
bargeIn is enabled - prompt playback will stop as soon as caller starts speaking

Returned Recognition Response

Below is an example response from the recognition. This response is from built-in phone grammar.

‍

Benchmark

Speech-to-Text Accuracy Benchmark Revisited

Jacek Jarmulak

November 13, 2020

Some of the feedback that we received regarding the previously published benchmark data, see here and here, concerned the fact that the Jason Kincaid data set contained some audio that produced terrible WER across all recognizers, and that in practice no one would use automated speech recognition on such files. That is true. In our opinion, there are very few use cases where WER worse than 20%, i.e. where on average 1 in every 5 words is recognized incorrectly, is acceptable.

New Methodology

For this blog post we have removed from the reported set those benchmark files for which none of the recognizers tested could deliver a WER of 20% or less. This criterion resulted in the removal of 10 files - 9 from the Jason Kincaid set of 44 and 1 file from the rev.ai set of 20. The files removed fall into 3 categories:

recordings of meetings - 4 files (this amounts to half of the meeting recordings in the original set),
telephone conversations - 4 files (4 out of 11 phone conversations in the original set),
multi-presenter, very animated podcasts - 2 files (there were a lot of other podcasts in the set that did meet the cut off).
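
The selection rule can be expressed in a few lines of Python. This is a sketch of the criterion only. The function names and the data layout (file name mapped to per-recognizer WER in percent) are assumptions for illustration, and no actual benchmark numbers are reproduced here.

```python
def split_by_best_wer(results: dict, max_wer: float = 20.0):
    """Split files into (kept, removed).

    `results` maps file name -> {recognizer name: WER in percent}.
    A file is kept if at least one recognizer achieved WER <= max_wer,
    i.e. it is removed only when none of the recognizers met the cut-off.
    """
    kept, removed = {}, {}
    for file_name, wers in results.items():
        target = kept if min(wers.values()) <= max_wer else removed
        target[file_name] = wers
    return kept, removed


def average_wer(results: dict, recognizer: str) -> float:
    """Average WER of one recognizer over the given files."""
    values = [wers[recognizer] for wers in results.values()]
    return sum(values) / len(values)
```

Applied to the 64 files of the two original sets, this rule leaves 54 files in the reported set.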

The results

As you can see, the Voicegain and Amazon recognizers are very evenly matched, with average WER differing by only 0.02%. The same holds for the Google Enhanced and Microsoft recognizers, with a WER difference of only 0.04%. The WER of Google Standard is about twice that of the other recognizers.

‍