Our Blog

News, Insights, sample code & more!

Enterprise
Announcing Voicegain Casey, a Generative AI Voice Agent for Health Plan and TPA Call Centers

Voicegain is excited to announce the launch of Voicegain Casey, a payer-focused AI Voice Agent that transforms the end-to-end call center experience with the power of generative AI. Voicegain Casey is a software suite of the following three Voice AI SaaS applications that help a health plan or TPA call center improve operational efficiency and increase CSAT and NPS (Net Promoter Score):

A. Voicegain Casey - Suite of Generative AI-Powered SaaS Applications

1. AI Voice Assistant:

The AI Voice Assistant replaces a touch-tone IVR with a modern LLM-powered conversational AI Phone Agent. The AI Phone Agent can answer all calls that are received at a health plan or TPA call center. It engages callers in a natural conversation and automates routine telephone calls like claims status, eligibility inquiries and eligibility verifications. In our experience, there is a very compelling business case for automating provider phone calls in health plan and TPA call centers, and Voicegain Casey is specifically designed to do this. The AI Voice Assistant is also trained to perform HIPAA validation and triaging of calls: if the AI has not been trained to answer a specific question, it routes the call to the call center for live assistance.

2. AI Co-Pilot: 

Voicegain AI Co-Pilot is a browser extension that runs as a browser side-panel next to the call center agent's CRM. The Co-Pilot is integrated with the Contact Center/CCaaS platform of the payer. When a call transferred by the AI Voice Assistant is eventually answered by a live agent, all the information collected by the AI Voice Assistant is presented as a "Screen-Pop" on the desktop of the live agent (also referred to as CTI). This CTI/Screen-Pop feature ensures that the front-line call center staff do not have to ask the customer to repeat any information that was provided to the AI Voice Assistant. In addition to the Screen-Pop, the AI Co-Pilot also guides the front-line call center staff in real time by listening to, transcribing and analyzing the conversation. The AI Co-Pilot also generates a summary of the conversation within five seconds of the completion of the call. This automated summarization easily saves the 1-2 minutes of wrap-up time or after-call work that is very common in health plan and TPA call centers.

3. AI QA & Coach:

Voicegain AI QA & Coach is a browser-based AI SaaS application that is used by team leaders, QA call coaches/analysts and operations managers in a call center. This AI SaaS app can record calls, measure the sentiment of the callers, analyze the QA score and provide automated coaching tips to the agents. Voicegain uses the latest open-source LLMs (like Llama 3 and Gemma) and closed-source reasoning models like o3 from OpenAI. With the power of modern reasoning models, almost the entire QA scorecard (at least 80% of the questions) can be answered automatically. This SaaS app also provides a database of whole-call recordings covering the customer's entire conversation, which includes the AI Voice Assistant part, the transfer to the specific call center queue and eventually the entire conversation between the live agent and the caller.

B. Integrations

Voicegain Casey requires the following 3 key integrations to help with automation and real-time assistance.

1. Contact Center Platform/CCaaS Platform

Voicegain Casey integrates with modern CCaaS platforms. Current integrations include Aircall, Five9 and Genesys Cloud. Planned integrations include RingCentral, NICE CXone and Dialpad.

2. CRM Software

Voicegain Casey integrates with the CRM software of the health plan or the TPA. This can be an off-the-shelf CRM like Zendesk or Salesforce. It can also be a proprietary/homegrown CRM. As long as the CRM is a browser-based SaaS application, this should not be an issue. The Voicegain Casey AI Co-Pilot is a browser extension that is installed in the side-panel of the same browser tab as the CRM. At the end of the call, the summary of the call is automatically generated and available in the browser extension within 5 seconds.

3. Eligibility & Claims

Voicegain Casey needs access to the member data (for HIPAA Validation) and claims data.

C. Demo and Additional Information

For further information on Voicegain Casey, including a demo, please visit this link

D. Give us a shout!

If you would like to understand Voicegain Casey in more detail or if you would prefer a detailed product demo over a Zoom video call, please do not hesitate to send us an email. You can reach us at sales@voicegain.ai or support@voicegain.ai

Use Cases
Voicegain Speech Recognition for Voice Picking for Warehouses

Among the various speech-to-text APIs that Voicegain provides is a speech recognition API that uses grammars and supports continuous recognition. This API is ideally suited for use in warehouse Voice Picking applications. Warehouse Management Systems can embed Voicegain APIs to offer Voice Picking as part of their feature set.

Here are more details of that specific API:

  • Audio input - supports streaming of audio via websockets for very easy integration with web-based or Android/iOS applications (gRPC support is in beta).
  • Results of recognition are available via websocket or HTTP callbacks in JSON format. Sending recognition results over websockets is a recent addition and it makes building web-based voice picking applications much easier.
  • Supports grammar-based recognition - better suited for a well-defined set of commands compared to large-vocabulary speech-to-text. It has higher accuracy, better noise rejection, better handling of various accents, etc. Using grammars also provides the benefit of fast end-pointing - the recognizer knows that the command has been completely uttered and no additional timeout is needed to determine end-of-speech. We support a variant of the JSGF grammar format which is very intuitive and easy to use.
  • Supports continuous recognition - multiple commands can be recognized in a single HTTP session. Continuous recognition allows the commands to be spaced closer together and allows for natural correction of misrecognitions by simple repetition.
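
To illustrate why a grammar enables fast end-pointing, here is a minimal Python sketch. It is not the Voicegain recognizer or its JSGF variant; the command set and the `classify` helper are hypothetical and exist only to show that, with a closed set of commands, the moment an utterance is complete can be decided without waiting for a silence timeout.

```python
# Hypothetical pick commands. A real deployment would define these in a
# JSGF-style grammar; tuples are used here only to keep the sketch short.
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}

COMMANDS = [
    ("pick", "<digit>"),
    ("skip", "item"),
    ("repeat",),
    ("go", "to", "aisle", "<digit>"),
]


def _matches(token, slot):
    """A slot is either a literal word or the <digit> placeholder."""
    return token in DIGITS if slot == "<digit>" else token == slot


def classify(words):
    """Return 'complete', 'partial' or 'no-match' for the words heard so far."""
    partial = False
    for pattern in COMMANDS:
        if len(words) > len(pattern):
            continue  # too many words for this command
        if all(_matches(w, s) for w, s in zip(words, pattern)):
            if len(words) == len(pattern):
                return "complete"  # the command is fully uttered: end-point now
            partial = True  # still a valid prefix: keep listening
    return "partial" if partial else "no-match"
```

With this sketch, `classify(["go", "to"])` reports a partial command, while `classify(["pick", "three"])` is complete as soon as the digit is heard.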

In addition, the Voicegain Speech-to-Text platform provides further benefits for Voice Picking applications:

  • Acoustic/language model is customizable - this allows for very high recognition accuracy for specific domains
  • Web-based tools available for reviewing utterance recognitions. These tools allow for tuning of grammars and for collection of utterances for model training.

Together this allows for your Voice Picking application to continually learn and improve.

Our APIs are available in the Cloud but can also be hosted at the Edge (on-prem) which can increase reliability and reduce the already low latencies.

If you would like to test our APIs and see how they would fit in your warehouse applications, you can start with the fully functional example web app that we have made available on GitHub: platform/examples/command-grammar-web-app at master · voicegain/platform (github.com)

If you have any questions, please email us at info@voicegain.ai. You can also sign up for a free account on the Voicegain Platform via our Web Console at: https://console.voicegain.ai/signup

Developers
Streaming Audio data from Contact Center platforms to enable Generative AI Voice apps like Realtime Agent-Assist and AI Co-Pilot

This article outlines various options for how developers and builders of real-time Gen AI voice applications in contact centers should design and architect access to streaming audio data from IP-based contact center systems. These systems can be premise-based contact center platforms like Avaya, Cisco and Genesys, or CCaaS platforms like Five9, Genesys Cloud, NICE CXone and Aircall.

Use Case for Realtime Generative AI Voice for Contact Center

One of the main use cases for Realtime Generative Voice AI in a contact center is Realtime Agent Assist (RTAA) or a generative AI Co-Pilot. The first step for any such real-time application is to stream audio from the contact center platform to a streaming Speech-to-Text model and get the speaker-separated transcript. This transcript in turn can be fed to an LLM for real-time sentiment analysis, QA automation, agent assist, summarization and other real-time AI use cases in the contact center.

Voicegain's in-house Kappa model is one such streaming speech-to-text model. The real-time transcript is made available by Voicegain over websockets.

Architecture Options to get Real-time Audio data

Overall, there are 3 main approaches to getting access to real-time audio streams:

  • Voicegain SIP Media Stream B2BUA (for on-premise systems)
  • SIPREC from the SBC (currently in beta)
  • Programmable integration (leveraging APIs provided by CCaaS platforms)

The details of each of these approaches are described below.

SIP Media Stream B2BUA

Most on-premise contact center platforms, like Avaya, Genesys and Cisco, do not provide programmatic access to the media streams. Instead, they all offer the ability to transfer a call to a SIP destination/URI. This destination in turn can be provided by the Voicegain SIP Media Stream B2BUA. In other words, the Voicegain SIP Media Stream B2BUA can accept a call arriving via such a SIP INVITE.

More details of the SIP Media Stream B2BUA can be found here

SIPREC from Session Border Controller (currently in Beta)

Most enterprise premise-based contact center platforms include a network element called the Session Border Controller (SBC). An SBC can be thought of as a SIP-aware firewall that sits "in front" of a premise-based IP contact center. SBCs support the forking of audio streams using a protocol called SIPREC, which has been used over the years by active/compliance call recording vendors like NICE and Verint.

With SIPREC, an SBC essentially provides a mirror or fork of the real-time RTP stream from the telephone call. This can be sent to Voicegain's SIPREC Server (currently in beta).

Voicegain has a beta version of a SIPREC interface that has been tested with the following platforms:

  • Avaya Enterprise SBC
  • Ribbon/Sonus SBC
  • Broadsoft SIPREC sipua
  • Cisco Unified Border Element (CUBE)
  • Metaswitch SIPREC sipua - the minimum version of Metaswitch that supports SIPREC is 9.0.10
  • Oracle SBC SIPREC - Selective Call Recording SIPREC (oracle.com)
  • Twilio TwiML <Siprec>

Voicegain can capture relevant call metadata in addition to obtaining the audio (the metadata capture functionality may differ in capabilities depending on the client platform).

The Voicegain platform can be configured to automatically launch transcription and speech analytics as soon as a new SIPREC session is established.

SIPREC support is available both in the Cloud and the Edge (OnPrem) deployments of the Voicegain Platform.

SIPREC is an Enterprise feature of the Voicegain platform and is not included in the base package. Please contact support@voicegain.ai or submit a Zendesk ticket for more information about SIPREC and if you would like to use it with your existing Voicegain account.

Programmable Integration with CCaaS real-time audio streaming APIs

Some CCaaS platforms, in particular the modern ones, provide APIs for programmatic access to the real-time audio stream. In many of them this capability was added specifically to simplify integration with Cloud Speech-to-Text services. Some premise-based platforms offer comparable interfaces.

Examples of such APIs are:

  • Five9 VoiceStream
  • Genesys AudioHook
  • Avaya DMCC (part of Avaya Aura® Application Enablement (AE) Services), which can open RTP streams with the content of the call
  • Extended Media Forking (XMF), provided by Cisco Unified Communications Gateway Services

The Voicegain Platform supports multiple protocols that allow for flexible programmable integration with these APIs:

  • websockets - sending binary audio data over a websocket is supported. In addition to binary data, the message protocols used by Twilio and SignalWire for audio streaming over websockets are also supported. (If required, we can easily add support for additional message protocols.)
  • gRPC - binary audio data may also be sent using the gRPC protocol. Note that this capability is currently in beta.
  • plain RTP - the IP/port/encoding negotiation has to be done using our HTTP API. We support neither RTCP nor RTSP. The HTTP API is very simple, and some of our customers have already integrated this type of plain RTP streaming using XMF within the Cisco UC environment.

All of these protocols support uLaw, aLaw, and Linear 16-bit encoding at a sample rate of either 8 kHz or 16 kHz.

Contact us to discuss or brainstorm!

If you are building a voice Gen AI application and you would like to discuss getting access to realtime audio data, please contact us at support@voicegain.ai

Speech Analytics
PII Text and Audio Redaction now available in Speech Analytics API

Our latest release (1.24.0) expands the Voicegain Speech Analytics and Transcription APIs with the ability to redact sensitive data both in the transcript and in the audio. This helps our customers comply with standards like HIPAA, GDPR, CCPA, PCI or PIPEDA.

Any of the following types of Named Entities can be redacted in transcript text and/or the audio file.

  • ADDRESS - Postal address.
  • CARDINAL - Numerals that do not fall under another type.
  • CC - Credit Card
  • DATE - Absolute or relative dates or periods.
  • EMAIL - (coming soon) Email address
  • EVENT - Named hurricanes, battles, wars, sports events, etc.
  • FAC - Buildings, airports, highways, bridges, etc.
  • GPE - Countries, cities, states.
  • NORP - Nationalities or religious or political groups.
  • MONEY - Monetary values, including unit.
  • ORDINAL - "first", "second", etc.
  • ORG - Companies, agencies, institutions, etc.
  • PERCENT - Percentage, including "%".
  • PERSON - People, including fictional.
  • PHONE - (coming soon) Phone number.
  • QUANTITY - Measurements, as of weight or distance.
  • SSN - Social Security number
  • TIME - Times smaller than a day.
  • ZIP - (coming soon) Zip Code (if not part of an Address)

In the audio they are replaced with silence and in the transcript they are replaced with a string specified when making the API request.
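
The following Python sketch shows the idea of the two replacements in isolation. It is not the Voicegain implementation or API: the word and entity dictionaries (`text`, `start_ms`, `end_ms`, `type`, `first`, `last`) and both function names are hypothetical, chosen only to make the example self-contained.

```python
def redact_transcript(words, entities, redact_types, placeholder="[REDACTED]"):
    """Replace words covered by selected entities with a placeholder.

    words:    list of {"text", "start_ms", "end_ms"} (hypothetical layout)
    entities: list of {"type", "first", "last"} with inclusive word indexes
    Returns the redacted text and the audio time spans to silence.
    """
    selected = sorted(
        (e for e in entities if e["type"] in redact_types),
        key=lambda e: e["first"],
    )
    out, spans, i = [], [], 0
    for ent in selected:
        if ent["first"] < i:
            continue  # overlaps an entity that was already redacted
        out.extend(w["text"] for w in words[i:ent["first"]])
        out.append(placeholder)
        spans.append((words[ent["first"]]["start_ms"],
                      words[ent["last"]]["end_ms"]))
        i = ent["last"] + 1
    out.extend(w["text"] for w in words[i:])
    return " ".join(out), spans


def silence_audio(samples, spans, sample_rate_hz):
    """Return a copy of the PCM samples with the given spans set to zero."""
    out = list(samples)
    for start_ms, end_ms in spans:
        lo = start_ms * sample_rate_hz // 1000
        hi = min(len(out), end_ms * sample_rate_hz // 1000)
        for k in range(lo, hi):
            out[k] = 0
    return out
```

The time spans returned by `redact_transcript` can be passed straight to `silence_audio`, so the transcript and the audio stay redacted in the same places.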

This feature is supported both in Cloud and on the Edge (on-prem).

Two typical use cases are:

  • Enable redaction as part of normal processing, e.g. of call center calls.
  • Bulk-process previously unredacted audio in storage to achieve compliance. Combined with the low per-minute price of Voicegain APIs, this allows our customers to cost-effectively process large quantities of audio data.


Languages
Voicegain offers Spanish Speech-to-Text

Last week we announced that Spanish Speech-to-Text capability would be available from Voicegain in March. We are pleased to announce today that we were able to complete training of the Spanish Neural Network Model earlier than expected, and Spanish Speech-to-Text was released last Saturday (2/20) as part of our Release 1.24.0.

We have been able to complete work on the Spanish model from start to finish in exactly 3 weeks - we started working on it February 3rd. Such fast progress was possible because of our extensive experience with customization of Neural Network Models for speech recognition and the fact that we have developed advanced tools and proven techniques that make speech-to-text model development and training fast.

The recognition accuracy of the model depends on the type of speech audio. For most benchmark files, the accuracy of our Spanish model is just a few percent behind that of the Google or Amazon recognizers. The advantages of our recognizer are its significantly lower price and the ability to train customized acoustic models. Custom models can have accuracy higher than that of Amazon or Google. We encourage you to use our Web Console and/or API to test the real-life performance on your own data. Note that we are focusing this speech-to-text model on Latin American Spanish.

Of course, the Voicegain platform offers other advantages too, like support for Edge (on-prem) deployments and an extensive API with many options for out-of-the-box integration into, e.g., telephony environments.

Currently, Speech-to-Text API is fully functional with the Spanish Model. Some of the Speech Analytics API functions are not yet available for Spanish, e.g., Named Entity Recognition or Sentiment/Mood detection.

Initially the Spanish Model is available only in the version that supports off-line transcription. A real-time version of the Model will be available in the near future.

To tell the API that you want to use the Spanish Acoustic Model all you need to do is choose it in the Context settings. Spanish models have 'es' in the name, e.g. VoiceGain-ol-es:1

Telephony
Unique feature: RTP streaming support

The Voicegain speech-to-text platform has supported RTP streaming from the very beginning. One of our first applications, several years ago, was live transcription in which the ffmpeg utility was used to capture audio from a device and stream it to the Voicegain platform using RTP. Over time we added more robust protocols and RTP was rarely used. However, in one of our recent deployments we came across a use case where RTP streaming allowed our customer to integrate in a very straightforward way within a call-center telephony stack.

The Voicegain platform does support more advanced streaming protocols for call-center use, like SIPREC or SIP/RTP (SIP INVITE). However, in this particular case we were able to stream from Cisco CUBE directly to Voicegain using plain RTP. Upon receiving an incoming call, a script is triggered which uses HTTP to establish a new Voicegain transcription session. The session response returns the ip:port parameters of the RTP receiver specific to that session, and these are passed to the CUBE to establish a direct RTP connection.

RTP used like this provides no authentication or security, which would make it generally unsuitable for use over the Internet. However, in this particular use case our customer benefits from the fact that the entire Voicegain stack can be deployed on-prem. Because it is on the same isolated network as the CUBE, there are no issues with security and/or packet loss.

An example

You can visit our GitHub to see a Python code example which shows how to establish the speech-to-text session, how to point the RTP sender to the receiver endpoint, and how to receive real-time transcription results via a websocket.

The command to establish the session is as simple as this:


The audio section defines the RTP streaming part, and the websocket section defines how the results will be sent back over a websocket.

The response looks like this:

In the GitHub example, the stream.ip and stream.port values are passed to ffmpeg, which is used as the RTP streaming client. The example further illustrates how to process the messages with incremental transcription results sent in real time over the websocket.
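
As a rough illustration of that hand-off, the sketch below builds an ffmpeg command line from a session response. It is not taken from the GitHub example: the nested `{"stream": {"ip": ..., "port": ...}}` layout is an assumption based on the stream.ip and stream.port names above, the function name is hypothetical, and the uLaw 8 kHz mono settings are placeholders that would have to match whatever encoding was requested when the session was created.

```python
def build_ffmpeg_rtp_command(session_response, audio_source, sample_rate_hz=8000):
    """Build an ffmpeg command that streams audio_source to the RTP receiver.

    session_response is assumed (for this sketch) to be the parsed JSON
    response containing {"stream": {"ip": ..., "port": ...}}.
    """
    stream = session_response["stream"]
    target = f"rtp://{stream['ip']}:{stream['port']}"
    return [
        "ffmpeg",
        "-re",                       # read input at its native rate (real time)
        "-i", audio_source,          # input file or capture device
        "-vn",                       # ignore any video track
        "-ac", "1",                  # mono
        "-ar", str(sample_rate_hz),  # output sample rate
        "-acodec", "pcm_mulaw",      # uLaw; assumed to match the session setup
        "-f", "rtp",                 # RTP output muxer
        target,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start streaming once the session has been established.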

Speech Analytics
Voicegain Speech Analytics API Generally Available

Voicegain has released its Speech Analytics (SA) API, which supports a variety of analytics tasks performed on the audio or the transcript of that audio. The features supported by the Voicegain SA API were chosen to support our main target use case, which is processing call center calls.


Things that Speech Analytics can do now (from release 1.22.0)

The current release supports offline Speech Analytics. The data that can be obtained through Speech Analytics API is listed below.

Note, here we do not include things that can be obtained also from our Transcribe API, like: transcript, decibel values, audiozones, etc. These, however, will be accessible from the Speech Analytics API response.

Per channel analytics:

  • gender - likely gender of the speaker based on the voice characteristics. Currently either "male" or "female".
  • emotion - both totals over the entire call and a list of values computed at multiple places in the transcript. Each item will contain values of:
      • sentiment - from -1.0 (mad/angry) to +1.0 (happy/satisfied)
      • mood - a map with estimated values (range 0.0 to 1.0) for the following moods: "neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"
      • location - start and end in msec and index of the word
  • Named Entities recognized in the call. This will be a list with the entity type and the location in the call. NER values that are supported are:
      • CARDINAL - Numerals that do not fall under another type.
      • DATE - Absolute or relative dates or periods.
      • EVENT - Named hurricanes, battles, wars, sports events, etc.
      • FAC - Buildings, airports, highways, bridges, etc.
      • GPE - Countries, cities, states.
      • NORP - Nationalities or religious or political groups.
      • MONEY - Monetary values, including unit.
      • ORDINAL - "first", "second", etc.
      • ORG - Companies, agencies, institutions, etc.
      • PERCENT - Percentage, including "%".
      • PERSON - People, including fictional.
      • QUANTITY - Measurements, as of weight or distance.
      • TIME - Times smaller than a day.
  • keywords - list of keywords or keyword groups recognized in the call. Keywords to be recognized can easily be configured from examples.
  • profanity - this is essentially a predefined keyword group
  • talk metrics - things like maximum and average talk streak, talk rate, energy
  • overtalk metrics - overtalk happens if a speaker starts speaking while the other speaker is already speaking.
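
The overtalk definition above can be made concrete with a short Python sketch. This is only an illustration of the definition, not the Voicegain algorithm: speech segments are assumed to be available per channel as `(start_ms, end_ms)` tuples, and the function name is hypothetical.

```python
def overtalk_incidents(own_segments, other_segments):
    """Return the speaker's segments that begin while the other one is talking.

    Both arguments are lists of (start_ms, end_ms) speech segments, one list
    per channel. A segment counts as overtalk if its start falls strictly
    inside a segment of the other channel.
    """
    return [
        (start, end)
        for start, end in own_segments
        if any(o_start < start < o_end for o_start, o_end in other_segments)
    ]
```

Calling the function once per channel, with the arguments swapped, gives the overtalk incidents attributed to each speaker separately.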

Global analytics:

  • silence metrics - defined as time when none of the channels is speaking. Note: only the Agent is assumed to be in control of the speaking time. This is a simplification, but it is difficult to determine if any silence was caused by the caller and was unavoidable.
  • word cloud frequencies - smart word cloud data with stop words removed and word variations collapsed before computing frequencies
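
As an illustration of the silence definition above, the following Python sketch totals the time during which no channel is speaking. It is a simplified reading of the definition, not the Voicegain implementation; the per-channel `(start_ms, end_ms)` segment lists and the function name are assumptions made for the example.

```python
def total_silence_ms(channels, call_duration_ms):
    """Total time (ms) during which none of the channels is speaking.

    channels is a list of per-channel segment lists, each segment being a
    (start_ms, end_ms) tuple of detected speech.
    """
    intervals = sorted(seg for channel in channels for seg in channel)
    silent = 0
    cursor = 0  # end of the speech covered so far
    for start, end in intervals:
        if start > cursor:
            silent += start - cursor  # gap where nobody is speaking
        cursor = max(cursor, end)
    if call_duration_ms > cursor:
        silent += call_duration_ms - cursor  # trailing silence
    return silent
```

Overlapping speech on the two channels is handled by the running `cursor`, so overtalk is never counted twice and never counted as silence.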

Speech Analytics features coming soon

Real-time Speech Analytics will be available in the near future. Soon we also plan to release Score Card support for Speech Analytics.

Per channel analytics coming soon:

  • Two additional named entities: CC - Credit Card, SSN - Social Security number
  • age - estimated age of the speaker based on the voice characteristics. Three possible values: "young-adult", "senior", "unknown"
  • phrases - list of phrases or phrase groups recognized in the call. These are identified using NLU algorithms - essentially the same as used for identifying NLU intents. Phrases to be recognized can be configured from examples.
  • pitch statistics will be added to talk metrics

Additionally, we will soon support PII redaction of any named entity from either transcript or audio.

Supported audio types

Speech Analytics API supports the following types of audio input:

  • 2-channel (stereo) audio, as typically found in call centers, where the Caller voice is recorded in one channel and the Agent voice is recorded in the other channel. Some metrics, such as overtalk, can only be computed if the input audio is of this type.
  • 1-channel audio with two speakers - for this audio type diarization will be performed to separate the two speakers. The per-channel analytics will be performed after diarization. Overtalk metrics are not available for this use case.

You can see the API specification here.
