Our Blog

News, Insights, sample code & more!

ASR
Announcing the launch of Voicegain Whisper ASR/Speech Recognition API for Gen AI developers

Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that runs on Voicegain managed cloud infrastructure and is accessible using Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that processes over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna and innovative startups like Level.AI, Onvisource and DataOrb.

The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to integrate conversation transcripts with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), LLaMA 2 (open source from Meta), and their own private LLMs to power generative AI apps. OpenAI has open-sourced several versions of the Whisper model. With today's release Voicegain supports Whisper-medium, Whisper-small and Whisper-base. Voicegain now supports transcription in the multiple languages that Whisper supports.

Here is a link to our product page.


There are four main reasons for developers to use Voicegain Whisper over other offerings:

1. Support for Private Cloud/On-Premise deployment (integrate with Private LLMs)

While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-Text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month. We can bring this practical real-world experience of running AI models at scale to our developer community.

Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.

2. Affordable pricing - 40% less expensive than OpenAI

At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what OpenAI offers.

3. Enhanced features for Contact Centers & Meetings.

Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps are another important feature of our API, needed to map audio to text. Enhanced diarization, which we already provide for the Voicegain models and which is a required feature for contact center and meeting use cases, will soon be made available on Whisper.

4. Premium Support and uptime SLAs.

We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.

About the OpenAI Whisper Model

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model uses an encoder-decoder transformer architecture and has shown significant performance improvement compared to previous models because it has been trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

OpenAI Whisper model encoder-decoder transformer architecture


Getting Started with Voicegain Whisper

Learn more about Voicegain Whisper by clicking here. Any developer - whether a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.

There are two ways to test Voicegain Whisper; they are outlined here. If you would like more information or if you have any questions, please drop us an email at support@voicegain.ai.
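To make that concrete, here is a minimal Python sketch of what a batch transcription request could look like. The endpoint path, field names, and the "whisper-medium" model label below are illustrative assumptions, not the exact schema - please refer to the API documentation referenced above for the real request format.

# A minimal sketch of submitting a batch (off-line) transcription request.
# Endpoint path, field names and model label are assumptions for illustration.
import time
import requests

API_ROOT = "https://api.voicegain.ai/v1"           # assumed base URL
HEADERS = {"Authorization": "Bearer <JWT from the Voicegain Console>"}

# Submit an async transcription session pointing at an audio file URL.
resp = requests.post(
    f"{API_ROOT}/asr/transcribe/async",
    headers=HEADERS,
    json={
        "audio": {"source": {"fromUrl": {"url": "https://example.com/call.wav"}}},
        "settings": {"asr": {"acousticModelNonRealTime": "whisper-medium"}},
    },
)
resp.raise_for_status()
session_id = resp.json()["sessionId"]               # assumed response field

# Poll until the transcript is ready, then hand it to an LLM of your choice.
while True:
    poll = requests.get(f"{API_ROOT}/asr/transcribe/{session_id}", headers=HEADERS).json()
    if poll.get("progress", {}).get("phase") == "DONE":    # assumed status field
        print(poll["result"]["transcript"])
        break
    time.sleep(5)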

Contact Center
Easy Speech IVR for Outbound Calling using Voicegain and Twilio

Outbound IVRs on Voicegain

The Voicegain platform makes it easy to build IVRs for simple outbound calling applications like surveys (Voice-of-Customer, political, etc.), reminders (e.g. appointments, payments due), notifications (e.g. school closure, water boil notice), and so on.

Voicegain allows developers to use the outbound calling features of CPaaS platforms like Twilio or SignalWire with the speech recognition and IVR features of the Voicegain platform. All you need is a simple piece of code to make an outbound call using Twilio and connect it to Voicegain for IVR.
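Below is a minimal sketch of that outbound-call step using the Twilio Python helper library. The phone numbers and the TwiML URL that bridges the call to Voicegain are placeholders - the github repository referenced below contains the exact, working version.

# A minimal sketch of placing the outbound call with the Twilio Python library.
# The numbers and TwiML URL are placeholders, not working values.
from twilio.rest import Client

client = Client("ACXXXXXXXXXXXXXXXX", "<your Twilio auth token>")

call = client.calls.create(
    to="+12025551234",                 # the person you are calling
    from_="+14155550100",              # your Twilio number
    # Twilio fetches TwiML from this URL when the call is answered; that TwiML
    # bridges the call to Voicegain, which then runs the IVR.
    url="https://your-server.example.com/twiml/connect-to-voicegain",
)
print(call.sid)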


Defining IVRs in a declarative way

Voicegain provides a full-featured Telephone Bot API. It is a webhook/callback-style API that can be used in a similar way to Twilio's TwiML. You can read more about it here.

However, in this post we describe an even simpler method to build IVRs. We allow developers to specify the outbound IVR call flow definitions in a simple YAML format. We also provide a Python script that can be easily deployed on AWS Lambda or on your web server to interpret this YAML file. The complete code with examples can be found on our github. It is under the MIT license, so you can modify the main interpreter script to your liking. You might want to do that, for example, to make calls to external web services that your IVR needs.

In this YAML format, an IVR question would be defined as follows:
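As an illustration only - the field names below are hypothetical, not the actual schema; see the github repository for real, working definitions - a question entry could look roughly like this, here loaded with PyYAML the way the interpreter script would read it:

# A hypothetical sketch of a question entry in the YAML call-flow file.
# Field names are illustrative, not the actual schema.
import yaml

question_yaml = """
name: appointment-confirm
prompt: "You have an appointment tomorrow at 2 PM. Can you make it? Say yes or no."
grammar: "yes | no | (I (can | can not | can't))"
confirm: false            # built-in confirmation logic could be switched on here
noinputPrompt: "Sorry, I did not hear you."
nomatchPrompt: "Sorry, I did not understand."
"""

question = yaml.safe_load(question_yaml)
print(question["name"], "->", question["prompt"])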


As you can see, this is a pretty easy way to define an IVR question. Notice also that we provide built-in handling for the NOINPUT and NOMATCH re-prompts, as well as the logic for confirmations. This greatly reduces the clutter in the specification, as those flow scenarios do not have to be handled explicitly.

Questions can either use grammars to map responses to semantic meaning, or simply capture the response using large-vocabulary transcription.

Prompts are played using TTS or can be concatenated from prerecorded clips.

Wait, there is more.

Because this is built on top of the Voicegain Telephone Bot API, it comes with full API access to the IVR call session. You can obtain details of the complete session, including all the events and responses, using the API. This includes the two-channel recording, full transcription of both channels, and Speech Analytics features.
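For example, a short Python sketch of pulling a completed session's details might look like the following; the resource path and field names are assumptions for illustration, so check the Telephone Bot API documentation for the actual names.

# A minimal sketch of retrieving the details of a completed IVR session.
# Path and field names are assumptions for illustration.
import requests

API_ROOT = "https://api.voicegain.ai/v1"            # assumed base URL
HEADERS = {"Authorization": "Bearer <JWT from the Voicegain Console>"}

session_id = "<id of the outbound IVR session>"
session = requests.get(f"{API_ROOT}/sa/call/{session_id}", headers=HEADERS).json()

# Assumed structure: a list of events (prompts played, responses recognized).
for event in session.get("events", []):
    print(event.get("type"), event.get("value"))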

You can also examine the details of the session from the Voicegain Console and listen to the audio. This helps in testing the application before it gets deployed.  




If you have questions about building this type of IVR on the Voicegain platform, please contact us at support@voicegain.ai.

Use Cases
Voicegain Speech Recognition for Voice Picking for Warehouses

Among the various speech-to-text APIs that Voicegain provides is a speech recognition API that uses grammars and supports continuous recognition. This API is ideally suited for use in warehouse Voice Picking applications. Warehouse Management Systems can embed Voicegain APIs to offer Voice Picking as part of their feature set.

Here are more details of that specific API:

  • Audio input - supports streaming of audio via websockets for very easy integration with web-based or Android/iOS applications (gRPC support is in beta); a streaming sketch is shown after this list
  • Results of recognition are available via websocket or HTTP callbacks in JSON format. Sending recognition results over websockets is a recent addition and it makes building web-based voice picking applications much easier.
  • Supports grammar-based recognition - better suited for a well-defined set of commands than large-vocabulary speech-to-text, with higher accuracy, better noise rejection, better handling of various accents, etc. Using grammars provides the benefit of fast end-pointing - the recognizer knows when the command has been completely uttered, so no additional timeout is needed to determine end-of-speech. We support a variant of the JSGF grammar format which is very intuitive and easy to use.
  • Supports continuous recognition - multiple commands can be recognized in a single HTTP session. Continuous recognition allows the commands to be spaced closer together and allows for natural correction of misrecognitions by simple repetition.
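Here is a rough sketch of such websocket streaming in Python, using the third-party websockets package. The websocket URLs would normally come from an earlier HTTP request that configures the session (grammar, continuous mode, audio format); the URLs and the plain binary framing shown here are assumptions for illustration only - the github example linked below shows the real flow.

# A rough sketch of streaming audio for continuous grammar-based recognition
# over websockets. URLs and framing are placeholders/assumptions.
import asyncio
import json
import websockets

AUDIO_WS_URL = "wss://<audio websocket url from session setup>"
RESULT_WS_URL = "wss://<result websocket url from session setup>"

async def send_audio(path: str) -> None:
    async with websockets.connect(AUDIO_WS_URL) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(3200):          # ~100 ms of 16 kHz / 16-bit audio
                await ws.send(chunk)
                await asyncio.sleep(0.1)          # pace roughly like a live stream

async def receive_results() -> None:
    async with websockets.connect(RESULT_WS_URL) as ws:
        async for message in ws:
            print("recognized:", json.loads(message))   # e.g. picked item / quantity

async def main() -> None:
    await asyncio.gather(send_audio("picking-commands.wav"), receive_results())

asyncio.run(main())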

In addition, the Voicegain Speech-to-Text platform provides further benefits for Voice Picking applications:

  • Acoustic/language model is customizable - this allows for very high recognition accuracy for specific domains
  • Web-based tools available for reviewing utterance recognitions. These tools allow for tuning of grammars and for collection of utterances for model training.

Together, these allow your Voice Picking application to continually learn and improve.

Our APIs are available in the Cloud but can also be hosted at the Edge (on-prem), which can increase reliability and reduce the already low latencies.

If you would like to test our APIs and see how they would fit in your warehouse applications, you can start with the fully functional example web app that we have made available on github: platform/examples/command-grammar-web-app at master · voicegain/platform (github.com)

If you have any questions, please email us at info@voicegain.ai. You can also sign up for a free account on the Voicegain Platform via our Web Console at: https://console.voicegain.ai/signup

Developers
Forking Media Streams from Contact Center Platforms for Realtime Transcription

Overview

Voicegain Real-Time Transcription and Speech Analytics APIs can get access to streaming audio data in real time from IP Telephony / Unified Communications systems (e.g. from Avaya, Cisco, Genesys) using three approaches:

  • SIPREC
  • SIP INVITE
  • Programmable integration (using APIs)

The details of each of those approaches are described below.

Use Cases

The use cases for Realtime Transcription and Speech Analytics APIs are as follows:

  1. Realtime Agent Assist in Contact Centers for customer service
  2. Realtime assist for sales people (SDRs, Sales Engineers, AEs) for telephony conversations and meetings
  3. Realtime Insights from internal meetings

The Transcription APIs convert audio into text in real time. The Speech Analytics APIs offer analytics from both text (NLU intents, sentiment, entities, and keywords) and audio (tone, silence, overtalk, etc.).


SIPREC

SIPREC is usually used for call recording, but the standard essentially provides a real-time audio stream from the telephone call, which makes it suitable for applications that have to work in real time.


The Voicegain SIPREC interface has been tested with the following platforms:

  • Avaya Enterprise SBC - also supports Avaya AES/TSAPI integration for more call metadata
  • Broadsoft SIPREC sipua
  • Cisco built-in bridge (BIB) - built-in bridge functionality is available on some of Cisco's 3rd-generation VoIP phones and is supported by Cisco's UCM version 6.0 and higher
  • Cisco Unified Border Element (CUBE)
  • Metaswitch SIPREC sipua - the minimum Metaswitch version that supports SIPREC is 9.0.10
  • Oracle SBC SIPREC - Selective Call Recording SIPREC (oracle.com)
  • Twilio TwiML <Siprec>

Voicegain can capture relevant call metadata in addition to obtaining the audio (the metadata capture functionality may differ in capabilities depending on the client platform).

The Voicegain platform can be configured to automatically launch transcription and speech analytics as soon as a new SIPREC session is established.

SIPREC support is available in both the Cloud and the Edge (on-prem) deployments of the Voicegain Platform.

SIPREC is an Enterprise feature of the Voicegain platform and is not included in the base package. Please contact support@voicegain.ai or submit a Zendesk ticket for more information about SIPREC and if you would like to use it with your existing Voicegain account.

SIP INVITE

Certain platforms, like Genesys for example, do not support SIPREC. Instead, they may offer the ability to send a separate- or combined-channel audio stream to a destination negotiated using a SIP INVITE. The Genesys platform, for example, supports streaming of the inbound and outbound RTP media to two separate SIP endpoints.

The Voicegain Platform allows you to define SIP addresses that will accept such a SIP INVITE. As part of the SIP INVITE, custom SIP headers may be sent to provide information that allows for session tie-up and to pass any additional metadata. Upon establishing the SIP connection, Voicegain makes an HTTP callback to a specified endpoint to acknowledge the connection and pass all the connection data.
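A minimal sketch of such a callback receiver, written with Flask, is shown below; the payload fields are assumptions for illustration - the actual connection data delivered by Voicegain is described in the platform documentation.

# A minimal sketch of an HTTP callback receiver for the SIP INVITE flow.
# The payload fields shown are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/voicegain/sip-session", methods=["POST"])
def sip_session_established():
    data = request.get_json(force=True)
    # Hypothetical fields: session id plus any custom SIP headers sent in the
    # INVITE, which let you tie the stream back to your own call record.
    print("session:", data.get("sessionId"), "headers:", data.get("sipHeaders"))
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=8080)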

Programmable Integration

Some UC platforms, in particular the newer versions, provide additional capabilities for accessing the real-time audio stream. In many of them, such a capability was added specifically to simplify integration with cloud Speech-to-Text services.

Examples of that type of integration are:

  • Use Avaya DMCC (which is part of Avaya Aura® Application Enablement (AE) Services) to open RTP streams with the content of the call
  • Use Extended Media Forking (XMF) provided by Cisco Unified Communications Gateway Services
  • Five9 VoiceStream

Voicegain Platform provides multiple protocols that allow for flexible programmable integration:

  • websockets - sending binary audio data over a websocket is supported. In addition to binary data, the message protocols used in Twilio and SignalWire for audio streaming over websocket are also supported (if required, we can easily add support for additional message protocols); a sketch of the Twilio-style framing is shown after this section
  • gRPC - binary audio data may also be sent using the gRPC protocol. Note that this capability is currently in beta.
  • plain RTP - Voicegain also supports plain RTP. The IP/port/encoding negotiation, however, has to be done using our HTTP API. We do not support RTCP or RTSP. The HTTP API is very simple, and some of our customers have already integrated this type of plain RTP streaming using XMF within the Cisco UC environment.

All those protocols support uLaw, aLaw, and linear 16-bit encoding at either 8 or 16 kHz sample rate.
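As an illustration of the Twilio-style message framing mentioned above, here is a rough Python sketch of a sender. The websocket URL is a placeholder and the exact fields accepted may differ - this only shows the general shape of such messages.

# A rough sketch of pushing audio using Twilio Media Streams-style JSON messages
# over a websocket. URL and exact field names are assumptions for illustration.
import asyncio
import base64
import json
import websockets

WS_URL = "wss://<websocket url obtained when configuring the session>"

async def stream(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:
        await ws.send(json.dumps({
            "event": "start",
            "start": {"mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000}},
        }))
        with open(path, "rb") as f:                 # raw 8 kHz uLaw audio
            while chunk := f.read(800):             # ~100 ms per message
                await ws.send(json.dumps({
                    "event": "media",
                    "media": {"payload": base64.b64encode(chunk).decode()},
                }))
                await asyncio.sleep(0.1)
        await ws.send(json.dumps({"event": "stop"}))

asyncio.run(stream("call-audio.ulaw"))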

Interested in Voicegain? Take us for a test drive!

1. Click here for instructions to access our live demo site.

2. If you are building a cool voice app and you are looking to test our APIs, click here to sign up for a developer account and receive $50 in free credits.

3. If you want to take Voicegain as your own AI Transcription Assistant to meetings, click here.



Speech Analytics
PII Text and Audio Redaction now available in Speech Analytics API

Our latest release (1.24.0) expands the Voicegain Speech Analytics and Transcription APIs with the ability to redact sensitive data both in the transcript and in the audio. This allows our customers to be compliant with standards like HIPAA, GDPR, CCPA, PCI, or PIPEDA.

Any of the following types of Named Entities can be redacted in transcript text and/or the audio file.

  • ADDRESS - Postal address.
  • CARDINAL - Numerals that do not fall under another type.
  • CC - Credit Card
  • DATE - Absolute or relative dates or periods.
  • EMAIL - (coming soon) Email address
  • EVENT - Named hurricanes, battles, wars, sports events, etc.
  • FAC - Buildings, airports, highways, bridges, etc.
  • GPE - Countries, cities, states.
  • NORP - Nationalities or religious or political groups.
  • MONEY - Monetary values, including unit.
  • ORDINAL - "first", "second", etc.
  • ORG - Companies, agencies, institutions, etc.
  • PERCENT - Percentage, including "%".
  • PERSON - People, including fictional.
  • PHONE - (coming soon) Phone number.
  • QUANTITY - Measurements, as of weight or distance.
  • SSN - Social Security number
  • TIME - Times smaller than a day.
  • ZIP - (coming soon) Zip Code (if not part of an Address)

In the audio they are replaced with silence and in the transcript they are replaced with a string specified when making the API request.
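For illustration, a redaction-enabled request could look roughly like the Python sketch below. The parameter names ("redaction", "replacement", etc.) are assumptions, not the exact schema - please see the Speech Analytics API reference for the real field names.

# A minimal sketch of enabling PII redaction in a speech analytics request.
# Field names are assumptions for illustration only.
import requests

API_ROOT = "https://api.voicegain.ai/v1"            # assumed base URL
HEADERS = {"Authorization": "Bearer <JWT from the Voicegain Console>"}

payload = {
    "audio": {"source": {"fromUrl": {"url": "https://example.com/call.wav"}}},
    "redaction": {
        "transcript": {"entities": ["CC", "SSN", "ADDRESS"], "replacement": "[REDACTED]"},
        "audio": {"entities": ["CC", "SSN", "ADDRESS"]},    # replaced with silence
    },
}

resp = requests.post(f"{API_ROOT}/sa", headers=HEADERS, json=payload)
print(resp.json())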

This feature is supported both in Cloud and on the Edge (on-prem).

Two typical use cases are:

  • Enable redaction as part of normal processing of, e.g., call center calls
  • Do bulk processing of previously unredacted audio in storage to achieve compliance. Combined with the low per-minute price of Voicegain APIs, this allows our customers to cost-effectively process large quantities of audio data.


Languages
Voicegain offers Spanish Speech-to-Text

Last week we announced that Spanish Speech-to-Text capability would be available from Voicegain in March. We are pleased to announce today that we have been able to complete training of the Spanish Neural Network Model earlier than expected, and Spanish Speech-to-Text was released last Saturday (2/20) as part of our Release 1.24.0.

We have been able to complete work on the Spanish model from start to finish in exactly three weeks - we started working on it on February 3rd. Such fast progress was possible because of our extensive experience with customization of neural network models for speech recognition, and because we have developed advanced tools and proven techniques that make speech-to-text model development and training fast.

The recognition accuracy of the model depends on the type of speech audio. For most benchmark files, our Spanish model's accuracy is just a few percent behind that of the Google or Amazon recognizers. The advantages of our recognizer are the significantly lower price plus the ability to train customized acoustic models. Custom models can have accuracy higher than that of Amazon or Google. We encourage you to use our Web Console and/or API to test the real-life performance on your own data. BTW, we are focusing this speech-to-text model on Latin American Spanish.

Of course, the Voicegain platform offers other advantages too, like support for Edge (on-prem) deployments and an extensive API with many options for out-of-the-box integration into, e.g., telephony environments.

Currently, the Speech-to-Text API is fully functional with the Spanish Model. Some of the Speech Analytics API functions are not yet available for Spanish, e.g., Named Entity Recognition or Sentiment/Mood detection.

Initially, the Spanish Model is available only in the version that supports off-line transcription. A real-time version of the model will be available in the near future.

To tell the API that you want to use the Spanish Acoustic Model, all you need to do is choose it in the Context settings. Spanish models have 'es' in the name, e.g. VoiceGain-ol-es:1.
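For example, an off-line transcription request that selects the Spanish model could look roughly like this Python sketch. The endpoint and the settings field used to pick the acoustic model are assumptions for illustration; only the model name (VoiceGain-ol-es:1) comes from the paragraph above.

# A minimal sketch of requesting off-line transcription with the Spanish model.
# Endpoint and settings field names are assumptions for illustration.
import requests

API_ROOT = "https://api.voicegain.ai/v1"             # assumed base URL
HEADERS = {"Authorization": "Bearer <JWT from the Voicegain Console>"}

resp = requests.post(
    f"{API_ROOT}/asr/transcribe/async",
    headers=HEADERS,
    json={
        "audio": {"source": {"fromUrl": {"url": "https://example.com/llamada.wav"}}},
        "settings": {"asr": {"acousticModelNonRealTime": "VoiceGain-ol-es:1"}},
    },
)
print(resp.json())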

Telephony
Unique feature: RTP streaming support

The Voicegain speech-to-text platform has supported RTP streaming from the very beginning. One of our first applications, several years ago, was live transcription with the ffmpeg utility used to capture audio from a device and stream it to the Voicegain platform using RTP. Over time we added more robust protocols and RTP was rarely used. However, recently in one of our deployments we came across a use case where RTP streaming allowed our customer to integrate in a very straightforward way with a call-center telephony stack.

The Voicegain platform does support more advanced streaming protocols for call-center use, like SIPREC or SIP/RTP (SIP INVITE). However, in this particular case we were able to stream from a Cisco CUBE directly to Voicegain using plain RTP. Upon receiving an incoming call, a script is triggered which uses HTTP to establish a new Voicegain transcription session. The session response returns the ip:port parameters of the RTP receiver specific to the session, and these are passed to the CUBE to establish a direct RTP connection.

RTP used like this provides no authentication or security, which would make it generally unsuitable for use over the Internet. However, in this particular use case our customer benefits from the fact that the entire Voicegain stack can be deployed on-prem. Because it is on the same isolated network as the CUBE, there are no issues with security and/or packet loss.

An example

You can visit our github to see a Python code example which shows how to establish the speech-to-text session, how to point the RTP sender to the receiver endpoint, and how to receive real-time transcription results via a websocket.

The command to establish the session is as simple as this:
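Here is a rough Python sketch of that step; the endpoint and field names are assumptions for illustration - the full, working request body is in the github example referenced above.

# A minimal sketch of establishing an RTP-streaming transcription session over HTTP.
# Endpoint, request fields and response fields are assumptions for illustration.
import requests

API_ROOT = "https://api.voicegain.ai/v1"              # assumed base URL
HEADERS = {"Authorization": "Bearer <JWT from the Voicegain Console>"}

request_body = {
    # audio section -- ask the platform to open an RTP receiver for this session
    "audio": {"source": {"stream": {"protocol": "RTP"}}},
    # websocket section -- how incremental results are pushed back to the client
    "websocket": {"maxInactivityS": 60},
}

resp = requests.post(f"{API_ROOT}/asr/transcribe/async",
                     headers=HEADERS, json=request_body).json()

# The session response returns the RTP receiver ip:port; these are what get
# passed to the RTP sender (ffmpeg in the github example, or the Cisco CUBE).
stream_ip = resp["audio"]["stream"]["ip"]              # assumed response fields
stream_port = resp["audio"]["stream"]["port"]
print(f"send RTP to {stream_ip}:{stream_port}")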


The audio section defines the RTP streaming part, and the websocket section defines how the results will be sent back over a websocket.

The response returns, among other things, the stream.ip and stream.port parameters of the RTP receiver created for the session. In the github example, these are passed to ffmpeg, which is used as the RTP streaming client. The example further illustrates how to process the messages with incremental transcription results sent in real time over the websocket.
