Our Blog

News, Insights, sample code & more!

ASR
Announcing the launch of Voicegain Whisper ASR/Speech Recognition API for Gen AI developers

Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper Speech Recognition/ASR model that runs on Voicegain-managed cloud infrastructure and is accessible through Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that process over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna and innovative startups like Level.AI, Onvisource and DataOrb.

The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to integrate conversation transcripts with LLMs like GPT-3.5 and GPT-4 (from OpenAI), PaLM 2 (from Google), Claude (from Anthropic), Llama 2 (open source, from Meta), and their own private LLMs to power generative AI apps. OpenAI has open-sourced several sizes of the Whisper model; with today's release, Voicegain supports Whisper-medium, Whisper-small and Whisper-base. Voicegain now supports transcription in the multiple languages that Whisper supports.

Here is a link to our product page


There are four main reasons for developers to use Voicegain Whisper over other offerings:

1. Support for Private Cloud/On-Premise deployment (integrate with Private LLMs)

While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-Text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs enable Voicegain to process over 60 million minutes a month, and we can bring this practical, real-world experience of running AI models at scale to our developer community.

Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.

2. Affordable pricing - 40% less expensive than OpenAI

At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what OpenAI charges.
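As a back-of-the-envelope illustration of the 40% figure, the sketch below compares monthly costs. The OpenAI rate of $0.006 per minute is an assumption taken from OpenAI's published Whisper API pricing at the time of writing, and 100,000 minutes is an arbitrary example volume; substitute current prices and your own volume.

```python
from decimal import Decimal

# Assumption: OpenAI's published Whisper API list price per minute at the time of writing.
OPENAI_PRICE_PER_MIN = Decimal("0.006")
# Voicegain Whisper is described as 40% less expensive than OpenAI.
DISCOUNT = Decimal("0.40")


def voicegain_price_per_min(openai_price: Decimal = OPENAI_PRICE_PER_MIN) -> Decimal:
    """Per-minute price after applying the 40% discount."""
    return openai_price * (1 - DISCOUNT)


def monthly_cost(minutes: int, price_per_min: Decimal) -> Decimal:
    """Total cost for a month's worth of audio minutes."""
    return price_per_min * minutes


if __name__ == "__main__":
    minutes = 100_000  # arbitrary example volume
    print("OpenAI:   ", monthly_cost(minutes, OPENAI_PRICE_PER_MIN))
    print("Voicegain:", monthly_cost(minutes, voicegain_price_per_min()))
```

Under these assumptions, 100,000 minutes a month would cost about $600 with OpenAI and $360 with Voicegain Whisper. Decimal is used so the currency arithmetic is exact.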

3. Enhanced features for Contact Centers & Meetings.

Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps, which are needed to map audio to text, are another important feature of our API. Enhanced diarization, which is already available on the Voicegain models and is required for contact center and meeting use cases, will soon be made available on Whisper as well.
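To illustrate why two-channel audio and word-level timestamps matter together, here is a minimal sketch that interleaves the words from two channels (e.g. agent and caller) into a single time-ordered conversation. The `Word` structure and its fields are hypothetical and do not reflect the actual Voicegain response format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Word:
    # Hypothetical word-level result: recognized text plus offsets in milliseconds.
    text: str
    start_ms: int
    end_ms: int


def merge_channels(channels: Dict[str, List[Word]], gap_ms: int = 1000) -> List[Tuple[str, str]]:
    """Interleave words from each channel by start time and group them into turns.

    A new turn begins when the speaker changes or when the pause between two
    consecutive words of the same speaker is longer than gap_ms.
    """
    # Flatten to (label, word) pairs and order them by when each word starts.
    tagged = [(label, w) for label, words in channels.items() for w in words]
    tagged.sort(key=lambda pair: pair[1].start_ms)

    turns: List[Tuple[str, List[str]]] = []
    prev_label, prev_end = None, None
    for label, word in tagged:
        new_turn = (
            label != prev_label
            or prev_end is None
            or word.start_ms - prev_end > gap_ms
        )
        if new_turn:
            turns.append((label, [word.text]))
        else:
            turns[-1][1].append(word.text)
        prev_label, prev_end = label, word.end_ms
    return [(label, " ".join(words)) for label, words in turns]


if __name__ == "__main__":
    agent = [Word("thank", 0, 300), Word("you", 300, 500), Word("for", 500, 650), Word("calling", 650, 1100)]
    caller = [Word("hi", 1400, 1600), Word("there", 1600, 1900)]
    for speaker, text in merge_channels({"agent": agent, "caller": caller}):
        print(f"{speaker}: {text}")
```

Because each channel carries exactly one speaker, the channel label doubles as the speaker label; with single-channel audio you would need diarization to get the same result.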

4. Premium Support and uptime SLAs.

We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.

About the OpenAI Whisper Model

OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model is based on an encoder-decoder transformer architecture and has shown significant performance improvements over previous models because it has been trained on a variety of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
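The Whisper encoder takes a log-Mel spectrogram computed from a fixed 30-second window of 16 kHz audio, so longer recordings are processed window by window. The sketch below shows only that windowing arithmetic; it is a simplification for illustration and not how Voicegain schedules decoding (the reference open-source `transcribe` function, for example, advances its window based on decoded timestamps rather than always by exactly 30 seconds).

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
WINDOW_SECONDS = 30       # each encoder input covers 30 seconds of audio
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def fixed_windows(num_samples: int) -> List[Tuple[int, int]]:
    """Split a recording into consecutive [start, end) sample ranges of at most 30 s.

    Whisper pads the final, shorter window up to 30 s before encoding;
    here we only report the real audio range covered by each window.
    """
    return [
        (start, min(start + WINDOW_SAMPLES, num_samples))
        for start in range(0, num_samples, WINDOW_SAMPLES)
    ]


if __name__ == "__main__":
    # A 75-second recording needs three windows: 0-30 s, 30-60 s, 60-75 s.
    for start, end in fixed_windows(75 * SAMPLE_RATE):
        print(f"{start / SAMPLE_RATE:.0f}s - {end / SAMPLE_RATE:.0f}s")
```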

OpenAI Whisper model encoder-decoder transformer architecture (Source)

Getting Started with Voicegain Whisper

Learn more about Voicegain Whisper by clicking here. Any developer, whether a one-person startup or a large enterprise, can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.

There are two ways to test Voicegain Whisper. They are outlined here. If you would like more information or if you have any questions, please email us at support@voicegain.ai.

Voicegain Speech-to-Text/ASR deployable on AWS VPC
Edge

The entire Voicegain Speech-to-Text/ASR platform and all the associated products, ranging from the Web Speech-to-Text (STT) APIs, Speech Analytics APIs, Telephony Bot APIs and the MRCP ASR engine to our logging and monitoring framework, can be deployed on "the Edge".

By "Edge" we mean that the core Deep Neural Network-based AI models that convert speech/audio into text run exclusively on hardware deployed in a client datacenter or, as of this announcement, on a compute instance in a Virtual Private Cloud (VPC). In either case, the Voicegain platform is "orchestrated" using the Voicegain Console, which is a web application deployed on the Voicegain cloud.

On the Edge, the Voicegain platform is deployed as a container on a Kubernetes cluster. Voicegain can also be accessed as a Cloud service by clients who do not want to manage either server hardware or VPC compute instances.

Edge Deployment in Datacenter/On-Premise

The Voicegain platform has always been deployable in a simple and automated way on hardware in a datacenter. Our clients procure compatible Intel Xeon-based servers with Nvidia GPUs, and they are able to install the entire Voicegain platform with a few clicks from the Cloud Portal (see these videos for a demonstration).

You can read about the advantages of this datacenter type of Edge Deployment in our previous blog post. To summarize, these advantages are:

  1. Low Network Latencies & High Network Reliability
  2. Lower Bandwidth Cost
  3. Data Privacy and Control
  4. Lower Computing Resource Cost

These benefits are now also available to enterprise clients that use a Virtual Private Cloud on AWS to run a portion of their enterprise workloads.




Edge Deployment on AWS "Private Cloud"

Many enterprises have migrated several enterprise workloads to AWS Cloud infrastructure in order to benefit from its scale, flexibility and ease of maintenance. While moving these workloads to the Cloud, these enterprises largely prefer the private cloud offerings of AWS, e.g., VPC network isolation, Site-to-Site VPN and dedicated compute instances. For these enterprises, any new workload should ideally be capable of running inside their AWS VPC. In particular, if an enterprise already has dedicated AWS compute instances or hosts, it could realize all four advantages of Edge Deployment listed above by deploying into its dedicated AWS infrastructure.

Voicegain Platform now deployable on AWS VPC

Recently, anticipating interest from some of our customers, we performed extensive tests of a complete deployment of our platform into AWS. Because the Voicegain platform is Kubernetes-based, there are essentially only two differences from deployment onto local on-premise hardware:

  • (rather obviously) How you prepare and set up the K8s cluster, in particular users and roles; for security, you will want to keep the Voicegain cluster separated this way from the rest of your AWS infrastructure.
  • How you enable network access to the provisioned deployment - you will need to modify inbound rules in the Security Group for the cluster, rather than modifying settings on your router/firewall.
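The Security Group change in the second point can be made in the AWS console or programmatically. Below is a minimal sketch using boto3's `authorize_security_group_ingress`. The security group ID, the port (443 is assumed here for HTTPS access) and the CIDR range are placeholders; the ports your deployment actually needs should be taken from Voicegain's AWS deployment instructions.

```python
from typing import Dict, List


def ingress_rule(cidr: str, description: str, port: int = 443) -> List[Dict]:
    """Build an IpPermissions list allowing inbound TCP traffic on one port from one CIDR.

    The default port 443 is an assumption for illustration only.
    """
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": description}],
        }
    ]


def open_port(group_id: str, cidr: str, description: str, port: int = 443) -> None:
    """Add the inbound rule to an existing Security Group (requires AWS credentials)."""
    import boto3  # imported lazily so ingress_rule() can be used without boto3 installed

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=ingress_rule(cidr, description, port),
    )


if __name__ == "__main__":
    # Placeholder values: replace with your cluster's Security Group and your office/VPN range.
    print(ingress_rule("203.0.113.0/24", "Voicegain access from office VPN"))
    # open_port("sg-0123456789abcdef0", "203.0.113.0/24", "Voicegain access from office VPN")
```

Restricting `CidrIp` to your own network or VPN range, rather than 0.0.0.0/0, keeps the Voicegain cluster isolated in the same spirit as the users-and-roles separation in the first point.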

Otherwise, the core of the deployment process is pretty much identical between on premise hardware and AWS VPC.

You can read the details of the AWS deployment process on Voicegain's GitHub page.

Sign up for a developer account on Voicegain.

If you are a developer building something that requires you to add or embed Speech-to-Text functionality (transcription, Voice Bots or Speech Analytics in contact centers, analyzing meetings or sales calls, etc.), we invite you to give Voicegain a try. You can start by signing up for a developer account and using our Free tier. You can also email us at info@voicegain.ai.

Accurate & Affordable Speech-to-Text for SignalWire developers
CPaaS

This blog post describes how SignalWire developers should integrate with Voicegain Speech-to-Text/ASR based on the application that they are building.

Voicegain offers a highly accurate Speech-to-Text/ASR option on SignalWire. Voicegain is very disruptively priced, and one of its main benefits is that it allows developers to customize the underlying acoustic models to achieve very high accuracy for specific accents or industry verticals.

#1: Real-time Transcription and Speech Analytics using LaML <Stream>

SignalWire developers can fork audio to Voicegain using the <Stream> instruction in LaML. The <Stream> instruction makes it possible to send raw audio streams from a running phone call over WebSockets, in near real time, to a specified URL.
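A minimal sketch of a LaML response that forks call audio this way is shown below. It assumes the TwiML-compatible `<Start><Stream url="..."/></Start>` form, in which streaming runs asynchronously while the following verbs execute. The websocket URL is a hypothetical placeholder for the endpoint Voicegain provides for your session, and the `<Say>`/`<Pause>` verbs simply keep the example call alive.

```python
import xml.etree.ElementTree as ET


def stream_laml(websocket_url: str, say_text: str = "This call may be transcribed.") -> str:
    """Return a LaML document that forks call audio to websocket_url and keeps the call going.

    <Start><Stream> streams audio asynchronously, so the verbs after it
    (here a <Say> followed by a <Pause>) continue to run while audio is sent.
    """
    response = ET.Element("Response")
    start = ET.SubElement(response, "Start")
    ET.SubElement(start, "Stream", url=websocket_url)
    ET.SubElement(response, "Say").text = say_text
    ET.SubElement(response, "Pause", length="60")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical placeholder: use the websocket URL Voicegain gives you for the session.
    print(stream_laml("wss://example.com/voicegain-audio"))
```

Building the XML with ElementTree rather than string concatenation ensures that URLs containing `&` are escaped correctly.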

Developers looking to just get the raw text/transcript may use the Voicegain STT API to get real time transcription of streamed audio from SignalWire.

For developers that need NLU tags like sentiment, named entities, intents and keywords, Voicegain's Speech Analytics API provides those metrics in addition to the transcript.

Applications of real-time transcription and speech analytics include live agent assist in contact centers, extraction of insights for sales calls conducted over telephony, and meeting analytics.

#2: Voice Bot or Directed Dialog Speech IVR

If you want to build a Voice Bot or a directed dialog Speech IVR application that handles calls coming over SignalWire, then we suggest using the Voicegain Telephony Bot API. This is a web callback API similar to LaML, with instructions or commands specifically helpful in building IVRs or Voice Bots. This API handles speech-to-text and DTMF digits, and also plays prompts (TTS, pre-recorded, or a combination).

Your calls are transferred from SignalWire to a Voicegain provided SIP endpoint (based on FreeSWITCH) using a simple SIP INVITE.

Voicegain Telephony Bot API allows you to build two types of applications:

  • Voice Bot applications using a Bot framework of your choice. The "ear" and "mouth" of the bot are provided by Voicegain, whereas the Bot Framework manages the dialog and extracts intents from the transcribed text. This blog post describes how you can build a Voice Bot using the RASA Bot Framework.
  • Directed Dialog IVRs using call flows and grammars. You can program them directly with the Telephony Bot API by implementing the appropriate callbacks. Alternatively, we provide a simple script that allows you to specify the entire IVR application declaratively in a YAML file. You can find a complete example of how to do this on our GitHub: platform/declarative-ivr at master · voicegain/platform (github.com)

#3: Custom Applications  

If your application has only a limited need for speech recognition, you can invoke the Voicegain STT API only as needed. Every time you need speech recognition in your application, you can simply start a new ASR session with Voicegain in either transcribe (large vocabulary transcription) or recognize (grammar-based recognition) mode. You can use the LaML <Stream> command to send the call audio to that session.

An example is a voice-controlled voicemail retrieval or dictation application, where the Voicegain recognize API is used in continuous mode and listens for commands like play, stop, next, etc.
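On the application side, a command-driven voicemail app like this boils down to mapping each recognition result onto a player action. The sketch below leaves out how results arrive from Voicegain (websocket or callback); the `VoicemailPlayer` class and its three-word command vocabulary are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoicemailPlayer:
    """Hypothetical voicemail state driven by recognized voice commands."""
    messages: List[str]
    index: int = 0
    playing: bool = False
    log: List[str] = field(default_factory=list)

    def handle(self, utterance: str) -> None:
        """Apply one recognition result such as 'play', 'stop' or 'next'."""
        command = utterance.strip().lower()
        if command == "play":
            self.playing = True
            self.log.append(f"playing {self.messages[self.index]}")
        elif command == "stop":
            self.playing = False
            self.log.append("stopped")
        elif command == "next":
            if self.index + 1 < len(self.messages):
                self.index += 1
                self.playing = True
                self.log.append(f"playing {self.messages[self.index]}")
            else:
                self.log.append("no more messages")
        else:
            # Anything outside the command vocabulary is ignored.
            self.log.append(f"ignored: {utterance!r}")


if __name__ == "__main__":
    player = VoicemailPlayer(messages=["msg-1", "msg-2"])
    for result in ["play", "next", "stop", "next"]:
        player.handle(result)
    print(player.log)
```

In recognize mode a grammar restricted to these commands keeps the recognition results within the vocabulary the handler expects.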

In addition to SignalWire, Voicegain also offers integrations with FreeSWITCH, using the built-in MRCP plug-in and a separate module for real-time transcription.

If you are a developer on SignalWire and would like to build an app that requires Speech-to-Text/ASR, you can sign up for a developer account using the instructions provided here.

Four ways to use Voicegain Speech-to-Text with Telnyx
CPaaS

This blog post describes 4 ways you can use Telnyx with Voicegain's Deep Neural Network-based Speech-to-Text/ASR platform.

#1: Real-time Transcription and Speech-Analytics

For developers looking to get the raw text/transcript, the Voicegain STT API supports real time transcription of streamed audio from Telnyx.

For conversational AI applications that need NLU tags like sentiment, named entities, intents and keywords in the submitted audio, Voicegain's real-time Speech Analytics API provides those metrics in addition to the transcript.

While both the STT API and the Speech Analytics API support multiple methods to stream audio, Voicegain recommends RTP streaming as the primary method with Telnyx. Developers can stream either 1-channel or 2-channel RTP (the two channels are tied together which is important for some Speech Analytics features).

You can use the Telnyx Call Control API to fork the call audio and send it to Voicegain. The Call Control API allows you to send inbound (rx) audio, outbound (tx) audio, or both; this is done using the fork_start command. You can find a complete example of the code needed for real-time transcription of a call here: platform/examples/telnyx/call_control_fork_of_bridged_call at master · voicegain/platform (github.com)
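A minimal sketch of building the fork_start request with the Python standard library follows. The endpoint path and the `rx`/`tx` fields taking `udp:<ip>:<port>` destinations reflect our reading of the Telnyx Call Control v2 reference and should be checked against the current Telnyx documentation; the API key, call control ID and RTP address are placeholders (the address would be the RTP ip:port of your Voicegain session).

```python
import json
import urllib.request

TELNYX_API = "https://api.telnyx.com/v2"


def fork_start_request(api_key: str, call_control_id: str,
                       rx_target: str, tx_target: str = "") -> urllib.request.Request:
    """Build (but do not send) a Telnyx fork_start request.

    rx_target / tx_target are "udp:<ip>:<port>" destinations (assumed format),
    e.g. the RTP address of a Voicegain session. Leave tx_target empty to fork
    only the inbound (rx) audio.
    """
    body = {"rx": rx_target}
    if tx_target:
        body["tx"] = tx_target
    return urllib.request.Request(
        url=f"{TELNYX_API}/calls/{call_control_id}/actions/fork_start",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values only.
    req = fork_start_request("KEY_PLACEHOLDER", "call-control-id", "udp:192.0.2.10:5004")
    print(req.get_method(), req.full_url, req.data.decode())
    # urllib.request.urlopen(req) would actually send the request.
```

Forking rx and tx to separate ports keeps the two call legs apart, which is what enables the 2-channel RTP option mentioned above.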

Applications of real-time transcription and speech analytics include live agent assist in contact centers, extraction of insights for sales calls conducted over telephony, and meeting analytics.

#2: Voice Bot or IVR using Voicegain Telephony Bot API

If you want to build a Voice Bot or an IVR application that handles calls coming over Telnyx, then we suggest using the Voicegain Telephony Bot API, a callback API similar in style to Twilio's TwiML. This API handles speech-to-text and DTMF digits, and also plays prompts (TTS, pre-recorded, or a combination).

Your calls are transferred from Telnyx to Voicegain using a simple SIP INVITE, which is accomplished using the Telnyx Call Control Dial command. You can find a complete example of how to do this here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)

Voicegain Telephony Bot API allows you to build two types of applications:

  • Voice Bot applications using either your own Bot framework or frameworks like RASA or Google Dialogflow for the bot logic. The "ear" and "mouth" of the bot are provided by Voicegain. This blog post shows how a Voice Bot using RASA can be constructed: Easy How-To: Build a Voicebot using Voicegain, RASA, and AWS Lambda.
  • Alternatively, you can build more traditional IVRs using call flows and grammars. You can program them directly with the Telephony Bot API by implementing the appropriate callbacks, or use the simple script we provide that allows you to specify the entire IVR application declaratively in a YAML file. You can find a complete example of how to do this on our GitHub: platform/declarative-ivr at master · voicegain/platform (github.com)

#3: Use Voicegain STT API as needed in your Call Control App

If your application has only a limited need for speech recognition, you can invoke the Voicegain STT API only as needed. Every time you need speech recognition, you simply start a new ASR session with Voicegain in either transcribe (large vocabulary transcription) or recognize (grammar-based recognition) mode. The session will return an RTP ip:port to which you can fork your Telnyx audio. You can receive speech-to-text results either over a websocket or via a callback. After you are done with the transcription/recognition session, you stop the Telnyx audio fork.

An example application that could be built this way is a voice-controlled voicemail retrieval application, where the Voicegain recognize API is used in continuous mode and listens for commands like play, stop, next, etc.

#4: Build your own Voice Bot using Long-Session STT API

Finally, you could use the Voicegain Long-Session API (planned to be released later in 2021). This API allows you to establish a single long session that takes an ongoing stream of inbound audio from Telnyx (via the fork command). Once the session is established, you can issue commands for transcription or recognition. They would return results upon finding a speech endpoint or when you explicitly stop them. After processing the results, you could issue additional commands on the same Voicegain session.

In addition to returning recognition results, the Long-Session STT API returns important events, e.g. start-of-speech, which allows you to implement proper barge-in behavior.
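Barge-in handling on top of such an event stream could look like the sketch below: when a start-of-speech event arrives while a prompt is playing, the prompt is stopped so the caller is not talking over it. Because the Long-Session API had not been released when this was written, the event names (`start-of-speech`, `result`), the event dictionary layout and the `PromptPlayer` class are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptPlayer:
    """Hypothetical stand-in for whatever plays prompts on the call."""
    playing: bool = False
    actions: List[str] = field(default_factory=list)

    def play(self, prompt: str) -> None:
        self.playing = True
        self.actions.append(f"play:{prompt}")

    def stop(self) -> None:
        self.playing = False
        self.actions.append("stop")


def on_event(event: dict, player: PromptPlayer, barge_in: bool = True) -> None:
    """React to one session event (event names are illustrative only)."""
    kind = event.get("type")
    if kind == "start-of-speech" and barge_in and player.playing:
        # Caller started talking: cut the prompt so they are not talking over it.
        player.stop()
    elif kind == "result":
        player.actions.append(f"result:{event.get('text', '')}")


if __name__ == "__main__":
    player = PromptPlayer()
    player.play("main-menu")
    on_event({"type": "start-of-speech"}, player)
    on_event({"type": "result", "text": "billing"}, player)
    print(player.actions)
```

The `barge_in` flag reflects the usual IVR practice of disabling barge-in for prompts that must be heard in full, such as legal disclaimers.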

Using this API, you could build your own Voice Bot just like the Voice Bots from #2, but with more control over your Telnyx session, e.g. you could use conference commands.

Speech Analytics Comparison: NER Capabilities & Accuracy
Benchmark

This post is the first in a series of posts that compares the performance of Voicegain Speech Analytics against Google and Amazon. This post compares the capabilities and accuracy of recognition/extraction of Named Entities. The Google APIs used for comparison were those under Cloud Natural Language and the Amazon APIs were under AWS Comprehend.

Named Entity Recognition (NER), or extraction of Named Entities, is one of the features of the Voicegain Speech Analytics API. Named Entity Recognition locates and classifies named entities in unstructured text that may be obtained, e.g., from the transcription of audio files. Although there is a lot of overlap between Google, Amazon and Voicegain with respect to the classification categories, there are also some significant differences, which are summarized below.

Supported NER Categories



The full spreadsheet linked here shows the named entities extracted by the Voicegain Speech Analytics API and compares them to the named entity categories available in the Google and Amazon Comprehend APIs. Amazon has two NER APIs: Entity and PII Entity.

If you look at the spreadsheet, you will see that the Amazon non-PII Entity API offers little granularity in its named entity categories. For example, it groups a lot of numerical named entities into a single QUANTITY category, and it groups dates and times (of day) into a single DATE category. On the other hand, the PII Entity API has many fine-grained categories for items that are typically PII-redacted, but it misses a lot of other common entity categories.

The Google API seems to cover the usual categories but misses some entities used in call-center applications, e.g. CC, SSN, EMAIL.

A category that Voicegain does not support is OTHER. This category, which is available in Google and Amazon, requires additional application logic to interpret the string that it matches.

Accuracy Comparison

We tested all 4 APIs on a set of call center calls.

The overall results show that Voicegain and the Amazon non-PII API detect similar named entities (with the caveat that the Amazon NER categories are less specific). Compared to these two, the Google NER API misses more entities, but it also marks many additional words as falling into the OTHER category (which is generally not very useful, at least not when analyzing call center calls).

Looking at the Amazon PII Entity API, we noticed the following:

  • NAME and BANK_ACCOUNT_NUMBER worked well
  • EMAIL and PHONE worked mostly OK, but had some strange false positives
  • CREDIT_DEBIT_NUMBER had false positives (e.g. from phone numbers) or partial matches
  • DATE_TIME did not pick up all the phrases that its description says this category should recognize
  • ADDRESS worked with mixed success, sometimes missing clear address text or recognizing only part of it
  • EXPIRY_DATE had many false positives: combinations of 4 digits that clearly were not valid expiry dates
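Some EXPIRY_DATE false positives of the kind described above could be filtered with a simple post-processing check. The sketch below is a hypothetical filter, not part of any of the compared APIs: it keeps only candidates that parse as an MMYY or MM/YY card expiry with a valid month and a plausible year. The year window is an arbitrary assumption.

```python
import re
from datetime import date
from typing import Optional

# Four digits, optionally written as MM/YY (spaces around the slash tolerated).
_EXPIRY = re.compile(r"^(\d{2})\s*/?\s*(\d{2})$")


def plausible_expiry(text: str, today: Optional[date] = None,
                     years_back: int = 5, years_ahead: int = 20) -> bool:
    """Return True if text reads as an MMYY / MM/YY card expiry within a sane year window.

    The year window is an arbitrary assumption: recently expired cards are still
    legitimate mentions, but dates decades away are almost certainly not expiries.
    """
    match = _EXPIRY.match(text.strip())
    if not match:
        return False
    month, year = int(match.group(1)), 2000 + int(match.group(2))
    if not 1 <= month <= 12:
        return False
    today = today or date.today()
    return today.year - years_back <= year <= today.year + years_ahead


if __name__ == "__main__":
    ref = date(2021, 6, 1)
    for candidate in ["0925", "09/25", "1325", "1999", "0199"]:
        print(candidate, plausible_expiry(candidate, today=ref))
```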

Where Voicegain has a matching entity category for an AWS PII Entity, it performed the same or better. As you can see, it is difficult to summarize the results because the entities are not directly comparable. If you want to know how Voicegain NER will perform on your data, we suggest you test the Voicegain Speech Analytics API, which includes NER, keyword and phrase detection, sentiment analysis, etc.

For testing, you have two options:

  1. You can create a free developer account on the Voicegain Platform. Here is how you can sign up. Once you sign up, please use the Transcribe+ feature. If you have any questions, please email us at support@voicegain.ai
  2. You can also use the beta version of our Speech Analytics app and upload your 2-channel audio recording. To get access, please email us at support@voicegain.ai
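If you test NER on your own recordings, one common way to score the output (not necessarily the method behind the comparison above) is exact matching of span and category against hand-labelled reference entities. A minimal sketch, assuming a hypothetical `(start_char, end_char, category)` representation:

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, category): a hypothetical representation


def precision_recall_f1(predicted: Iterable[Entity],
                        reference: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match scoring: an entity counts only if both span and category match."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [(0, 5, "NAME"), (10, 14, "EXPIRY_DATE")]
    reference = [(0, 5, "NAME"), (20, 32, "PHONE")]
    print(precision_recall_f1(predicted, reference))
```

Because the vendors' categories are not directly comparable, you would first need to map each API's categories onto a common label set before scoring; partial-span matches could also be credited if exact boundaries are too strict for your use case.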
SIP INVITE Voicegain from Twilio, SignalWire, Telnyx CPaaS
CPaaS

The Voicegain Telephony Bot API allows developers to use Voicegain Speech-to-Text to build Voice Bots or programmable speech IVRs using a simple callback API. With the latest Voicegain Platform release, 1.21.0, it is now possible to establish SIP sessions to the Voicegain Telephony Bot API using a simple SIP INVITE.


Before release 1.21.0, the only way for voice app developers to use the Voicegain Telephony Bot API was to call the application using phone numbers that were purchased from Voicegain (via the Web Console). However, we have always wanted to allow clients to bring their own carrier or CPaaS platform and this release allows developers to do just that.


At Voicegain, our focus is on offering our ASR/Speech Recognition functionality and our full-featured Speech-to-Text APIs. We understand that developers rely on their CPaaS platforms for a whole host of important features: messaging, emails, conferencing and international coverage. Now it is possible to integrate the Voicegain Telephony Bot API with any CPaaS that supports SIP INVITE. You can combine the powerful and affordable Speech Recognition features of the Voicegain Platform with the comprehensive API features of these CPaaS platforms.


We have already tested SIP Invite extensively on Twilio, SignalWire, and Telnyx platforms. Other similar platforms should also work without issues. We will report any additional platforms that we have explicitly tested in the future.


How SIP INVITE works with Twilio & SignalWire

On the Twilio and SignalWire platforms, it is trivial to establish a SIP session to Voicegain. The only thing needed is the <Dial><Sip> command from TwiML or LaML, for example:
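A minimal sketch of generating such a TwiML/LaML response is shown below. The application identifier and SIP host are hypothetical placeholders (Voicegain assigns the actual SIP URI user name to each Telephony Bot Application), `X-Session-Id` is an illustrative custom header, and the `;transport=tcp` parameter and `?header=value` convention for passing custom headers should be checked against your CPaaS's `<Sip>` documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def dial_sip_twiml(app_id: str, sip_host: str, headers: dict, transport: str = "tcp") -> str:
    """Return TwiML/LaML that bridges the current call to a SIP URI.

    app_id and sip_host are placeholders for the values Voicegain assigns.
    Custom headers are appended as URI query parameters so the host session
    can be correlated with the Voicegain session.
    """
    # SIP URI parameters (;transport=...) come before the ?header=value part.
    uri = f"sip:{app_id}@{sip_host};transport={transport}"
    if headers:
        uri += "?" + "&".join(f"{quote(k)}={quote(str(v))}" for k, v in headers.items())
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    ET.SubElement(dial, "Sip").text = uri
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(dial_sip_twiml("APP_ID_PLACEHOLDER", "sip.example.com", {"X-Session-Id": "12345"}))
```

In TwiML, verbs placed after `<Dial>` normally run once the dialed leg ends, which is how control can continue on the host platform after the Voicegain `disconnect`.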



Some notes about the above example:

  • The SIP URI user name is a unique random identifier assigned on Voicegain Platform to each Telephony Bot Application.
  • After the SIP connection is established, the application prompts and speech recognition will be under the control of the Voicegain Platform, based on commands passed using our Telephony Bot API.
  • Once the Voicegain `disconnect` command is issued, control of the application flow will be returned to the host platform (i.e. Twilio, SignalWire or any other CPaaS platform).
  • It is possible to pass custom headers to Voicegain during SIP Invite - this way it is possible to associate host sessions with Voicegain sessions.
  • It is possible to make multiple <Dial><Sip> requests to Voicegain from host application during a single host session.

On our GitHub you can find sample code showing how to dial an outbound call and then bridge it to Voicegain SIP:

What about Telnyx?

On Telnyx, we tested SIP INVITE using the Telnyx Call Control API. The only functional difference from Twilio and SignalWire is that on Telnyx you cannot choose TCP as the SIP transport (only UDP is supported).

Here is sample Python code showing how to dial Voicegain SIP:


The complete code for an AWS Lambda function that dials a number using Telnyx and then bridges it to Voicegain SIP is available here: platform/telnyx-dial-outbound-lambda.py at master · voicegain/platform (github.com)
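For orientation, a minimal sketch of the Dial request itself is shown below; it builds the request without sending it. The `/v2/calls` endpoint and the `connection_id`, `to`, `from` and `custom_headers` fields reflect our reading of the Telnyx Call Control v2 reference and should be verified there. The SIP URI, caller ID, connection ID and `X-Session-Tag` header are hypothetical placeholders. No transport is set, since, as noted above, only UDP worked on Telnyx in our tests.

```python
import json
import urllib.request

TELNYX_CALLS_URL = "https://api.telnyx.com/v2/calls"


def dial_voicegain_request(api_key: str, connection_id: str, sip_uri: str,
                           caller_id: str, session_tag: str) -> urllib.request.Request:
    """Build (but do not send) a Telnyx Call Control Dial request to a SIP URI."""
    body = {
        "connection_id": connection_id,
        "to": sip_uri,          # SIP URI of the Voicegain Telephony Bot application (placeholder)
        "from": caller_id,
        # Illustrative custom SIP header for correlating the Telnyx call with the Voicegain session.
        "custom_headers": [{"name": "X-Session-Tag", "value": session_tag}],
    }
    return urllib.request.Request(
        TELNYX_CALLS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = dial_voicegain_request("KEY_PLACEHOLDER", "CONNECTION_ID", "sip:APP_ID@sip.example.com",
                                 "+15550100000", "session-1")
    print(req.get_method(), req.full_url, req.data.decode())
```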


What can I build with the Telephony Bot API?

Our Telephony Bot API is a callback API in a similar fashion to TwiML or LaML. The main difference is that it is based on JSON and its functionality is focused on Speech Recognition. You can read more about it in our blog post announcing the release of that API back in August.


On our GitHub you can find an example of a Node.js function on AWS Lambda that demonstrates how to interface the Voicegain Telephony Bot API with a RASA NLU bot: platform/examples/voicebot-lambda-vg-rasa at master · voicegain/platform (github.com)


You can also check out our sample Python function code on AWS Lambda, which shows how to implement more traditional (VoiceXML-like) IVRs using speech grammars on top of our Telephony Bot API: platform/declarative-ivr at master · voicegain/platform (github.com)

How to sign up for a developer account and start using Voicegain
Developers

Here are all the steps needed to sign up for a developer account on the Voicegain Platform. Once you have the account, you can access the Web Console, and you can find all the info on how to use the Web Console and the APIs in our Zendesk Knowledge Base.

1. Start at console.voicegain.ai/signup  

2. Enter your name and email. If you wish, you can check the Terms of Service and/or Privacy Policy.

3. On the next page, let us know how you learned about Voicegain and how you want to use Voicegain, and accept the Terms of Service.


5. After you click Next, Voicegain will send you an email with the link to the next step. If you do not get the email, please check your Junk Mail folder, and if it is not there, please follow the instructions on the page shown below.



6. Once you get the email, click on the Set Password button.


7. You will be directed to a web page where you can set your Voicegain password.


8. After you click (Re)set Password you will be directed to the login page where you can enter your login credentials.



9. On the next page click the right arrow icon next to "Cloud Web Console"


10. This will take you to the home page of the Voicegain Web Console. You can follow the mini tutorial that is available on the home page.


11. Help articles are available under the question mark (?) menu, where you will also find our helpdesk support link. Note that some of the support articles are available only to logged-in users, while others are public.



Sign up for an app today
* No credit card required.

Enterprise

Interested in customizing the ASR or deploying Voicegain on your infrastructure?

Contact Us → 
Voicegain - Speech-to-Text
Under Your Control