Access OpenAI’s Whisper model with Voicegain's easy-to-use REST APIs. Get Voicegain enterprise support, SOC2 and PCI compliance and added features like two-channel (stereo) support, diarization, word-level timestamps and more.
Whisper is an open-source deep-learning-based automatic speech recognition (ASR) model developed by OpenAI. Whisper is trained on 680,000 hours of multilingual data, which enables it to work well with a range of accents and background noise.
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer.
Developers can easily feed the transcript output to an LLM like GPT to improve transcript readability, summarize conversations, extract sentiment and run further analytics.
OpenAI Whisper ASR can transcribe in multiple languages. The following 57 languages have a Word Error Rate of < 50%. Check out our fine-tuning services to get a better ASR.
Whisper is predominantly trained for English and hence Word Error Rates for other languages might still be high. Voicegain offers Whisper fine-tuning services on your data to get higher accuracy and lower WER.
The Voicegain Whisper Speech-to-Text API is affordably priced at $0.25/hour (for a US-based instance). This is 40% lower than OpenAI’s price (as of Dec 2023).
Deploy Voicegain Whisper in your datacenter or in your VPC instance for maximum data privacy and control. Ingest our logs and metrics into your Grafana to monitor performance.
Voicegain Whisper adds key features like diarization and word-level timestamps to OpenAI’s Whisper.
Voicegain offers high-touch 24/7 enterprise-class support for the Whisper model. This allows developers to focus their efforts on LLM optimization and use our APIs for ASR.
Voicegain is a PCI-DSS and SOC-2 Compliant organization. We redact all the PCI and PII related entities – both in the transcript and audio. We scan the underlying code for any vulnerabilities and keep all libraries current.
Whisper has been predominantly trained on publicly available English datasets. Voicegain can fine-tune Whisper on your data to reduce the WER on your dataset.
You can sign up today for a developer account using your business email address.
If you want to quickly check the accuracy of Whisper without writing code, you can check out the first option mentioned below. If you are a developer and you want to actually test our APIs, check out Option 2 below.
Today we are really excited to announce the launch of Voicegain Whisper, an optimized version of OpenAI's Whisper speech recognition/ASR model that runs on Voicegain-managed cloud infrastructure and is accessible using Voicegain APIs. Developers can use the same well-documented, robust APIs and infrastructure that process over 60 million minutes of audio every month for leading enterprises like Samsung and Aetna and innovative startups like Level.AI, Onvisource and DataOrb.
The Voicegain Whisper API is a robust and affordable batch Speech-to-Text API for developers who are looking to integrate conversation transcripts with LLMs like GPT 3.5 and 4 (from OpenAI), PaLM2 (from Google), Claude (from Anthropic), LLAMA 2 (open source from Meta), and their own private LLMs to power generative AI apps. OpenAI has open-sourced several versions of the Whisper model. With today's release, Voicegain supports Whisper-medium, Whisper-small and Whisper-base. Voicegain now supports transcription in the multiple languages that are supported by Whisper.
Here is a link to our product page
There are four main reasons for developers to use Voicegain Whisper over other offerings:
While developers can use Voicegain Whisper on our multi-tenant cloud offering, a big differentiator for Voicegain is our support for the Edge. The Voicegain platform has been architected and designed for single-tenant private cloud and datacenter deployment. In addition to the core deep-learning-based Speech-to-Text model, our platform includes our REST API services, logging and monitoring systems, auto-scaling, and offline task and queue management. Today the same APIs are enabling Voicegain to process over 60 million minutes a month. We can bring this practical real-world experience of running AI models at scale to our developer community.
Since the Voicegain platform is deployed on Kubernetes clusters, it is well suited for modern AI SaaS product companies and innovative enterprises that want to integrate with their private LLMs.
At Voicegain, we have optimized Whisper for higher throughput. As a result, we are able to offer access to the Whisper model at a price that is 40% lower than what Open AI offers.
Voicegain also offers critical features for contact centers and meetings. Our APIs support two-channel stereo audio, which is common in contact center recording systems. Word-level timestamps, which are needed to map audio to text, are another important feature that our API offers. Enhanced diarization, a required feature for contact center and meeting use cases that is already available for the Voicegain models, will soon be made available on Whisper.
We also offer premium support and uptime SLAs for our multi-tenant cloud offering. These APIs today process over 60 million minutes of audio every month for our enterprise and startup customers.
OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model is an encoder-decoder Transformer and has shown significant performance improvement compared to previous models because it has been trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.
Learn more about Voicegain Whisper by clicking here. Any developer - whether a one-person startup or a large enterprise - can access the Voicegain Whisper model by signing up for a free developer account. We offer 15,000 minutes of free credits when you sign up today.
There are two ways to test Voicegain Whisper. They are outlined here. If you would like more information or if you have any questions, please drop us an email at support@voicegain.ai.
On March 1st 2023, OpenAI announced that developers could access the OpenAI Whisper Speech-to-Text model via easy-to-use REST APIs. OpenAI also released APIs to GPT-3.5, the LLM behind the buzzy ChatGPT product. General availability of the next version of the LLM, GPT-4, is expected in July 2023.
Since OpenAI Whisper's initial release in October 2022, it has been a big draw for developers. A highly accurate open-source ASR is extremely compelling. OpenAI's Whisper has been trained on 680,000 hours of audio data, which is much more than what most models are trained on. Here is a link to their GitHub.
However, the developer community looking to leverage Whisper faces three major limitations:
1. Infrastructure Costs: Running Whisper - especially the large and medium models - requires expensive memory-intensive GPU based compute options (see below).
2. In-house AI expertise: To use Open AI's Whisper model, a company has to invest in building an in-house ML engineering team that is able to operate, optimize and support Whisper in a production environment. While Whisper provides core features like Speech-to-Text, language identification, punctuation and formatting, there are still some missing AI features like speaker diarization and PII redaction that would need to be developed. In addition, companies would need to put in place a real-time NOC for ongoing support. Even a small scale 2-3 person developer team could be expensive to hire and maintain - unless the call volumes justify such an investment. This in-house team also needs to take full responsibility for the Cloud infrastructure related tasks like auto-scaling and log monitoring to ensure uptime.
3. Lack of support for real-time: Whisper is a batch speech-to-text model. Developers who require streaming Speech-to-Text need to evaluate other ASR/STT options.
By taking over the responsibility of hosting this model and making it accessible via easy-to-use APIs, both OpenAI and Voicegain address the first two limitations.
Aug 2023 Update: On Aug 5th 2023, Voicegain announced the release of Voicegain Whisper, an optimized version of OpenAI's Whisper that is accessible using Voicegain APIs. Here is a link to the announcement. In addition to Voicegain Whisper, Voicegain also offers real-time/streaming Speech-to-Text and other features like two-channel/stereo support (required for call centers), speaker diarization and PII redaction. All of this is offered in Voicegain's PCI and SOC-2 compliant infrastructure.
This article highlights some of the key strengths and limitations of using Whisper - whether using Open AI's APIs, Voicegain APIs or hosting it on your own.
In our benchmark tests, OpenAI's Whisper models demonstrated high accuracy across a diverse range of audio datasets. Our ML engineers concluded that the Whisper models perform well on audio datasets ranging from meetings, podcasts, classroom lectures and YouTube videos to call center audio. We benchmarked Whisper-base, Whisper-small and Whisper-medium against some of the best ASR/Speech-to-Text engines in the market.
The median Word Error Rate (WER) for Whisper-medium was 11.46% for meeting audio and 17.7% for call center audio. This was indeed lower than the WERs of STT offerings of other large players like Microsoft Azure and Google. We did find that AWS Transcribe had a WER that is competitive with Whisper.
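For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not the benchmarking code behind the figures above, and the function name and the simple lowercase/whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please hold while I transfer your call"
    hyp = "please hold while I transfer the call"
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Real benchmarks usually normalize punctuation, numbers and casing before scoring, so absolute WER values depend on those normalization choices as well as on the model.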
Here is an interesting observation: it is possible to exceed Whisper's recognition accuracy, but doing so takes building custom models. Custom models are models that are trained on a client's specific audio data. In fact, for call center audio, our ML engineers were able to demonstrate that our call-center-specific Speech-to-Text models were equal to or even better than some of the Whisper models. This makes intuitive sense because call center audio is not readily available on the internet for OpenAI to access.
Please contact us via email (support@voicegain.ai) if you would like to review and validate/test these accuracy benchmarks.
Whisper's pricing at $0.006/min ($0.36/hour) is much lower than the Speech-to-Text offerings of some of the other larger cloud players. This translates to a 75% discount to Google Speech-to-Text and AWS Transcribe (based on pricing as of the date of this post).
Aug 2023 Update: At the launch of Voicegain Whisper, Voicegain announced a list price of $0.0037/min ($0.225/hour). This price is 37.5% lower than OpenAI's price and was made possible by optimizing the throughput of Whisper. To test it out, please sign up for a free developer account. Instructions are provided here.
Also significant was that OpenAI announced the release of the ChatGPT APIs together with the Whisper APIs. Developers can combine the power of Whisper Speech-to-Text models with the GPT 3.5 and GPT 4.0 LLMs (the underlying models that ChatGPT uses) to power very interesting conversational AI apps. However, here is an important consideration: using the Whisper API with LLMs like ChatGPT works only as long as the app uses batch/pre-recorded audio (e.g. analyzing recordings of call center conversations for QA or compliance, or transcribing and mining Zoom meetings to recollect context). Developers looking to build Voice Bots or Speech IVRs would need a good real-time Speech-to-Text model.
As stated above, OpenAI's Whisper does not support apps that require real-time/streaming transcription. This could be relevant to a wide variety of AI apps that target call center, education, legal and meeting use cases. In case you are looking for a streaming Speech-to-Text API provider, please feel free to contact us at the email address provided below.
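The price comparison can be checked with a few lines of arithmetic. The Python sketch below uses only the prices quoted in this post; the helper name `percent_lower` is made up for illustration. Note that the per-minute figure looks like a rounded value, since $0.225/hour works out to $0.00375/min.

```python
def percent_lower(price: float, baseline: float) -> float:
    """How much lower `price` is than `baseline`, expressed as a percentage."""
    return (1 - price / baseline) * 100


# Prices quoted in this post (USD).
openai_per_hour = 0.006 * 60   # $0.006/min is $0.36/hour
voicegain_per_hour = 0.225     # Voicegain Whisper list price per hour

discount = percent_lower(voicegain_per_hour, openai_per_hour)
print(f"OpenAI Whisper API: ${openai_per_hour:.2f}/hour")
print(f"Voicegain Whisper:  ${voicegain_per_hour:.3f}/hour ({discount:.1f}% lower)")
```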
The throughput of Whisper models - both for the medium and large models - is relatively low. At Voicegain, our ML engineers have tested the throughput of Whisper models on several popular NVIDIA GPU-based compute instances available in public clouds (AWS, GCP, Microsoft Azure and Oracle Cloud). We also have real-life experience because we process over 10 million hours of audio annually. As a result, we have a strong understanding of what it takes to run a model like OpenAI's Whisper in a production environment.
We have found that the infrastructure cost of running Whisper-medium in a cloud environment is in the range of $0.07 - $0.10/hour. You can contact us via email to get the in-depth assumptions and backup behind our cost model. An important factor to note is that in a single-tenant production environment the compute infrastructure cannot be run at a very high utilization. The peak throughput required to support real-life traffic can be several times (2-3x) the average throughput. Net-net, we determined that while developers would not have to pay for software licensing, the cloud infrastructure costs would still remain substantial.
In addition to this infrastructure cost, the larger expense of running Whisper on the Edge (On-Premise + Private Cloud) is that it would require a dedicated back-end Engineering & DevOps team that can chop the audio recording into segments that can be submitted to Whisper and perform the queue management. This team would also need to oversee all info-sec and compliance needs (e.g. running vulnerability scans, intrusion detection etc.).
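To see why utilization matters, consider a rough sketch in Python. It assumes that capacity is provisioned for peak traffic, so average utilization is roughly one over the peak-to-average ratio. The function name and all input numbers are hypothetical placeholders, not figures from our cost model.

```python
def cost_per_audio_hour(instance_cost_per_hour: float,
                        audio_hours_per_instance_hour: float,
                        peak_to_average: float) -> float:
    """Effective compute cost per hour of audio when sized for peak traffic.

    Assumption: capacity is provisioned for peak load, so on average only
    1 / peak_to_average of the provisioned throughput is actually used.
    """
    if peak_to_average < 1:
        raise ValueError("peak_to_average must be >= 1")
    utilization = 1.0 / peak_to_average
    return instance_cost_per_hour / (audio_hours_per_instance_hour * utilization)


# Hypothetical inputs: a $1.00/hour GPU instance that can transcribe
# 20 hours of audio per wall-clock hour when fully loaded.
for ratio in (1.0, 2.0, 3.0):
    cost = cost_per_audio_hour(1.00, 20.0, ratio)
    print(f"peak/average = {ratio:.0f}x -> ${cost:.3f} per audio hour")
```

Under these assumptions the effective cost scales linearly with the peak-to-average ratio, which is why a single-tenant deployment pays noticeably more per audio hour than the fully loaded instance price suggests.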
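As an illustration of the segmenting and queueing work described above, here is a minimal Python sketch. It only computes segment boundaries and places them on an in-memory queue; it does not read audio or call any Whisper or Voicegain API. The function names are hypothetical, and the 30-second default reflects the window length Whisper processes at a time.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def split_into_segments(duration_sec: float, segment_sec: float = 30.0) -> List[Segment]:
    """Cut a recording of the given length into consecutive fixed-size segments."""
    if duration_sec < 0 or segment_sec <= 0:
        raise ValueError("duration must be >= 0 and segment length must be > 0")
    segments: List[Segment] = []
    start = 0.0
    while start < duration_sec:
        end = min(start + segment_sec, duration_sec)
        segments.append((start, end))
        start = end
    return segments


def build_work_queue(recordings: Dict[str, float]) -> Deque[Tuple[str, Segment]]:
    """Queue (recording_id, segment) jobs in first-in, first-out order."""
    queue: Deque[Tuple[str, Segment]] = deque()
    for recording_id, duration_sec in recordings.items():
        for segment in split_into_segments(duration_sec):
            queue.append((recording_id, segment))
    return queue


if __name__ == "__main__":
    # Hypothetical recordings: id -> duration in seconds.
    jobs = build_work_queue({"call-001": 75.0, "call-002": 20.0})
    while jobs:
        recording_id, (start, end) = jobs.popleft()
        print(f"{recording_id}: transcribe {start:.0f}s to {end:.0f}s")
```

Fixed-length cuts can split a word in half, so a production system would more likely cut at silences or overlap adjacent segments, and would use a durable queue with retries rather than an in-memory one.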
As of the publication of this post, Whisper does not have a multi-channel audio API. So if your application involves audio with multiple speakers recorded on separate channels, each channel has to be transcribed separately, and Whisper's effective price per minute = number of channels * $0.006. For both meetings and call center use cases, this pricing can become prohibitive.
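The formula above is simple enough to express directly. The Python sketch below uses the $0.006/min price quoted in this post; the function names are illustrative, and the call length in the example is a made-up value.

```python
WHISPER_PRICE_PER_MIN = 0.006  # USD per minute, as quoted in this post


def effective_price_per_min(num_channels: int,
                            price_per_min: float = WHISPER_PRICE_PER_MIN) -> float:
    """Price per minute of a recording when each channel is billed separately."""
    if num_channels < 1:
        raise ValueError("a recording has at least one channel")
    return num_channels * price_per_min


def recording_cost(recording_minutes: float, num_channels: int) -> float:
    """Total cost of transcribing every channel of one recording."""
    return recording_minutes * effective_price_per_min(num_channels)


if __name__ == "__main__":
    # Example: a 10-minute two-channel (agent + caller) call recording.
    print(f"Effective price: ${effective_price_per_min(2):.3f}/min")
    print(f"Cost of the call: ${recording_cost(10, 2):.2f}")
```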
This release of Whisper is missing some key features that developers would need. The three important features we noticed are Diarization (speaker separation), Time-stamps and PII Redaction.
Voicegain is working on releasing a Voicegain-Whisper model over its APIs. With this, developers can get the benefits of Voicegain's PCI/SOC-2 compliant infrastructure and advanced features like diarization, PII redaction and time-stamps. To join the waitlist, please email us at sales@voicegain.ai.
At Voicegain, we build deep-learning-based Speech-to-Text/ASR models that match or exceed the accuracy of STT models from the large players. For over 4 years now, startup and enterprise customers have used our APIs to build and launch successful products that process over 600 million minutes annually. We focus on developers that need high accuracy (achieved by training custom acoustic models) and deployment in private infrastructure at an affordable price. We provide an accuracy SLA where we guarantee that a custom model trained on your data will be at least as accurate as the most popular options, including OpenAI's Whisper.
We also have models that are trained specifically on call center audio. While OpenAI is a worthy competitor (and of course a much larger company with 100x our resources), as developers we welcome the innovation that OpenAI is unleashing in this market. By adding ChatGPT APIs to our Speech-to-Text, we are planning to broaden our API offerings to the developer community.
To sign up for a developer account on Voicegain with free credits, click here.