In our previous post we described how Voicegain provides grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.
Starting with release 1.16.0 of the Voicegain Platform and API, it is possible to use Voicegain speech-to-text for speech transcription (without grammars) to achieve functionality similar to TwiML <Gather>.
The reasons we think it will be attractive to Twilio users are:
- lower cost per speech-to-text capture
- higher accuracy for customers who choose Acoustic Model customization
- access to all speech-to-text hypotheses in word-tree output mode
Using Voicegain as an alternative to <Gather> involves steps similar to those for grammar-based recognition; they are described below.
Initiating Speech Transcription with Voicegain
This is done by invoking the Voicegain async transcribe API: /asr/transcribe/async
Below is an example of the payload needed to start a new transcription session:
Some notes about the content of the request:
- the callback is requested to return the transcript in text form; other options are words (individual words with confidences) and word-tree (words organized in a tree of recognition hypotheses)
- startInputTimers tells the ASR to delay the start of its timers; they will be started later, when the question prompt finishes playing
- TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
- the asr settings include the two timeouts used in transcription: the no-input timeout and the complete timeout
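To make these notes concrete, here is a minimal Python sketch that assembles such a request body. Only /asr/transcribe/async, startInputTimers, TWIML, PCMU, the 8 kHz rate, and the "content" : {"full" : ["transcript"]} setting come from this post. The nesting of the fields, the callback "uri" key, and the names noInputTimeout and completeTimeout are assumptions made for illustration, and build_transcribe_request is a hypothetical helper; please check the Voicegain API reference for the exact schema.

```python
import json


def build_transcribe_request(callback_uri, no_input_timeout_ms, complete_timeout_ms):
    """Assemble a request body for /asr/transcribe/async.

    The field layout is an assumption based on the notes above,
    not a copy of the official schema.
    """
    return {
        "sessions": [
            {
                # Assumed key names for the callback target.
                "callback": {"uri": callback_uri},
                # Ask for the transcript in plain text form.
                "content": {"full": ["transcript"]},
            }
        ],
        "audio": {
            # Twilio Media Streams is the streaming protocol.
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",  # u-law
            "rate": 8000,  # 8 kHz
            # Timers start later, once the question prompt has finished playing.
            "startInputTimers": False,
        },
        "settings": {
            "asr": {
                # Assumed names for the no-input and complete timeouts.
                "noInputTimeout": no_input_timeout_ms,
                "completeTimeout": complete_timeout_ms,
            }
        },
    }


# Placeholder callback URL and arbitrary timeout values, for illustration only.
payload = build_transcribe_request("https://example.com/voicegain-callback", 5000, 2000)
print(json.dumps(payload, indent=2))
```

The resulting JSON would be sent as the body of a POST to /asr/transcribe/async together with your Voicegain credentials.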
This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used in the TwiML request.
Note that DTMF detection is currently not possible in transcribe mode. Please let us know if this is something that would be critical to your use case.
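As a small illustration, the websocket URL can be pulled out of the decoded JSON response by following the audio.stream.websocketUrl path. The helper name extract_websocket_url and the sample response below are hypothetical; a real response will contain additional fields.

```python
def extract_websocket_url(response_body):
    """Return the value at audio.stream.websocketUrl in a decoded JSON response."""
    node = response_body
    for key in ("audio", "stream", "websocketUrl"):
        # Fail with a clear message if the expected path is not present.
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"missing '{key}' in transcribe response")
        node = node[key]
    return node


# Hypothetical response containing only the field discussed above.
sample_response = {"audio": {"stream": {"websocketUrl": "wss://example.invalid/session"}}}
websocket_url = extract_websocket_url(sample_response)
```

Raising an error when the field is absent makes a failed session request visible before any TwiML is generated.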
TwiML <Connect><Stream> request
After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:
Some notes about the content of the TwiML request:
- the websocket URL is the one returned by the Voicegain /asr/transcribe/async request
- more than one question prompt is supported; the prompts will be played one after another
- three types of prompts are supported: (1) a recording retrieved from a URL, (2) a TTS prompt (several voices are available), (3) a 'clip:' prompt generated using the Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
- bargeIn is enabled, so prompt playback will stop as soon as the caller starts speaking
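The TwiML described in these notes could be generated along the following lines. <Response>, <Connect>, <Stream> and <Parameter> are standard TwiML elements. Passing the prompts and the bargeIn flag as custom <Parameter> entries, and the parameter names prompt_01, prompt_02, and so on, are assumptions made for this sketch, as is the helper name build_connect_stream_twiml. The prompt URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url, prompts, barge_in=True):
    """Build a <Connect><Stream> TwiML document as a string.

    The parameter names used for prompts and bargeIn are assumptions.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The stream URL is the websocket URL returned by /asr/transcribe/async.
    stream = ET.SubElement(connect, "Stream", url=websocket_url)
    # Prompts are numbered so that they can be played one after another.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(stream, "Parameter", name=f"prompt_{index:02d}", value=prompt)
    ET.SubElement(stream, "Parameter", name="bargeIn", value="true" if barge_in else "false")
    return ET.tostring(response, encoding="unicode")


# Placeholder websocket URL and prompt recordings, for illustration only.
twiml = build_connect_stream_twiml(
    "wss://example.invalid/session",
    ["https://example.com/greeting.wav", "https://example.com/question.wav"],
)
print(twiml)
```

Building the document with an XML library rather than string concatenation ensures that special characters in URLs and prompt text are escaped correctly.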
Returned Transcription Response
Below is an example response from the transcription in the case where "content" : {"full" : ["transcript"]}.