Voicegain adds grammar-based speech recognition to the Twilio Programmable Voice platform via the Twilio Media Streams feature.
Voicegain speech recognition differs from Twilio TwiML <Gather> in two ways:
- Voicegain supports grammars with semantic tags (GRXML or JSGF), while <Gather> is a large-vocabulary recognizer that returns only text, and
- Voicegain is significantly cheaper (we will describe the price difference in an upcoming blog post).
When using Voicegain with Twilio, your application logic will need to handle callback requests from both Twilio and Voicegain.
Each recognition will involve two main steps described below:
Initiating Speech Recognition with Voicegain
This is done by invoking the Voicegain async recognition API: /asr/recognize/async
Below is an example of the payload needed to start a new recognition session:
Some notes about the content of the request:
- startInputTimers tells the ASR to delay starting the timers - they will be started later, when the question prompt finishes playing
- TWIML is set as the streaming protocol, with the format set to PCMU (u-law) and a sample rate of 8 kHz
- asr settings include the three standard timeouts used in grammar-based recognition: no-input, complete, and incomplete
- grammar is set to a GRXML grammar loaded from an external URL
This request, if successful, will return the websocket URL in the audio.stream.websocketUrl field. This value will be used when making the TwiML request.
Note that if the grammar is specified to recognize DTMF, the Voicegain recognizer will recognize DTMF signals included in the audio sent from the Twilio platform.
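As a rough illustration of the request described above, here is a minimal Python sketch that builds such a payload and reads the websocket URL from the response. Only startInputTimers, the TWIML protocol, PCMU at 8 kHz, the three timeouts, the GRXML grammar URL, and the audio.stream.websocketUrl response field come from the notes above. The exact JSON nesting, the other key names (such as "noInputTimeout" or "fromUrl"), the timeout values, and the helper names are assumptions made for illustration, so check them against the Voicegain API reference before use.

```python
import json


def build_recognition_payload(grammar_url, no_input_ms=5000,
                              complete_ms=2000, incomplete_ms=3000):
    """Build a request body for /asr/recognize/async.

    NOTE: the nesting and most key names here are assumptions; only the
    values called out in the post (TWIML, PCMU, 8 kHz, startInputTimers,
    three timeouts, GRXML grammar from a URL) are taken from the source.
    """
    return {
        "audio": {
            # Twilio Media Streams is the streaming protocol.
            "source": {"stream": {"protocol": "TWIML"}},
            "format": "PCMU",   # u-law
            "rate": 8000,       # 8 kHz
            # Timers start later, once the question prompt has finished.
            "startInputTimers": False,
        },
        "settings": {
            "asr": {
                # Placeholder timeout values in milliseconds (assumed).
                "noInputTimeout": no_input_ms,
                "completeTimeout": complete_ms,
                "incompleteTimeout": incomplete_ms,
                "grammars": [
                    {"type": "GRXML", "fromUrl": {"url": grammar_url}}
                ],
            }
        },
    }


def extract_websocket_url(response_body):
    """Read the websocket URL from the audio.stream.websocketUrl field."""
    return response_body["audio"]["stream"]["websocketUrl"]


if __name__ == "__main__":
    payload = build_recognition_payload("https://example.com/grammar.grxml")
    print(json.dumps(payload, indent=2))
```

The payload would be sent as the JSON body of a POST to /asr/recognize/async with your Voicegain credentials, and the parsed JSON response passed to extract_websocket_url.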
TwiML <Connect><Stream> request
After we have initiated a Voicegain ASR session, we can tell Twilio to open a Media Streams connection to Voicegain. This is done by means of the following TwiML request:
Some notes about the content of the TwiML request:
- the websocket URL is the one returned from the Voicegain /asr/recognize/async request
- more than one question prompt is supported - they will be played one after another
- three types of prompts are supported: 1) a recording retrieved from a URL, 2) a TTS prompt (several voices are available), 3) a 'clip:' prompt generated using Voicegain Prompt Manager, which supports dynamic concatenation of prerecorded prompts
- bargeIn is enabled - prompt playback will stop as soon as the caller starts speaking
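To make the shape of such a request concrete, here is a minimal Python sketch that generates a <Connect><Stream> document using only the standard library. The <Response>, <Connect>, <Stream>, and <Parameter> elements are standard TwiML. How Voicegain expects the prompts and the bargeIn setting to be passed is not shown here, so the parameter names "prompt_01", "prompt_02", ... and "bargeIn" as a <Parameter>, along with the helper name, are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_connect_stream_twiml(websocket_url, prompts, barge_in=True):
    """Return a TwiML string that streams call audio to websocket_url.

    NOTE: passing prompts and bargeIn as <Parameter> elements with these
    names is an assumption, not confirmed Voicegain behavior.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})

    # Prompts are listed in order; they are played one after another.
    for index, prompt in enumerate(prompts, start=1):
        ET.SubElement(
            stream,
            "Parameter",
            {"name": f"prompt_{index:02d}", "value": prompt},
        )

    # With barge-in enabled, playback stops when the caller starts speaking.
    ET.SubElement(
        stream,
        "Parameter",
        {"name": "bargeIn", "value": "true" if barge_in else "false"},
    )
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_connect_stream_twiml(
        "wss://example.com/stream",             # URL returned by Voicegain
        ["https://example.com/question.wav"],   # hypothetical prompt
    ))
```

Building the document with an XML library rather than string concatenation ensures that characters such as & in the websocket URL or prompt values are escaped correctly.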
Returned Recognition Response
Below is an example response from the recognition. This response is from the built-in phone grammar.