How to build a Voicebot using Voicegain, Twilio, RASA, and AWS Lambda

The complete code (minus the RASA logic, which you will have to supply yourself) is available in our GitHub repository.

What does it do?

The setup allows you to call a phone number and then interact with a Voicebot that uses RASA as the dialog logic engine.

How does it work?

The Components

Twilio Programmable Voice - We configure a Twilio phone number to point to a TwiML App that has the AWS Lambda function as the callback URL.
AWS Lambda function - a single Node.js function with an API Gateway trigger (simple HTTP API type).
Voicegain STT API - we use the /asr/transcribe/async API with audio input via a websocket stream and output via a callback. The callback goes to the same AWS Lambda function, but the Voicegain callback is a POST while the Twilio callback is a GET.
RASA - the dialog logic is provided by a RASA NLU dialog server, which is accessible over the RestInput API.
AWS S3 - stores the transcription results at each dialog turn.
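
Because Twilio and Voicegain call the same Lambda function with different HTTP methods, the function can tell them apart by inspecting the method. The snippet below is a minimal sketch of that idea, not the code from the repository. It assumes an API Gateway HTTP API event, where the method is found at event.requestContext.http.method (payload format 2.0) or event.httpMethod (format 1.0). The function names callbackSource and handler, and the placeholder responses, are hypothetical.

```javascript
// Decide which service is calling, based only on the HTTP method.
// Returns "twilio" for GET, "voicegain" for POST, and null otherwise.
function callbackSource(event) {
  // Payload format 2.0 first, then fall back to format 1.0.
  const method = event?.requestContext?.http?.method ?? event?.httpMethod;
  if (method === "GET") return "twilio";
  if (method === "POST") return "voicegain";
  return null;
}

// Hypothetical Lambda entry point; the bodies are placeholders only.
async function handler(event) {
  switch (callbackSource(event)) {
    case "twilio":
      // Here the real function would build the TwiML for the next turn.
      return { statusCode: 200, headers: { "Content-Type": "text/xml" }, body: "<Response/>" };
    case "voicegain":
      // Here the real function would store the transcription result.
      return { statusCode: 200, body: "" };
    default:
      return { statusCode: 405, body: "Method Not Allowed" };
  }
}
```

In a Lambda deployment the entry point would be exported, for example with exports.handler = handler.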

November 2021 Update: We do not recommend S3 and AWS Lambda for a production setup. A more up-to-date review of the various options for building a Voice Bot is described here. You should consider replacing the functionality of S3 and AWS Lambda with a web server that is able to maintain state, such as one built with Node.js or Python Flask.
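
As a rough illustration of what keeping state in the web server could look like, the sketch below holds the latest transcription result per call in memory instead of in S3. It is an assumption-laden example, not a recommended design: the class name TurnStore is hypothetical, it assumes each callback can be matched to a call by some identifier (here called callSid), and it only works inside a single long-running process.

```javascript
// Hypothetical in-memory replacement for the S3 bucket, keyed by a call identifier.
class TurnStore {
  constructor() {
    this.results = new Map();
  }

  // Would be called from the Voicegain callback: remember the latest transcript.
  save(callSid, transcript) {
    this.results.set(callSid, transcript);
  }

  // Would be called from the next Twilio callback: return the transcript once, then forget it.
  take(callSid) {
    const transcript = this.results.get(callSid);
    this.results.delete(callSid);
    return transcript;
  }
}

// Usage example with made-up values.
const store = new TurnStore();
store.save("CA-example", "I would like to check my balance");
console.log(store.take("CA-example"));
```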

The Steps

The sequence diagram is provided below. Basically, the sequence of operations is as follows:

1. Call a Twilio phone number.
2. Twilio makes an initial callback to the Lambda function.
3. The Lambda function sends "Hi" to RASA, and RASA responds with the initial dialog prompt.
4. The Lambda function calls Voicegain to start an async transcription session. Voicegain responds with the URL of a websocket for audio streaming.
5. The Lambda function responds to Twilio with a TwiML <Connect><Stream> command to open a Media Stream to Voicegain. The command also contains the text of the question prompt.
6. Voicegain uses TTS to generate an audio prompt from the text of the RASA question and streams it via the websocket to Twilio for playback.
7. The caller hears the prompt and says something in response.
8. Twilio streams the caller audio to Voicegain ASR for speech recognition.
9. Voicegain ASR transcribes the speech to text and makes a callback with the transcription result to the Lambda function.
10. The Lambda function stores the transcription result in S3.
11. Voicegain closes the websocket session with Twilio.
12. Twilio notices the end of the session with ASR and makes a callback to the Lambda function to find out what to do next.
13. The Lambda function retrieves the recognition result from S3 and passes it to RASA.
14. RASA processes the answer and generates the next question in the dialog.
15. The next turn continues in the same way from step 4.
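
Two parts of the sequence above, passing text to RASA and answering Twilio with <Connect><Stream>, can be sketched as small helper functions. This is an illustrative sketch rather than the repository's code. The request shape ({ sender, message }) and the array-of-messages reply are those of RASA's REST channel (POST to /webhooks/rest/webhook). Using the call identifier as the RASA sender, and carrying the prompt in a custom <Parameter> named "prompt", are assumptions made for this example; the source only says that the command contains the prompt text. All function names and the example URL are hypothetical.

```javascript
// Escape text so it is safe inside an XML attribute value.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// JSON body for RASA's REST channel.
// ASSUMPTION: the call identifier is used as the RASA sender id.
function buildRasaRequest(callSid, text) {
  return { sender: callSid, message: text };
}

// The REST channel replies with an array of messages; join the text parts into one prompt.
function promptFromRasaReply(messages) {
  return messages
    .filter((m) => typeof m.text === "string")
    .map((m) => m.text)
    .join(" ");
}

// TwiML that opens a Media Stream to the websocket URL returned by Voicegain.
// ASSUMPTION: the prompt text travels as a custom <Parameter> named "prompt".
function buildStreamTwiml(websocketUrl, promptText) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(websocketUrl)}">`,
    `      <Parameter name="prompt" value="${escapeXml(promptText)}"/>`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Usage example with made-up values.
const exampleReply = [{ recipient_id: "CA-example", text: "Hello, how can I help you?" }];
console.log(JSON.stringify(buildRasaRequest("CA-example", "Hi")));
console.log(buildStreamTwiml("wss://example.invalid/stream", promptFromRasaReply(exampleReply)));
```

Keeping these helpers free of network calls makes them easy to check in isolation; the actual HTTP requests to RASA and Voicegain would wrap around them.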
