The complete code is available in our GitHub repository (minus the RASA logic, which you will have to supply yourself).
What does it do?
The setup allows you to call a phone number and then interact with a Voicebot that uses RASA as the dialog logic engine.
How does it work?
The Components
- Twilio Programmable Voice - We configure a Twilio phone number to point to a TwiML App that has the AWS Lambda function as the callback URL.
- AWS Lambda function - a single Node.js function with an API Gateway trigger (simple HTTP API type).
- Voicegain STT API - we use the /asr/transcribe/async API, with input via a websocket stream and output via a callback. The callback goes to the same AWS Lambda function, but Voicegain callbacks use POST while Twilio callbacks use GET.
- RASA - dialog logic is provided by a RASA NLU dialog server, which is accessible over the RestInput API.
- AWS S3 - stores the transcription results at each dialog turn.
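Because one function serves both callbacks, it has to tell them apart by HTTP method. Here is a minimal sketch of that dispatch. Our actual function is written in Node.js; this Python version is only an illustration, and the names classify_callback and handler are hypothetical. It assumes the API Gateway HTTP API delivers events in payload format 2.0, where the method is found under requestContext.http.method.

```python
from typing import Any, Dict


def classify_callback(event: Dict[str, Any]) -> str:
    """Return which service made the callback, based on the HTTP method."""
    # Assumption: HTTP API payload format 2.0, which carries the method here.
    http_info = event.get("requestContext", {}).get("http", {})
    method = http_info.get("method", "").upper()
    if method == "GET":
        return "twilio"      # Twilio asks what to do next
    if method == "POST":
        return "voicegain"   # Voicegain delivers a transcription result
    return "unknown"


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Hypothetical Lambda entry point that routes on the callback source."""
    source = classify_callback(event)
    if source == "unknown":
        return {"statusCode": 405, "body": "Method not allowed"}
    # A real function would build TwiML or store the transcript here.
    return {"statusCode": 200, "body": f"handled {source} callback"}
```

Keeping the classification in its own small function makes it easy to test without deploying anything.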
November 2021 Update: We do not recommend S3 and AWS Lambda for a production setup. A more up-to-date review of the various options for building a Voice Bot is available here. You should consider replacing the functionality of S3 and AWS Lambda with a web server that is able to maintain state - like Node.js or Python Flask.
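To illustrate what "maintaining state" buys you, here is a minimal sketch of an in-memory store that could take over the role S3 plays in this setup. It is an assumption-laden illustration, not code from our repository: the class name SessionStore and the idea of keying transcripts by a per-call identifier (for example the call ID that Twilio sends with its callbacks) are hypothetical choices.

```python
from typing import Dict, Optional


class SessionStore:
    """Keeps the latest transcript per call in process memory."""

    def __init__(self) -> None:
        self._transcripts: Dict[str, str] = {}

    def save(self, call_id: str, transcript: str) -> None:
        """Record the transcript delivered by the speech-to-text callback."""
        self._transcripts[call_id] = transcript

    def pop(self, call_id: str) -> Optional[str]:
        """Return and remove the transcript, or None if nothing is stored."""
        # Removing on read avoids reusing a stale answer on the next turn.
        return self._transcripts.pop(call_id, None)


store = SessionStore()
store.save("call-1", "I would like to check my balance")
print(store.pop("call-1"))
```

A store like this only works when every callback for a call reaches the same long-running process, which is exactly what a Lambda function cannot guarantee.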
The Steps
The sequence diagram is provided below. Basically, the sequence of operations is as follows:
- Call a Twilio phone number
- Twilio makes an initial callback to the Lambda function
- The Lambda function sends "Hi" to RASA, and RASA responds with the initial dialog prompt
- The Lambda function calls Voicegain to start an async transcription session. Voicegain responds with the URL of a websocket for audio streaming
- The Lambda function responds to Twilio with a TwiML command <Connect><Stream> to open a Media Stream to Voicegain. The command also contains the text of the question prompt.
- Voicegain uses TTS to generate an audio prompt from the text of the RASA question and streams it via the websocket to Twilio for playback
- The Caller hears the prompt and says something in response
- Twilio streams caller audio to Voicegain ASR for speech recognition
- Voicegain ASR transcribes the speech to text and makes a callback with the transcription result to the Lambda function
- The Lambda function stores the transcription result in S3
- Voicegain closes the websocket session with Twilio
- Twilio notices the end of the session with ASR and makes a callback to the Lambda function to find out what to do next
- The Lambda function retrieves the recognition result from S3 and passes it to RASA
- RASA processes the answer and generates the next question in the dialog
- The next turn continues in the same way from step 4.
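The two pieces of glue in these steps, talking to RASA and answering Twilio, can be sketched as follows. This is a Python illustration rather than our Node.js code, and all function names are hypothetical. The RASA part follows the REST channel format (a JSON body with sender and message, and a list of replies that each carry a text field). How the prompt text is passed inside <Connect><Stream> is an assumption here: the sketch uses a <Parameter> element named "prompt", which may differ from what Voicegain actually expects.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def rasa_request_body(sender_id: str, text: str) -> Dict[str, str]:
    """Build the JSON body for RASA's REST channel."""
    return {"sender": sender_id, "message": text}


def prompt_from_rasa(replies: List[Dict[str, str]]) -> str:
    """Join the text of all RASA replies into one prompt string."""
    return " ".join(r["text"] for r in replies if "text" in r)


def build_twiml(websocket_url: str, prompt: str) -> str:
    """Build a TwiML response that opens a Media Stream to the given URL."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": websocket_url})
    # Assumption: the prompt travels as a custom stream parameter.
    ET.SubElement(stream, "Parameter", {"name": "prompt", "value": prompt})
    return ET.tostring(response, encoding="unicode")


# Example turn with made-up values (the websocket URL is a placeholder).
replies = [{"recipient_id": "call-1", "text": "How can I help you?"}]
print(build_twiml("wss://example.invalid/stream", prompt_from_rasa(replies)))
```

Building the XML with ElementTree rather than string concatenation means that characters such as quotes or ampersands in the RASA prompt are escaped automatically.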