If you open WhatsApp and scroll through your recent chats, I bet you’ll see more than just text messages - memes, GIFs, videos … and the star of today’s article: voice notes! 🗣️
When Jesús Copado and I began the Ava project, we knew one thing for sure - Ava needed a voice. Not just any voice, but a unique one. Don’t believe me? Take a listen:
To make that happen, we needed two key systems: STT (speech-to-text) to turn incoming voice notes into text and TTS (text-to-speech) to convert Ava’s replies into audio.
That brings us to our focus today: Ava’s Voice Pipeline.
Ready? Let’s begin! 👇
This is the fourth lesson of the “Ava: The WhatsApp Agent” course. This lesson builds on the theory and code covered in the previous ones, so be sure to check them out if you haven’t already!
Ava’s Voice Pipeline
Like we mentioned before, this pipeline has two main parts - the TTS module and the STT module - which can work together or on their own. Let’s go over each one in the next sections.
STT Module - Whisper👂
Whisper is a key part of Ava’s STT (speech-to-text) module because it helps Ava accurately transcribe voice messages. If you’re not familiar with “Whisper”, it’s an advanced model from OpenAI that can handle multiple languages, different accents, and even background noise - perfect for WhatsApp voice messages! 📱
In the Voice Pipeline diagram, you’ll see we’re using Groq’s hosted Whisper - sorry, we are not hosting Whisper ourselves 😅 - which you can check out here.
To use this model in our code, we’ve created a SpeechToText class inside Ava’s modules. This class handles all the audio transcription logic - you’ll find the magic in the transcribe method in the snippet below! 👀
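Here’s a minimal sketch of what that class could look like, assuming the groq Python SDK, a GROQ_API_KEY environment variable, and Groq’s hosted whisper-large-v3-turbo model - the exact class in the repo may differ in details:

```python
import os

from groq import Groq


class SpeechToText:
    """Transcribes WhatsApp voice notes using Whisper hosted on Groq."""

    def __init__(self, model: str = "whisper-large-v3-turbo"):
        # Assumes GROQ_API_KEY is set in the environment.
        self.client = Groq(api_key=os.environ["GROQ_API_KEY"])
        self.model = model

    def transcribe(self, audio_data: bytes) -> str:
        """Turn raw audio bytes (e.g. an .ogg voice note) into text."""
        transcription = self.client.audio.transcriptions.create(
            # The filename here is just illustrative; WhatsApp voice notes
            # typically arrive as OGG/Opus audio.
            file=("voice_note.ogg", audio_data),
            model=self.model,
        )
        return transcription.text
```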
TTS Module - ElevenLabs 🗣️
Once the audio message is transcribed, it moves to the LangGraph workflow (Ava’s brain, remember? 😉), which takes care of generating a response - using the short/long-term memories, Ava’s activities, etc.
But this response is just text! We need a voice, and … guess who has amazing voices ready to go?
You got it - ElevenLabs is in the house 😎
Like I mentioned at the start, we didn’t want Ava to have just any voice - we wanted something unique. That’s why we’re using ElevenLabs’ custom voice features.
In the end, all we need is an ELEVENLABS_VOICE_ID, which uniquely defines Ava’s voice.
So, to add the voice generation logic to the code, we followed the same approach as the STT module: we created a TextToSpeech class inside Ava’s modules. Check the synthesize method in the snippet below! 👀
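Here’s a rough sketch of that class, assuming the elevenlabs Python SDK (v1+) and the ELEVENLABS_API_KEY / ELEVENLABS_VOICE_ID environment variables; the model_id and output format shown are just example choices, not necessarily what the repo uses:

```python
import os

from elevenlabs.client import ElevenLabs


class TextToSpeech:
    """Turns Ava's text replies into audio using her custom ElevenLabs voice."""

    def __init__(self):
        # Assumes both environment variables are set; ELEVENLABS_VOICE_ID
        # points to Ava's custom voice.
        self.client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
        self.voice_id = os.environ["ELEVENLABS_VOICE_ID"]

    def synthesize(self, text: str) -> bytes:
        """Generate speech for the given text and return it as raw audio bytes."""
        audio_stream = self.client.text_to_speech.convert(
            voice_id=self.voice_id,
            text=text,
            model_id="eleven_flash_v2_5",
            output_format="mp3_44100_128",
        )
        # The SDK streams the audio in chunks; collect them into one buffer.
        return b"".join(audio_stream)
```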
This class gets called from the audio_node, as shown below, generating an audio_buffer that gets stored in LangGraph’s state.
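The sketch below shows one way that node could look. Keep in mind the state schema (AICompanionState) and its field names are assumptions for illustration; the real graph in the repo defines its own state:

```python
from langgraph.graph import MessagesState

from speech_to_text_module import TextToSpeech  # hypothetical import path


class AICompanionState(MessagesState):
    """Hypothetical state: chat messages plus the synthesized audio buffer."""

    audio_buffer: bytes


def audio_node(state: AICompanionState) -> dict:
    """Synthesize the latest reply and store the audio in LangGraph's state."""
    text_to_speech = TextToSpeech()
    response_text = state["messages"][-1].content
    audio_buffer = text_to_speech.synthesize(response_text)
    # The WhatsApp webhook endpoint (Lesson 6) later reads audio_buffer
    # from the state and sends it back to the user as a voice message.
    return {"audio_buffer": audio_buffer}
```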
LangGraph’s state will be picked up by the WhatsApp webhook endpoint (more on that in Lesson 6), turning it into a voice message you’ll get from Ava!
Check out Ava in action - roasting both through text and voice messages! 🤣🤣🤣
And that’s all for today! 🙌
Just a quick reminder - Lesson 5 will be available next Wednesday, March 5th. Plus, don’t forget there’s a complementary video lesson on Jesús Copado’s YouTube channel.
We strongly recommend checking out both resources (written lessons and video lessons) to maximize your learning experience! 🙂