If you open WhatsApp and scroll through your recent chats, I bet you’ll see more than just text messages - memes, GIFs, videos … and the star of today’s article: voice notes! 🗣️
When Jesús Copado and I began the Ava project, we knew one thing for sure - Ava needed a voice. Not just any voice, but a unique one. Don’t believe me? Take a listen:
To make that happen, we needed two key systems: STT (speech-to-text) to turn incoming voice notes into text and TTS (text-to-speech) to convert Ava’s replies into audio.
That brings us to our focus today: Ava’s Voice Pipeline.
Ready? Let’s begin! 👇
This is the fourth lesson of the “Ava: The WhatsApp Agent” course. This lesson builds on the theory and code covered in the previous ones, so be sure to check them out if you haven’t already!
Ava’s Voice Pipeline
Like we mentioned before, this pipeline has two main parts - the TTS module and the STT module - which can work together or on their own. Let’s go over each one in the next sections.
STT Module - Whisper👂
Whisper is a key part of Ava’s STT (speech-to-text) module because it helps Ava accurately transcribe voice messages. If you’re not familiar with “Whisper”, it’s an advanced model from OpenAI that can handle multiple languages, different accents, and even background noise - perfect for WhatsApp voice messages! 📱
In the Voice Pipeline diagram, you’ll see we’re using Groq’s hosted Whisper - sorry, we are not hosting Whisper ourselves 😅 - which you can check out here.
To use this model in our code, we’ve created a SpeechToText class inside Ava’s modules. This class handles all the audio transcription logic - you’ll find the magic in the transcribe method in the snippet below! 👀
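Here’s a minimal sketch of what that class could look like, assuming the groq Python SDK, a GROQ_API_KEY environment variable, and Groq’s hosted whisper-large-v3-turbo model - the exact class in the repo may differ in details:

```python
import os

from groq import Groq


class SpeechToText:
    """Transcribes WhatsApp voice notes using Whisper hosted on Groq."""

    def __init__(self, model: str = "whisper-large-v3-turbo"):
        # Assumes GROQ_API_KEY is set in the environment.
        self.client = Groq(api_key=os.environ["GROQ_API_KEY"])
        self.model = model

    def transcribe(self, audio_data: bytes) -> str:
        """Turn raw audio bytes (e.g. an .ogg voice note) into text."""
        transcription = self.client.audio.transcriptions.create(
            # The filename here is just illustrative; WhatsApp voice notes
            # typically arrive as OGG/Opus audio.
            file=("voice_note.ogg", audio_data),
            model=self.model,
        )
        return transcription.text
```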
TTS Module - ElevenLabs 🗣️
Once the audio message is transcribed, it moves to the LangGraph workflow (Ava’s brain, remember? 😉), which takes care of generating a response - using the short/long-term memories, Ava’s activities, etc.
But this response is just text! We need a voice, and … guess who has amazing voices ready to go?
You got it - ElevenLabs is in the house 😎
Like I mentioned at the start, we didn’t want Ava to have just any voice - we wanted something unique. That’s why we’re using ElevenLabs’ custom voice features.
In the end, all we need is an ELEVENLABS_VOICE_ID, which uniquely defines Ava’s voice.
So, to add the voice generation logic to the code, we followed the same approach as the STT module: we created a TextToSpeech class inside Ava’s modules. Check the synthesize method in the snippet below! 👀
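Here’s a rough sketch of that class, assuming the elevenlabs Python SDK (v1+) and the ELEVENLABS_API_KEY / ELEVENLABS_VOICE_ID environment variables; the model_id and output format shown are just example choices, not necessarily what the repo uses:

```python
import os

from elevenlabs.client import ElevenLabs


class TextToSpeech:
    """Turns Ava's text replies into audio using her custom ElevenLabs voice."""

    def __init__(self):
        # Assumes both environment variables are set; ELEVENLABS_VOICE_ID
        # points to Ava's custom voice.
        self.client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
        self.voice_id = os.environ["ELEVENLABS_VOICE_ID"]

    def synthesize(self, text: str) -> bytes:
        """Generate speech for the given text and return it as raw audio bytes."""
        audio_stream = self.client.text_to_speech.convert(
            voice_id=self.voice_id,
            text=text,
            model_id="eleven_flash_v2_5",
            output_format="mp3_44100_128",
        )
        # The SDK streams the audio in chunks; collect them into one buffer.
        return b"".join(audio_stream)
```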
This class gets called from the audio_node, as shown below, generating an audio_buffer that gets stored in LangGraph’s state.
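The sketch below shows one way that node could look. Keep in mind the state schema (AICompanionState) and its field names are assumptions for illustration; the real graph in the repo defines its own state:

```python
from langgraph.graph import MessagesState

from speech_to_text_module import TextToSpeech  # hypothetical import path


class AICompanionState(MessagesState):
    """Hypothetical state: chat messages plus the synthesized audio buffer."""

    audio_buffer: bytes


def audio_node(state: AICompanionState) -> dict:
    """Synthesize the latest reply and store the audio in LangGraph's state."""
    text_to_speech = TextToSpeech()
    response_text = state["messages"][-1].content
    audio_buffer = text_to_speech.synthesize(response_text)
    # The WhatsApp webhook endpoint (Lesson 6) later reads audio_buffer
    # from the state and sends it back to the user as a voice message.
    return {"audio_buffer": audio_buffer}
```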
LangGraph’s state will be picked up by the WhatsApp webhook endpoint (more on that in Lesson 6), turning it into a voice message you’ll get from Ava!
Check out Ava in action - roasting both through text and voice messages! 🤣🤣🤣
And that’s all for today! 🙌
Just a quick reminder - Lesson 5 will be available next Wednesday, March 5th. Plus, don’t forget there’s a complementary video lesson on Jesús Copado’s YouTube channel.
We strongly recommend checking out both resources (written lessons and video lessons) to maximize your learning experience! 🙂