How to deploy STT and TTS systems to production?

Phone Calling Agents Course | Lesson 3

Dec 03, 2025

In today’s session, we’ll be improving our STT and TTS systems to deliver a more natural, engaging, and human-like user experience.

Welcome to Lesson 3 of the Phone Calling Agents course!

Up until now, we’ve been relying on our dear Moonshine model for STT. And to be fair… it really did try its best. But unless you spoke in immaculate, textbook-perfect English, Moonshine would give us—two proudly accented Spaniards—that puzzled look and basically reply: “Sorry, I didn’t understand”.

Kokoro, on the other hand, has been our little TTS workhorse. For such a tiny model, it pulled more weight than anyone expected.

🤔 But let’s be honest with ourselves: was it realistic? Was it expressive? Did it sound even remotely human?

Not exactly. Functional, yes. Surprisingly good for its size, absolutely. But no one was ever going to mistake it for an actual human voice.

But today … today is special.

Because today’s lesson is all about giving our system a major upgrade. We are retiring the old models with honor and replacing them with true heavyweights:

Whisper for Speech-To-Text
Orpheus 3B for Text-To-Speech

This isn’t just a lesson about swapping one model for another though. This is our favorite lesson for a reason.

We are going to show you how to host your very own STT and TTS models on Runpod. Yes, you heard that right:

By the end of this lesson, you’ll know how to deploy both faster-whisper and Orpheus 3B on a GPU Cloud … like a true AI Engineer.
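To make that a bit more concrete, here is a minimal sketch of what transcribing a file with faster-whisper can look like once the model is running on a GPU box. It is not the code from the course repository: the `transcribe` and `join_segments` helpers, the `large-v3` model size, and the `float16` compute type are our assumptions for illustration.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the .text of each segment into one transcript string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe(audio_path: str, model_size: str = "large-v3") -> str:
    """Transcribe an audio file with faster-whisper (hypothetical helper)."""
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumed settings: a CUDA GPU with float16 weights. On a machine
    # without a GPU, device="cpu" with compute_type="int8" is a common choice.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus an info object;
    # the actual decoding happens while we iterate over the segments.
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return join_segments(segments)
```

Calling `transcribe("call.wav")` (with a hypothetical `call.wav`) would download the model weights on first use and return the transcript as a single string.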

Before we break down the models one by one, here’s a quick teaser of the type of interaction you’ll be able to create once you’ve completed this lesson:

💻 Get the code - Explore the repository to follow the full syllabus and access all the resources for this lesson. Also, make sure you have set up everything from the GETTING_STARTED guide!

📕 Catch up on Lesson 2 - Learn why traditional vector search falls short for our property search use case, and how Superlinked lets us handle complex, multi-attribute queries.
