Best Low-Latency AI Engine for Real-Time Voice Cloning

This is the best low-latency AI speech engine I’ve seen, so I’m going to talk to it and find out how it is built. The best part is that the team has promised to open source everything.

This system is called Unmute. It’s from a company called Kyutai, and you might remember the name from Moshi. I selected the explanation mode to understand how Unmute works.

Real-Time Voice Cloning: System Overview

It’s a modular voice AI. I wanted to understand how it achieves such low latency.

Real-Time Voice Cloning: Why Latency Is Low

Unmute’s low latency comes from a few different places. The speech-to-text model is streaming, which means it processes your audio in small chunks rather than waiting for you to finish speaking. The text LLM’s response is then passed through the text-to-speech model, which also streams and starts speaking before the full response has been generated.
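To make the "small chunks" idea concrete, here is a minimal sketch of chunked processing. It is not Unmute's code: `chunk_audio` and `streaming_transcribe` are hypothetical names, and the "transcription" is only a placeholder string, since the real models are not described here.

```python
from typing import Iterable, Iterator, List


def chunk_audio(samples: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size slices of the audio; the last slice may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def streaming_transcribe(chunks: Iterable[List[float]]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming speech-to-text model.

    It emits one placeholder result per chunk as soon as that chunk arrives,
    instead of waiting for the whole recording.
    """
    for index, chunk in enumerate(chunks):
        yield f"partial-{index} ({len(chunk)} samples)"


# Ten samples split into chunks of four: results arrive chunk by chunk.
audio = [0.0] * 10
partials = list(streaming_transcribe(chunk_audio(audio, 4)))
```

Because both functions are generators, the first partial result is available after the first chunk, which is the property that keeps perceived latency down.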

It’s essentially a cascaded system. Your speech goes to the speech-to-text model, its output goes to the LLM, and finally the LLM’s response goes to the text-to-speech model. Each part is designed to work quickly and efficiently, passing information along as soon as it’s available.
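The cascade can be sketched with three chained Python generators. This is a toy under stated assumptions, not the team's implementation: the stage functions are hypothetical, the "audio" is already text, and the "LLM" just echoes the prompt. The point is only the ordering, which the `events` log records.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which the stages do work


def speech_to_text(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Toy STT: each 'audio chunk' is already a word and is passed on at once."""
    for chunk in audio_chunks:
        events.append(f"stt:{chunk}")
        yield chunk


def llm(words: Iterable[str]) -> Iterator[str]:
    """Toy LLM: reads the user's whole turn, then streams reply tokens."""
    prompt = " ".join(words)
    for token in ["You", "said:"] + prompt.split():
        events.append(f"llm:{token}")
        yield token


def text_to_speech(tokens: Iterable[str]) -> Iterator[str]:
    """Toy TTS: turns each token into 'audio' as soon as it is received."""
    for token in tokens:
        events.append(f"tts:{token}")
        yield f"<audio {token}>"


def run_pipeline(audio_chunks: Iterable[str]) -> List[str]:
    """Chain the three stages and collect the spoken output."""
    events.clear()
    return list(text_to_speech(llm(speech_to_text(audio_chunks))))


spoken = run_pipeline(["hello", "there"])
```

After the run, `events` shows `tts:You` appearing before `llm:there`, meaning the speech stage starts before the reply has been fully generated.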

For a look at fast transcription in practice, see Whisper JAX.

Real-Time Voice Cloning: Voice Conditioning

The text-to-speech model is conditioned on a 10-second voice sample. It can ground the output voice in the input voice you provide, which is how it achieves the voice cloning effect. Changing the voice directly on the fly is something they are working on, but it is a complex process and is not quite ready yet.
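As a rough illustration of what a 10-second sample means in terms of data, here is a small sketch that trims a recording to a reference clip. How Unmute actually prepares its voice sample is not stated, so `trim_reference` and the sample rate used below are assumptions for illustration only.

```python
from typing import List


def trim_reference(samples: List[float], sample_rate: int,
                   seconds: float = 10.0) -> List[float]:
    """Keep at most `seconds` of audio from the start of the recording.

    Hypothetical helper: the real preprocessing is not described in the source.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    max_samples = int(sample_rate * seconds)
    return samples[:max_samples]


# A 25-second recording at an assumed 2,000 samples per second
# is cut down to 10 seconds, i.e. 20,000 samples.
recording = [0.0] * 50_000
reference = trim_reference(recording, sample_rate=2_000)
```

A recording shorter than 10 seconds is returned unchanged.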

If you want a free voice generator to experiment with voice styles, try Meta MMS.

Real-Time Voice Cloning: Open Source Plan

That is the plan: the team intends to open source everything within the next few weeks.

Real-Time Voice Cloning: Trying It Out

I wanted to try the voice cloning and see how it goes. It is really good: latency is low, it can talk in multiple voices, and it can also condition on the input voice. I decided to give it my own voice to see if it works with my accent.

Upload a voice sample. Select the file and confirm. Connect and start a pleasant conversation.

The system greeted me with, “Hey there. What are you up to this Friday evening?” That is my accent and that is how it sounds. My voice sounded like complete crap, which I did not expect.

It’s definitely frustrating when it doesn’t sound right. I think it’s the nature of my voice. Maybe I’m not supposed to be a YouTuber.

Real-Time Voice Cloning: What Stood Out

This is absolutely stunning, and I’m waiting for them to open source it so we can understand the entire stack. It is amazing that you can have three systems connected together: a speech-to-text model, an LLM, and a text-to-speech model that can also condition on your input voice.

Final Thoughts

I have massive respect for the team for promising to open source this, and I’m waiting for the models to be released so that we can integrate them and play with them. I believe this is the best low-latency speech AI system. Even though my voice result was not great, the overall design and responsiveness are impressive.
