Best Low-Latency AI Engine for Real-Time Voice Cloning

This is the best low-latency AI speech engine I have seen, so I am going to talk to it and find out how it is built. The best part is that the team has promised to open source everything.

The system is called Unmute. It comes from Kyutai, a name you might remember from Moshi. I selected the explanation mode and asked it how Unmute works.

Real-Time Voice Cloning: System Overview

Unmute is a modular voice AI. I wanted to understand how it achieves such low latency.

Real-Time Voice Cloning: Why Latency Is Low

Unmute’s low latency comes from a few places. The speech-to-text model is streaming, which means it processes your audio in small chunks rather than waiting for you to finish speaking. The LLM’s text response is passed to the text-to-speech model, which also streams and starts speaking before the full response has been generated.

It is essentially a cascaded system. Your speech goes to the speech-to-text model, the transcript goes to the LLM, and the LLM’s response goes to the text-to-speech model. Each stage is designed to work quickly and to pass information along as soon as it is available.
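To make the cascade concrete, here is a minimal sketch of the idea in Python, using generators so that each stage hands its output on as soon as it exists. It is an illustration only, not Unmute’s code: the function names, the event log, and the toy "models" (which just pass strings around) are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[str, str]  # (stage name, item it just produced)

def speech_to_text(audio_chunks: Iterable[str], log: List[Event]) -> Iterator[str]:
    """Hypothetical streaming STT: emits one word per incoming audio chunk."""
    for chunk in audio_chunks:
        word = chunk.strip().lower()  # stand-in for real recognition
        log.append(("stt", word))
        yield word

def language_model(words: Iterable[str], log: List[Event]) -> Iterator[str]:
    """Hypothetical LLM: reads the transcript, then streams reply tokens."""
    prompt = " ".join(words)  # drains the STT stream as chunks arrive
    for token in f"you said: {prompt}".split():
        log.append(("llm", token))
        yield token  # hand each token on immediately

def text_to_speech(tokens: Iterable[str], log: List[Event]) -> Iterator[str]:
    """Hypothetical streaming TTS: produces audio for each token on arrival."""
    for token in tokens:
        log.append(("tts", token))
        yield f"<audio:{token}>"

def run_pipeline(audio_chunks: Iterable[str]) -> Tuple[List[str], List[Event]]:
    """Chain the three stages and record the order in which they do work."""
    log: List[Event] = []
    stages = text_to_speech(language_model(speech_to_text(audio_chunks, log), log), log)
    return list(stages), log

audio, events = run_pipeline(["Hello ", "there "])
for stage, item in events:
    print(stage, item)
```

In the printed log, the first `tts` entry appears right after the first `llm` entry rather than after the last one. That interleaving is the point: the text-to-speech stage starts on the first token while the LLM is still producing the rest.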

For a look at fast transcription in practice, see Whisper JAX.

Real-Time Voice Cloning: Voice Conditioning

The text-to-speech model is conditioned on a 10-second voice sample. It grounds the output voice in the sample you provide, which is how it achieves the voice cloning effect. Changing the voice directly on the fly is something the team is working on, but it is a complex process and is not ready yet.
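If you are preparing your own sample, a short helper can cut a recording down to its first 10 seconds. This is a sketch using only Python’s standard `wave` module. It assumes the recording is an uncompressed WAV file, which is my assumption and not something the team has specified, and the helper names are made up for the example.

```python
import io
import wave

def first_seconds(wav_bytes: bytes, seconds: float = 10.0) -> bytes:
    """Return a WAV holding at most the first `seconds` of the input WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        # readframes returns fewer frames if the file is shorter than requested
        frames = src.readframes(int(src.getframerate() * seconds))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # the frame count in the header is fixed on close
    return out.getvalue()

def duration_seconds(wav_bytes: bytes) -> float:
    """Length of a WAV in seconds, computed from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

def make_silence(seconds: float, framerate: int = 8000) -> bytes:
    """Build a silent mono 16-bit WAV, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(framerate)
        w.writeframes(b"\x00\x00" * int(seconds * framerate))
    return buf.getvalue()

clip = first_seconds(make_silence(12))
print(duration_seconds(clip))
```

With a real recording you would read the bytes from your file and write the returned bytes to a new one before uploading.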

If you want a free voice generator to experiment with voice styles, try Meta MMS.

Real-Time Voice Cloning: Open Source Plan

Open sourcing is the plan: the team intends to release everything within the next few weeks.

Real-Time Voice Cloning: Trying It Out

I wanted to try the voice cloning and see how it goes. The system is really good: latency is low, it can talk in multiple voices, and it can condition on an input voice. I decided to give it my own voice to see whether it works with my accent.

Upload a voice sample. Select the file and confirm. Connect and start a pleasant conversation.

The system greeted me with, “Hey there. What are you up to this Friday evening?” That is my accent, and that is how it sounds. My cloned voice sounded like complete crap, which I did not expect.

It’s definitely frustrating when it doesn’t sound right. I think it comes down to the nature of my voice. Maybe I’m not supposed to be a YouTuber.

Real-Time Voice Cloning: What Stood Out

This is absolutely stunning, and I’m waiting for them to open source it so we can understand the entire stack. It is amazing that three systems can be connected together and still respond this quickly: speech-to-text, an LLM, and text-to-speech, with the last one also conditioning on your input voice.

Final Thoughts

I have massive respect for the team for promising to open source this, and I’m waiting for the models to be released so that we can integrate them and play with them. I believe this is the best low-latency speech AI system. Even though my own voice result was not great, the overall design and responsiveness are impressive.