Test post

May 12, 2025


Three voices, three price tags: can you spot the one that costs less than a dollar per million characters?

  1. The first voice belongs to a fresh startup that just closed a $64 million Series A and lists its API at roughly $50 per million characters.
  2. The second is an open-weight model you can run for about $0.80 per million characters.
  3. The last voice comes from a company that has raised $281 million in venture funding and bills around $150 per million characters you generate.
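
To make those numbers concrete, here's a quick back-of-the-envelope comparison (the per-million prices come from the list above; the workload size is an assumption for illustration):

```python
# Rough cost comparison for the three price points above.
# The audiobook length is an assumed workload, not a provider figure.
prices_per_million_chars = {
    "startup API": 50.00,
    "open-weight model, self-hosted": 0.80,
    "established provider": 150.00,
}

audiobook_chars = 500_000  # assumption: roughly a 9-10 hour audiobook

for name, price in prices_per_million_chars.items():
    cost = price * audiobook_chars / 1_000_000
    print(f"{name}: ${cost:,.2f}")

# startup API: $25.00
# open-weight model, self-hosted: $0.40
# established provider: $75.00
```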

I don't know about you, but I found this situation interesting, so I decided to explore different models for speech generation.

Text-to-Speech Models

There's a vast variety of text-to-speech models out there. As I dug deeper into the topic, I discovered dozens of proprietary services and open-source neural networks. Quick tests immediately revealed stark differences in generation quality — some models clearly underperformed compared to others. TTS Arena proved especially useful in narrowing down a shortlist of promising candidates — it's a community-driven platform where people vote and directly compare different models.

https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2

Not All TTS Models Are Created Equal

At first glance, the models at the top of that leaderboard all seem to handle the core task reasonably well, but the devil is in the details.

Speech realism: This is a subjective quality that's hard to describe but easy to hear. You probably wouldn't want to listen to a flat, robotic voice that randomly shifts from too loud to too quiet. Natural pauses, breathing, and expressive intonation are critical for making synthetic speech sound human.

Available voices: The number and variety of out-of-the-box voices matter. A palette that includes male, female, emotional, cartoonish, and accented options makes it much easier to differentiate your product.

Prompt-based voice control: Some models allow you to control the tone or mood of the voice using text commands like "excited," "whispering," or "angry." This expands expressiveness and makes the model more versatile in real-world scenarios.
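
For a sense of how this looks in practice, here is a minimal sketch against a hypothetical REST endpoint (the URL, payload fields, and `style` parameter are illustrative assumptions, not any specific provider's API):

```python
import requests

# Hypothetical TTS endpoint; the URL and payload shape are assumptions.
API_URL = "https://api.example-tts.com/v1/synthesize"

def synthesize(text: str, style: str) -> bytes:
    """Request speech audio, steering delivery with a text style prompt."""
    response = requests.post(
        API_URL,
        json={"text": text, "voice": "narrator", "style": style},
        timeout=60,
    )
    response.raise_for_status()
    return response.content  # raw audio bytes, e.g. MP3

# The same sentence, three very different deliveries:
for style in ("excited", "whispering", "angry"):
    audio = synthesize("I can't believe you did that.", style)
    with open(f"take_{style}.mp3", "wb") as f:
        f.write(audio)
```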

Voice cloning: The ability to reproduce a specific person's voice from a short audio clip. This is valuable for personalization, dubbing, or virtual assistants. However, the quality varies — some models require an hour of data, others only need 10 seconds, and the results are very different.
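
The workflow is usually two steps: register a short reference clip, then synthesize with the returned voice ID. A sketch against the same hypothetical API as above (every endpoint and field name here is an assumption):

```python
import requests

API_BASE = "https://api.example-tts.com/v1"  # hypothetical, as above

# Step 1: upload a short reference recording and get back a voice ID.
with open("reference_10s.wav", "rb") as clip:
    resp = requests.post(
        f"{API_BASE}/voices/clone",
        files={"audio": clip},
        data={"name": "my-voice"},
        timeout=120,
    )
resp.raise_for_status()
voice_id = resp.json()["voice_id"]

# Step 2: synthesize arbitrary text with the cloned voice.
resp = requests.post(
    f"{API_BASE}/synthesize",
    json={"text": "Hello from my cloned voice.", "voice": voice_id},
    timeout=60,
)
resp.raise_for_status()
with open("cloned.mp3", "wb") as f:
    f.write(resp.content)
```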

Multilingual support: The ability to speak multiple languages fluently — and ideally, mix them in a single sentence (code-switching). For example, saying an English sentence with French or Japanese words without breaking the flow or pronunciation.

It's also essential to take more technical aspects into account.

Latency: Crucial when generating long audio files or using the model in real-time interfaces. Some models can respond in milliseconds, while others take tens of seconds, even for short fragments.
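
Measuring it is straightforward: time the full round trip for a short sentence (again against the hypothetical endpoint from the sketches above):

```python
import time
import requests

start = time.perf_counter()
resp = requests.post(
    "https://api.example-tts.com/v1/synthesize",  # hypothetical endpoint
    json={"text": "A short test sentence.", "voice": "narrator"},
    timeout=60,
)
resp.raise_for_status()
elapsed = time.perf_counter() - start
print(f"Generated {len(resp.content)} bytes of audio in {elapsed:.2f}s")
```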

Streaming support: The ability to start playback before the full audio is generated. This is essential for chatbots and voice assistants, where every second of latency impacts the user experience.
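
With streaming, the number that matters is time to the first audio chunk rather than total generation time. A sketch of consuming a chunked response with `requests` (the endpoint and its `stream` flag belong to the same hypothetical API as above):

```python
import time
import requests

start = time.perf_counter()
first_chunk_at = None

# stream=True lets us consume audio while the rest is still being generated.
with requests.post(
    "https://api.example-tts.com/v1/synthesize",  # hypothetical endpoint
    json={"text": "A long passage of narration.", "voice": "narrator",
          "stream": True},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    with open("streamed.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter() - start
            f.write(chunk)  # a real app would feed this to the audio player

print(f"First audio arrived after {first_chunk_at:.2f}s")
```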

Open source: Can the model be deployed on your infrastructure? That's important if you need to use custom voices, avoid third-party APIs, or keep your data private.
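
As one example, Kokoro is a popular open-weight model that can run entirely on your own hardware. A minimal sketch based on the `kokoro` package's README at the time of writing (check the current docs, as the API may have changed):

```python
# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # "a" selects American English

text = "Running text-to-speech on your own hardware keeps your data private."

# The pipeline yields (graphemes, phonemes, audio) per generated segment.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"segment_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```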

Other valuable features include adjusting speaking rate, getting timestamps for words and phonemes, and using SSML markup to control pauses, emphasis, volume, and pitch.
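
SSML itself is a W3C standard, though each provider accepts its own subset. The markup looks like this (shown as a Python string; whether a given API accepts these particular tags is provider-specific):

```python
# SSML controlling pauses, emphasis, speaking rate, volume, and pitch.
# Providers implement different subsets of the standard.
ssml = """
<speak>
  Let me think about that. <break time="800ms"/>
  I am <emphasis level="strong">absolutely</emphasis> sure.
  <prosody rate="slow" volume="soft" pitch="-10%">
    But keep this part between us.
  </prosody>
</speak>
"""
```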

Finally, cost. Prices can vary by orders of magnitude depending on the provider, which makes these details even more important when choosing the right TTS solution.
