Introducing Mythic Infinity

Up to 100x Cheaper, High Quality, Realtime Text-to-Speech

  • 10x - 100x Cheaper: as low as 8 cents per hour.
  • Industry-Leading Latency: ~48ms model latency.

Our first model is a text-to-speech model optimized for high quality, low-cost speech generation at scale.

Even under load, audio bytes are delivered faster-than-realtime to support dynamic streaming applications.

Raw model latency is around 48ms, but we trade some latency for lower cost, so most users can expect around 70ms plus network time. That’s faster than you can blink.

As far as I know, for a model with this level of expressiveness, reliability, and naturalness, ours is industry-leading in terms of both first-response latency and pricing.

We are targeting use-cases where massive scale makes current approaches uneconomical, such as AI-powered dynamic gaming and real-time chat. However, our model is useful for all applications of text-to-speech.

The Mythic Journey

I originally set out to create AI-powered video games, but quickly found that the AI text-to-speech models available on the market were far, far too expensive to power the adventures we had imagined (by as much as 100x).

Unfortunately, at the time, there was no open-source alternative good enough to listen to comfortably for hours.

With little experience doing AI research, I cautiously began experimenting to see if this situation could be improved.

What began as a cautious experiment soon became the most demanding challenge of my life. It was an obsessive journey defined by late nights and unforgiving mathematics, but driven by single-minded focus (and maybe a little desperation). It pushed me far beyond what I thought I was capable of.

After 16+ months of R&D, an enormous amount of struggle, and the discovery that I was capable of creating new AI models, I am finally launching our first model.

This is a reliable, high quality text-to-speech model with pricing as low as 8 cents per hour.

Our Groundbreaking Model

I have not attempted to create the best possible model, but instead to create the best possible model within a certain target cost budget.

With intense research I have managed to push the efficiency-performance frontier of state-of-the-art text-to-speech models.

This model is trained from scratch, based on my own developed architecture and training methods.

So far, the model is only mildly optimized.

I expect the speed characteristics of the model (including first-response latency) to improve dramatically over the next 12 months. I am hopeful that this will also translate into further decreases in pricing.

In short, there is still much work to be done, even within the current generation of the model, and a lot of room for improvement. I find this very exciting!

To put it simply, this model is something new and (I believe) amazing.

Unbelievable Pricing

Being able to provide high quality speech at a low price point was my goal from the start.

  • To try it out: 5 minutes of free audio per month.
  • Startup plan ($99 per month): about 10 cents an hour (usage is measured by character, so the per-hour cost will vary by speaker, speaking rate, etc.).

This is within reach for most individuals and startups trying to create something that needs a lot of AI speech.

For smaller projects, our Starter plan starts at $9 per month and still provides audio at a low price of about $1 per hour. A small project can start here and move up to a larger plan as it grows, taking advantage of the lower per-hour price at a moderate level of scale.

For pre-existing larger workloads, we have a Growth plan at $999 per month that provides audio as low as 8 cents per hour!
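As a rough back-of-the-envelope check, the sketch below converts each plan's monthly price and approximate per-hour rate into hours of audio per month. It assumes the whole monthly fee is consumed at the quoted rate, which is a simplification: usage is measured by character, so real figures will vary by speaker and speaking rate. The function name is illustrative only.

```python
# Rough estimate only: assumes the whole monthly fee is consumed at the
# quoted approximate per-hour rate. Actual usage is measured per character.
PLANS = {
    "Starter": {"monthly_usd": 9, "usd_per_hour": 1.00},
    "Startup": {"monthly_usd": 99, "usd_per_hour": 0.10},
    "Growth": {"monthly_usd": 999, "usd_per_hour": 0.08},
}


def approx_hours_per_month(monthly_usd: float, usd_per_hour: float) -> float:
    """Hours of audio a monthly fee would buy at a flat per-hour rate."""
    return monthly_usd / usd_per_hour


if __name__ == "__main__":
    for name, plan in PLANS.items():
        hours = approx_hours_per_month(plan["monthly_usd"], plan["usd_per_hour"])
        print(f"{name}: roughly {hours:,.0f} hours of audio per month")
```

Under these assumptions, the Startup plan works out to several hundred hours of audio per month, which is the scale that dynamic games and real-time chat tend to need.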

If you are a large enterprise and want to inquire about a custom plan, contact us and we will respond shortly.

How far can it go?

This release represents a significant stride in solving the text-to-speech cost problem, but in the future I would like to bring this all the way down to 1 cent per hour (and perhaps even further).

I’m confident that this can be done.

I believe that bringing down the cost of high quality AI-powered speech will expand its available use-cases in enormous ways and make possible the creation of whole new varieties of applications.

Examples

Highly Expressive and Natural

The model is capable of highly expressive speech. Adjusting the consistency parameter can help with this.

The man took a deep breath and turned his head slowly. His mind moved as through molasses... the realization coming only a drop at a time. Everything... yes, everything... where had it gone?
Listen! This isn't just a passing moment... This is important! Look deep... what do you feel?

Tongue Twisters

The model has remarkable reliability and text coherence.

Admittedly, at this level of difficulty (especially for repeating text, like the last example here), the result varies somewhat by voice. Some voices get it right every time, and some only get it right sometimes. Adjusting voice options can also help.

Six slippery snails slid slowly seaward.
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
I thought a thought, but the thought I thought wasn't the thought I thought I thought.

Accents

The model can speak in a wide range of accents.

USA Southern
British
Hispanic
Indian

Voice Cloning

The quality of a voice clone varies wildly depending on the voice.

For this reason, we are launching without the ability to do voice cloning.

We have a pretty good idea where we need to make improvements here and are planning to iterate and enable this feature in a future release.

API Access

Access to the model through the API is available to both free and paid users.

We are launching with a full-featured Python client.

  • Streaming audio bytes and non-streaming both supported.
  • Full IDE support with autocomplete, type-hinting, and in-code documentation.
  • Async/await and standard sync code both supported.
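To make the difference between these modes concrete, here is a minimal sketch of the three consumption patterns. The `fake_stream` and `fake_stream_async` functions are hypothetical stand-ins that yield dummy bytes; they are not the real client API, whose actual calls are described in the docs.

```python
import asyncio
from typing import AsyncIterator, Iterator

# NOTE: fake_stream and fake_stream_async are hypothetical stand-ins that
# yield dummy bytes. They are NOT the real client API; see the docs for that.


def fake_stream(n_chunks: int = 3) -> Iterator[bytes]:
    for _ in range(n_chunks):
        yield b"\x00" * 1024


async def fake_stream_async(n_chunks: int = 3) -> AsyncIterator[bytes]:
    for _ in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield b"\x00" * 1024


def collect(chunks: Iterator[bytes]) -> bytes:
    """Non-streaming style: wait for everything, then return one buffer."""
    return b"".join(chunks)


def consume_streaming(chunks: Iterator[bytes]) -> int:
    """Streaming style: handle each chunk as soon as it arrives."""
    total = 0
    for chunk in chunks:
        total += len(chunk)  # e.g. hand the chunk to an audio player here
    return total


async def consume_streaming_async(chunks: AsyncIterator[bytes]) -> int:
    """Same as consume_streaming, but for async/await code."""
    total = 0
    async for chunk in chunks:
        total += len(chunk)
    return total


if __name__ == "__main__":
    print(len(collect(fake_stream())))
    print(consume_streaming(fake_stream()))
    print(asyncio.run(consume_streaming_async(fake_stream_async())))
```

Streaming matters most for interactive use: playback can begin on the first chunk instead of waiting for the whole clip.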

Read the docs here.

Output formats

We support wav, mp3, webm_opus (opus encoded audio in a webm container), and pcm (similar to wav but without the headers) output formats. See the docs for more info.
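To illustrate how pcm relates to wav, the sketch below wraps raw PCM bytes in a WAV container using Python's standard `wave` module. The sample rate, channel count, and sample width used here are assumptions for illustration, not documented values; check the docs for the actual parameters of the pcm output.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions for illustration;
    use the values documented for the API's pcm output.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    silence = b"\x00\x00" * 24000  # one second of 16-bit mono silence
    wav_bytes = pcm_to_wav(silence)
    print(len(wav_bytes) - len(silence), "header bytes added")
```

Because pcm carries no header, the receiver has to know the audio parameters in advance. That makes it convenient for feeding an audio device chunk by chunk, while wav is easier to save and open in standard tools.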

Configurability

We currently expose one lever for adjusting the model’s output behavior.

Consistency: Higher values tend to follow the text more reliably and pronounce words more accurately, but they alter the way the speech is delivered and may increase the speaking rate.

What’s next?

This journey has led me to see how far AI has yet to go, and all of the amazing possibilities that are on the near and far horizon.

I will be researching and training more cutting-edge models that I believe will be disruptive and amazingly beneficial to creators at large.

Right now, I still have my eyes on audio and you can expect more groundbreaking models on the way.

Also, I soon plan to revisit my original idea and build an unbelievable AI-powered video game.

The Journey Ahead

As we embark on this fantastical journey, we invite you to join the adventure!

Sign up to try out the model for yourself and receive the latest updates.

Get Started for Free

We’re Looking for Talent

If you’re interested in pushing the boundaries of what is possible with AI, head over to our Jobs Page and reach out.