IndexTTS, a GPT-style text-to-speech (TTS) model based on XTTS and Tortoise, has been officially released by Bilibili (B站). The system can correct the pronunciation of Chinese characters and uses punctuation marks to control pauses at any position in the text, which makes the synthesized speech sound more natural and fluent. The release has drawn considerable attention.

Trained on tens of thousands of hours of data, IndexTTS outperforms popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS in the team's evaluations. Several system modules have been improved, particularly the representation of speaker condition features and the optimization of audio quality. For Chinese, IndexTTS uses hybrid modeling of characters and pinyin, so a mispronounced character can be corrected quickly.

The model uses a Conformer-based conditioning encoder and a BigVGAN2-based speech decoder, which improve training stability, voice timbre similarity, and audio quality. The team has submitted a paper to arXiv and plans to release the model parameters and code in the coming weeks. IndexTTS will also provide several test sets, including one for polysyllabic words and sets for subjective and objective evaluation, so that researchers can analyze the system in depth.

IndexTTS performed well in multiple evaluations, outperforming many peer models on word error rate (WER) and speaker similarity (SS) in particular. In Mandarin Chinese tests, for instance, IndexTTS achieved a WER of only 1.3%, lower than the other models compared, which indicates robust and accurate synthesis. Its Mean Opinion Score (MOS) for audio quality reached 4.01, reflecting good sound quality and timbre.
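
For readers unfamiliar with the metric, WER is the edit distance between a reference transcript and the transcript recognized from the synthesized audio, divided by the number of reference tokens. The sketch below is illustrative only and is not IndexTTS's evaluation code: the function names and sample sentences are made up for this example, and scoring Mandarin one character at a time is an assumption that the article does not confirm.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (token missing from hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# English: tokens are whitespace-separated words.
ref_en = "the quick brown fox jumps".split()
hyp_en = "the quick brown box jumps".split()
wer_en = error_rate(ref_en, hyp_en)  # one substitution over five words

# Mandarin (assumed convention): tokens are individual characters.
ref_zh = list("今天天气很好")
hyp_zh = list("今天天汽很好")
err_zh = error_rate(ref_zh, hyp_zh)  # one substitution over six characters

print(f"English WER: {wer_en:.3f}")
print(f"Mandarin character-level rate: {err_zh:.3f}")
```

Under this definition, a score of 1.3% means roughly 13 token errors per 1,000 reference tokens.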

The release of IndexTTS marks a notable step forward for text-to-speech technology as application scenarios continue to expand. Readers who want more details or technical support can contact the IndexTTS team.

Project: https://github.com/index-tts/index-tts

Key Highlights:

🌟 IndexTTS is a GPT-style TTS model based on XTTS and Tortoise, capable of correcting character pronunciation and controlling pauses.

📊 Trained on tens of thousands of hours of data, the system surpasses many existing popular TTS systems, demonstrating industry-leading performance.

🔍 IndexTTS scores well in multiple evaluations, with a lower word error rate and better audio quality than the models it was compared against.