As artificial intelligence develops rapidly, speech synthesis technology is attracting growing attention. Recently, a new speech synthesis model named Kokoro was officially released on the Hugging Face platform. With only 82 million parameters, this compact model marks an important milestone in the field of speech synthesis.

Kokoro v0.19 ranked first on the TTS (Text-to-Speech) leaderboard in the weeks leading up to its release, outperforming models with far more parameters. It achieved results comparable to XTTS v2 (467M parameters) and MetaVoice (1.2B parameters) while being trained on less than 100 hours of mono audio. This result suggests that the link between a speech synthesis model's performance and its parameter count, compute budget, and data volume may be looser than previously assumed.

To use the model, users only need to run a few lines of code in Google Colab to load the model and a voicepack and generate high-quality audio. Kokoro currently supports American English and British English, with multiple voicepacks to choose from.
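As a rough illustration of that Colab workflow, the sketch below follows the interface described on the hexgrad/Kokoro-82M model card (the `build_model` and `generate` helpers, the `voices/*.pt` voicepack files, and 24 kHz mono output); those names come from the v0.19 model repository rather than a pip package and may change between releases, so treat them as assumptions. The standard-library WAV writer is included so the generated waveform can be saved without extra dependencies.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file.

    Kokoro outputs 24 kHz mono audio, hence the default sample rate.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        frames = b"".join(
            # Clamp, scale to int16 range, and pack little-endian.
            struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
            for s in samples
        )
        wf.writeframes(frames)


def synthesize(text, voice="af"):
    """Generate a 24 kHz mono waveform for `text` with a Kokoro voicepack.

    The imports below come from the Kokoro-82M model repository as shown
    in its Colab snippet (an assumption; verify against the model card).
    """
    import torch
    from models import build_model   # from the model repo, not PyPI
    from kokoro import generate      # likewise from the model repo

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = build_model("kokoro-v0_19.pth", device)
    voicepack = torch.load(f"voices/{voice}.pt", weights_only=True).to(device)
    # First letter of the voice name selects the language:
    # 'a' = American English, 'b' = British English.
    audio, phonemes = generate(model, text, voicepack, lang=voice[0])
    return audio
```

In Colab, calling `save_wav(synthesize("Hello from Kokoro."), "out.wav")` would then produce a playable WAV file.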

Kokoro was trained on A100 80GB VRAM instances rented from Vast.ai, chosen for their relatively low hourly cost, which kept training efficient. The entire model was trained for fewer than 20 epochs on under 100 hours of audio. The training data consisted of public-domain audio and audio under other open licenses, ensuring data compliance.

Despite its strong performance in speech synthesis, Kokoro currently does not support voice cloning, owing to limitations of its training data and architecture. The training data consists mainly of long-form reading and narration rather than dialogue.

Model: https://huggingface.co/hexgrad/Kokoro-82M

Experience: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

Key Highlights:

🌟 Kokoro-82M is a newly released speech synthesis model with 82 million parameters, supporting various voice packages.  

🎤 The model excels in the TTS field, having ranked first on the leaderboard while being trained on fewer than 100 hours of audio data.  

📊 The training of the Kokoro model utilized open-licensed data to ensure compliance, although some functional limitations still exist.