NetEase Youdao Launches Open Source Speech Synthesis Engine 'EmotiVoice' (Yimosheng), Supporting Over 2,000 Voices

Sesame's newly released Conversational Speech Model (CSM) has recently sparked heated discussion on X, where it is being lauded as a voice model that sounds "just like a real person." Its naturalness and emotional expressiveness make it difficult for listeners to distinguish from human speech, and Sesame claims it has overcome the uncanny valley effect in voice technology. As demonstration videos and user feedback spread, CSM is quickly emerging as a leader in AI voice technology.
Recently, Meta AI open-sourced SPIRIT LM, a foundational multimodal language model that can freely mix text and speech, opening new possibilities for multimodal tasks spanning audio and text. SPIRIT LM starts from a pre-trained 7-billion-parameter text language model and is extended to the speech modality by continually training it on interleaved text and speech units. It can understand and generate text like a large text model, can likewise understand and generate speech, and can even mix the two within a single sequence to create varied forms of expression.
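To make the interleaving idea concrete, here is a minimal Python sketch of how a single training sequence might alternate between text spans and discrete speech units. The modality markers ("[TEXT]", "[SPEECH]") and unit names ("Hu42", ...) are illustrative assumptions for this sketch, not SPIRIT LM's actual vocabulary or API.

```python
# A minimal sketch of the word-level interleaving idea behind SPIRIT LM:
# one token stream alternates between text spans and speech-unit spans,
# so a single language model can be trained on both modalities at once.
# All token names ("[TEXT]", "[SPEECH]", "Hu42", ...) are illustrative.

def interleave(words, speech_units, speech_positions):
    """Build one training sequence mixing text words and speech units.

    words            -- transcript split into words
    speech_units     -- per-word lists of discrete speech tokens
    speech_positions -- indices of words rendered as speech instead of text
    """
    seq = []
    for i, word in enumerate(words):
        if i in speech_positions:
            seq.append("[SPEECH]")
            seq.extend(speech_units[i])   # discrete units for this word
        else:
            seq.append("[TEXT]")
            seq.append(word)              # the word itself, as text
    return seq

words = ["the", "cat", "sat"]
units = [["Hu12", "Hu7"], ["Hu42", "Hu42", "Hu9"], ["Hu3"]]
print(interleave(words, units, speech_positions={1}))
# ['[TEXT]', 'the', '[SPEECH]', 'Hu42', 'Hu42', 'Hu9', '[TEXT]', 'sat']
```

Because the model sees both modalities in one vocabulary, generation can cross modality boundaries mid-sequence, which is what enables the free mixing of text and speech described above.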
Recently, Oute AI released a novel text-to-speech model called OuteTTS-0.1-350M. It uses pure language modeling, with no external adapters or complex auxiliary architectures, offering a simplified approach to TTS. OuteTTS-0.1-350M is based on the LLaMa architecture and uses WavTokenizer to generate audio tokens directly, which streamlines the synthesis pipeline. The model supports zero-shot voice cloning from only a few seconds of reference audio.
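As a rough illustration of this pure language-modeling pipeline, the sketch below shows the flow from text prompt to LM-generated audio tokens to WavTokenizer decoding, with zero-shot cloning modeled as prepending reference-audio tokens. The `TinyLM` and `TinyWavTokenizer` classes are stand-ins invented for this sketch, not the real `outetts` package API.

```python
# Conceptual sketch of an OuteTTS-style pipeline: one LLaMa-style language
# model emits discrete audio tokens, and a WavTokenizer-style codec decodes
# them into a waveform. All classes and token formats here are stand-ins.

class TinyLM:
    """Stand-in for the 350M LLaMa-style model; real weights would go here."""
    def generate(self, prompt_tokens, max_new_tokens=256):
        # A real model autoregressively samples audio-token ids conditioned
        # on the text (and optional reference-audio) prompt.
        return [f"<a{i % 512}>" for i in range(max_new_tokens)]

class TinyWavTokenizer:
    """Stand-in for WavTokenizer's decoder (audio tokens -> waveform)."""
    def decode(self, audio_tokens, sample_rate=24_000):
        # Assumes ~75 tokens per second of audio, i.e. 320 samples per token.
        return [0.0] * (len(audio_tokens) * sample_rate // 75)

def synthesize(text, reference_audio_tokens=None):
    # Zero-shot cloning idea: prepend a few seconds of the reference
    # speaker's audio tokens so generation continues in that voice.
    prompt = ["<text>"] + text.split() + ["</text>", "<audio>"]
    if reference_audio_tokens:
        prompt += reference_audio_tokens
    lm, codec = TinyLM(), TinyWavTokenizer()
    return codec.decode(lm.generate(prompt))

wav = synthesize("Hello from a pure language-model TTS sketch.")
print(len(wav), "samples")  # 81920 samples (~3.4 s at 24 kHz)
```

The appeal of this design is that synthesis becomes ordinary next-token prediction: there is no separate acoustic model or vocoder network to train, only a language model plus the codec's decoder.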