CosyVoice 2
Scalable streaming voice synthesis technology powered by large language models.
Common Product, Productivity, Voice Synthesis, Streaming
CosyVoice 2 is a speech synthesis model developed by the SpeechLab@Tongyi team at Alibaba Group. It builds on supervised discrete speech tokens and combines two popular generative models, language models (LMs) and flow matching, to achieve high naturalness, content consistency, and speaker similarity. It targets multimodal large language model (LLM) applications, where the response latency and real-time factor of speech synthesis are crucial to the interactive experience. CosyVoice 2 improves the codebook utilization of its speech tokens through finite scalar quantization, simplifies the text-to-speech language model architecture, and introduces a chunk-aware causal flow matching model that adapts to various synthesis scenarios. Trained on a large-scale multilingual dataset, it achieves human-parity synthesis quality with extremely low response latency and real-time performance.
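To make the quantization step above more concrete, here is a minimal sketch of the general idea behind finite scalar quantization: each coordinate of a low-dimensional vector is bounded and rounded to one of a few integer levels, and the rounded coordinates are combined into a single token id. Because every combination of levels is a valid code, no codebook entry is left unreachable by construction. The function names, the tanh bounding, the level count `k`, and the index layout are illustrative assumptions, not CosyVoice 2's actual implementation.

```python
import math

def fsq_quantize(z, k):
    """Bound each coordinate to (-k, k) with tanh, then round to the nearest integer.

    Hypothetical helper: the result has values in {-k, ..., k}, i.e. 2k+1 levels per dimension.
    """
    return [int(round(k * math.tanh(x))) for x in z]

def fsq_index(q, k):
    """Map quantized coordinates to one token id in [0, (2k+1)**D).

    Each coordinate is shifted to {0, ..., 2k} and treated as a digit in base 2k+1.
    """
    base = 2 * k + 1
    return sum((v + k) * base ** j for j, v in enumerate(q))

# Example: a 3-dimensional vector with 3 levels per dimension (k=1) gives 27 possible tokens.
z = [0.1, -2.5, 0.9]
q = fsq_quantize(z, k=1)
token_id = fsq_index(q, k=1)
print(q, token_id)
```

With `k=1` and three dimensions the sketch yields 3^3 = 27 distinct token ids; a real system would choose the dimension and level count to reach the codebook size it needs.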
CosyVoice 2 Visits Over Time
Monthly Visits: 8422
Bounce Rate: 61.45%
Pages per Visit: 1.5
Visit Duration: 00:00:46