CosyVoice 2

Scalable streaming voice synthesis technology powered by large language models.

CosyVoice 2 is a streaming speech synthesis model developed by Alibaba Group's SpeechLab@Tongyi team. It builds on supervised discrete speech tokens and combines two popular generative models, language models (LMs) and flow matching, to achieve high naturalness, content consistency, and speaker similarity. Speech synthesis matters for multimodal large language models (LLMs), where response latency and real-time factor largely determine the quality of the interactive experience. CosyVoice 2 improves the codebook utilization of speech tokens through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and introduces a chunk-aware causal flow matching model that adapts to various synthesis scenarios. Trained on large-scale multilingual datasets, it achieves human-parity synthesis quality with very low response latency and real-time performance.
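To make the quantization step more concrete, here is a minimal sketch of finite scalar quantization in its generic form. It is not CosyVoice 2's actual implementation: the function names, the tanh bounding, and the default of k=1 (three levels per dimension) are assumptions chosen for illustration. Each dimension of a low-dimensional vector is bounded to [-k, k] and rounded to an integer, and the rounded levels are read as digits of a base-(2k + 1) number to form a token index. Because every combination of levels is a valid code, there is no learned codebook with entries that can go unused.

```python
import math


def fsq_quantize(vector, k=1):
    """Bound each value to (-k, k) with tanh, then round to the nearest integer level."""
    return [round(k * math.tanh(v)) for v in vector]


def fsq_index(levels, k=1):
    """Combine per-dimension levels into one token index in base (2k + 1)."""
    base = 2 * k + 1
    # Shift each level from [-k, k] to [0, 2k] so the index is non-negative.
    return sum((level + k) * base**j for j, level in enumerate(levels))


def codebook_size(dims, k=1):
    """Number of distinct tokens: every combination of levels is a valid code."""
    return (2 * k + 1) ** dims


# Hypothetical 3-dimensional projection of one speech frame.
levels = fsq_quantize([2.0, -2.0, 0.1], k=1)
token = fsq_index(levels, k=1)
```

With three dimensions and k=1, the implicit codebook has 27 entries, and the example frame above maps to one of them.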

CosyVoice 2 Visit Over Time

Monthly Visits: 8422
Bounce Rate: 61.45%
Pages per Visit: 1.5
Visit Duration: 00:00:46
