Alibaba's Tongyi Lab recently open-sourced FunAudioLLM, a large-scale voice model project aimed at enhancing natural voice interaction between humans and large language models (LLMs). The project consists of two core models: SenseVoice and CosyVoice.

CosyVoice focuses on natural speech generation, with multilingual support and control over timbre and emotion. It excels at multilingual speech generation, zero-shot voice cloning, cross-lingual synthesis, and instruction following. Trained on 150,000 hours of data, it supports Chinese, English, Japanese, Cantonese, and Korean, can rapidly clone voice timbres, and provides fine-grained control over emotion and prosody.
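To give a sense of the interface, here is a minimal synthesis sketch following the quick-start published in the CosyVoice repository. The model directory, the built-in speaker ID '中文女', and the dict-style return value are assumptions based on the project's released examples and may differ between versions:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice

# Model directory as laid out by the ModelScope download (path is illustrative).
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')

# Synthesize with a built-in speaker; the result dict holds the waveform tensor.
output = cosyvoice.inference_sft('你好，我是通义生成式语音大模型。', '中文女')
torchaudio.save('sft_output.wav', output['tts_speech'], 22050)
```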

SenseVoice is dedicated to high-accuracy multilingual speech recognition, emotion recognition, and audio event detection. Trained on 400,000 hours of data, it supports more than 50 languages and achieves recognition accuracy surpassing the Whisper model, with over 50% improvement on Chinese and Cantonese. SenseVoice also offers emotion recognition and sound event detection, along with fast inference.
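A minimal recognition sketch via the FunASR toolkit, which the SenseVoice repository builds on. The ModelScope model ID and the `generate` arguments follow the project's published example and are assumptions that may change across releases:

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Load SenseVoice through FunASR (model ID as listed on ModelScope).
model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)

# "auto" lets the model identify the language; use_itn enables punctuation
# and inverse text normalization. Emotion and event labels appear as tags
# in the raw output; the postprocessor renders them into readable text.
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
```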


FunAudioLLM supports a range of human-computer interaction scenarios, such as multilingual translation, emotional voice conversations, interactive podcasts, and audiobook narration. By chaining SenseVoice, an LLM, and CosyVoice, it enables seamless speech-to-speech translation, emotional voice chat applications, and interactive podcast radio stations.
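To make the chaining concrete, here is a minimal speech-to-speech translation sketch. It reuses the SenseVoice and CosyVoice calls shown above; `translate_with_llm` is a hypothetical placeholder for whichever LLM you pair with the two models, and the '英文女' speaker ID assumes the SFT model's built-in English voice:

```python
import torchaudio
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from cosyvoice.cli.cosyvoice import CosyVoice

asr = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)
tts = CosyVoice('pretrained_models/CosyVoice-300M-SFT')

def translate_with_llm(text: str, target_lang: str) -> str:
    """Hypothetical hook: call whichever LLM you pair with the two models."""
    raise NotImplementedError

def speech_to_speech(audio_path: str, target_lang: str = "English") -> None:
    # 1. SenseVoice: speech -> transcript (language auto-detected).
    res = asr.generate(input=audio_path, language="auto", use_itn=True)
    transcript = rich_transcription_postprocess(res[0]["text"])
    # 2. LLM: translate the transcript into the target language.
    translated = translate_with_llm(transcript, target_lang)
    # 3. CosyVoice: speak the translation with a built-in English voice.
    output = tts.inference_sft(translated, '英文女')
    torchaudio.save('translated.wav', output['tts_speech'], 22050)
```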

On the technical side, CosyVoice is built on quantized speech token encoding, which underpins its natural, fluent speech generation, while SenseVoice provides a comprehensive set of speech processing capabilities, including automatic speech recognition, language identification, emotion recognition, and audio event detection.

The models and code have been open-sourced on ModelScope and Hugging Face, with training, inference, and fine-tuning code available on GitHub. Both CosyVoice and SenseVoice offer online demos on ModelScope, making it easy for users to try these voice technologies directly.
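For example, the released checkpoints can be fetched locally with the ModelScope Python SDK. The model IDs below follow the project's ModelScope listings; verify the exact IDs on the model pages:

```python
from modelscope import snapshot_download

# Download each model into a local directory for offline use.
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/SenseVoiceSmall', local_dir='pretrained_models/SenseVoiceSmall')
```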

Project Address: https://github.com/FunAudioLLM