MinMo
MinMo is a multimodal large language model designed for seamless voice interaction.
MinMo, developed by Alibaba Group's Tongyi Laboratory, is a multimodal large language model with approximately 8 billion parameters, focused on seamless voice interaction. It is trained on 1.4 million hours of diverse speech data through successive alignment stages: speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and full-duplex interaction alignment.

MinMo achieves state-of-the-art performance across a range of speech understanding and generation benchmarks while preserving the capabilities of text-based large language models. It supports full-duplex dialogue, enabling simultaneous two-way communication between the user and the system.

In addition, MinMo introduces a novel, streamlined voice decoder that surpasses previous models in speech generation. Its instruction-following ability has been enhanced to control voice generation from user instructions, covering attributes such as emotion, dialect, and speaking rate, as well as mimicking specific voices.

MinMo's speech-to-text latency is approximately 100 milliseconds; its theoretical full-duplex latency is around 600 milliseconds, with measured latency around 800 milliseconds. MinMo aims to overcome the major limitations of earlier multimodal models and to give users a more natural, fluid, and human-like voice interaction experience.
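To make the reported latency figures concrete, a minimal sketch of the latency budget is shown below. The constant values come from the description above; the notion of "overhead" (measured minus theoretical full-duplex latency) is an illustrative calculation, not a metric defined by the MinMo authors.

```python
# Latency figures reported for MinMo (milliseconds), taken from the
# description above. "Overhead" here is an illustrative derived value.
SPEECH_TO_TEXT_MS = 100          # speech-to-text latency
THEORETICAL_DUPLEX_MS = 600      # theoretical full-duplex latency
MEASURED_DUPLEX_MS = 800         # measured full-duplex latency

# Extra delay observed in practice beyond the theoretical budget.
overhead_ms = MEASURED_DUPLEX_MS - THEORETICAL_DUPLEX_MS

print(f"Speech-to-text latency:   {SPEECH_TO_TEXT_MS} ms")
print(f"Full-duplex (theoretical): {THEORETICAL_DUPLEX_MS} ms")
print(f"Full-duplex (measured):    {MEASURED_DUPLEX_MS} ms")
print(f"Overhead vs. theory:       {overhead_ms} ms")
```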
MinMo Visit Over Time

Monthly Visits: 38,651
Bounce Rate: 65.83%
Pages per Visit: 1.5
Visit Duration: 00:01:54