On March 27th, the Alibaba Cloud Qwen team announced Qwen2.5-Omni, the new-generation end-to-end multimodal flagship model in the Qwen family. The model is designed for comprehensive multimodal perception: it seamlessly handles inputs in text, image, audio, and video form, and simultaneously generates both text and natural synthesized speech as real-time streaming responses.
Qwen2.5-Omni adopts the innovative Thinker-Talker architecture, an end-to-end design that supports cross-modal understanding of text, images, audio, and video and generates text and natural speech responses in a streaming manner. The Thinker module acts like a brain: it processes the multimodal inputs to produce high-level semantic representations and the corresponding text. The Talker module, like a vocal organ, receives the semantic representations and text output by the Thinker in real time and smoothly synthesizes discrete speech units from them. The model also introduces a novel positional encoding technique, TMRoPE (Time-aligned Multimodal RoPE), which aligns video and audio inputs along a shared time axis to achieve precise synchronization.
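The time-alignment idea behind TMRoPE can be illustrated with a toy sketch: tokens from different modalities are placed on a shared time axis, so audio and video tokens that occur at the same moment receive the same temporal position. The sample rates, tick resolution, and function below are hypothetical placeholders for illustration only, not the model's actual implementation.

```python
# Toy sketch of time-aligned temporal position IDs, in the spirit of TMRoPE.
# All rates and the tick resolution are illustrative assumptions.

def time_aligned_positions(audio_rate_hz, video_rate_hz, duration_s, ticks_per_s=100):
    """Assign each audio and video token a position on a shared time axis.

    Tokens from different modalities that fall on the same timestamp get the
    same temporal position, letting the model align the two streams.
    """
    audio_pos = [round(i / audio_rate_hz * ticks_per_s)
                 for i in range(int(duration_s * audio_rate_hz))]
    video_pos = [round(i / video_rate_hz * ticks_per_s)
                 for i in range(int(duration_s * video_rate_hz))]
    return audio_pos, video_pos

# One second of input: 50 audio tokens/s vs. 2 video frames/s (hypothetical).
audio_pos, video_pos = time_aligned_positions(50, 2, 1.0)
```

Here the second video frame (at t = 0.5 s) lands on the same temporal position as the audio token at t = 0.5 s, which is the alignment property the shared time axis provides.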
The model excels at real-time audio-video interaction, supporting chunked input and instant output for fully real-time conversation. In the naturalness and stability of speech generation, Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives. In full-modality performance, Qwen2.5-Omni outperforms single-modality models of comparable size: its audio capabilities surpass the similarly sized Qwen2-Audio, and its visual capabilities are on par with Qwen2.5-VL-7B. In end-to-end speech instruction following, its performance is comparable to that with text input, and it performs strongly on benchmarks such as MMLU (Massive Multitask Language Understanding) for general knowledge and GSM8K for mathematical reasoning.
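The chunked-input, instant-output pattern described above can be sketched as a simple generator that emits a partial response as soon as each input chunk arrives, instead of waiting for the full input. The chunk size and the placeholder "model" step are illustrative assumptions, not the system's real inference code.

```python
# Toy sketch of chunked streaming: produce output per input chunk rather
# than after the whole input. The echo-style "model" is a placeholder.

def stream_chunks(tokens, chunk_size=4):
    """Yield one output segment as soon as each input chunk is available."""
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # A real system would run incremental inference here; we just tag
        # the chunk to show that output is emitted chunk by chunk.
        yield f"out({','.join(chunk)})"

segments = list(stream_chunks(["hi", "how", "are", "you", "doing"]))
```

The caller receives the first segment before the later chunks have even been read, which is what makes the interaction feel instantaneous.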
Qwen2.5-Omni outperforms similarly sized single-modality models and closed-source models such as Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-Pro across modalities including image, audio, and audio-video, and achieves state-of-the-art (SOTA) performance on the OmniBench multimodal benchmark. On single-modality tasks it also excels in multiple areas, including speech recognition (Common Voice), speech translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (seed-tts-eval and subjective naturalness assessments).
Qwen2.5-Omni is now available on Hugging Face, ModelScope, DashScope, and GitHub. Users can try its interactive features through the demo, or start a voice or video chat directly in Qwen Chat for a hands-on experience of the new model's capabilities.
Qwen Chat: https://chat.qwenlm.ai
Hugging Face: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
ModelScope: https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B
DashScope: https://help.aliyun.com/zh/model-studio/user-guide/qwen-omni