Qwen2.5-Omni is a new-generation end-to-end multimodal flagship model from Alibaba Cloud's Tongyi Qianwen (Qwen) team. Designed for comprehensive multimodal perception, it seamlessly handles text, image, audio, and video inputs, and simultaneously generates text and natural synthesized speech as real-time streaming responses. Its Thinker-Talker architecture and TMRoPE positional encoding enable strong performance on multimodal tasks, particularly audio, video, and image understanding, and it outperforms similarly sized single-modality models on several benchmarks, demonstrating strong performance and broad application potential. Qwen2.5-Omni is currently available on Hugging Face, ModelScope, DashScope, and GitHub, giving developers a wide range of usage scenarios and development support.
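As a quick orientation, here is a minimal text-only quick-start sketch. It assumes a recent Hugging Face Transformers release that includes the Qwen2.5-Omni integration (the `Qwen2_5OmniForConditionalGeneration` and `Qwen2_5OmniProcessor` classes) and the published 7B checkpoint ID `Qwen/Qwen2.5-Omni-7B`; consult the model card on Hugging Face for the authoritative API, including how to pass images, audio, and video.

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "Qwen/Qwen2.5-Omni-7B"  # 7B checkpoint published on Hugging Face

# Load the Thinker-Talker model and its multimodal processor.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",  # bf16/fp16 where the hardware supports it
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)

# A text-only conversation; image/audio/video parts can be added to "content"
# following the format documented on the model card.
conversation = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Describe what Qwen2.5-Omni can do."}],
    }
]

text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# return_audio=False keeps generation text-only; with speech output enabled,
# the Talker also returns a synthesized waveform alongside the token IDs.
text_ids = model.generate(**inputs, return_audio=False, max_new_tokens=128)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

With speech output enabled, the same `generate()` call streams both the text tokens and the Talker's audio, which is what allows the model to "speak" its answer while it is still being produced.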