Xiaomi's official technology Weibo account announced that its large model team has made significant progress in audio reasoning. Inspired by DeepSeek-R1, the team pioneered the application of reinforcement learning algorithms to multimodal audio understanding tasks. In just one week, it reached a state-of-the-art (SOTA) accuracy of 64.5%, topping the internationally recognized MMAU audio understanding benchmark, and open-sourced the related technology.
The MMAU (Massive Multi-Task Audio Understanding and Reasoning) benchmark is a crucial standard for testing audio reasoning capabilities. It comprises 10,000 speech, environmental sound, and music samples designed to assess a model's performance across a range of skills. Human experts achieve 82.23% accuracy on this benchmark. Before this result, the best-performing model was OpenAI's GPT-4o at 57.3% accuracy, followed by Google DeepMind's Gemini 2.0 Flash at 55.6%.
In its research, the Xiaomi team initially fine-tuned on the AVQA dataset released by Tsinghua University, reaching 51.8% accuracy. The real breakthrough, however, came from applying DeepSeek-R1's Group Relative Policy Optimization (GRPO) algorithm to the Qwen2-Audio-7B model: using only 38,000 training samples from AVQA, the team achieved 64.5% accuracy, surpassing existing commercial models.
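The core idea of GRPO is to sample a group of candidate answers per question, score each with a reward, and normalize each reward against its own group's mean and standard deviation, so no separate value network (critic) is needed. A minimal sketch of that group-relative advantage step (the function name and the simple 0/1 rewards are illustrative, not from Xiaomi's released code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: score each sampled answer against
    the mean/std of its own group, replacing a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One audio question, a group of 4 sampled answers scored by a
# rule-based reward (1.0 = correct choice, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
# Correct answers receive positive advantages, incorrect ones
# negative, steering the policy toward high-reward outputs.
```

These advantages then weight the standard policy-gradient update on the model's token log-probabilities.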
The research team found that forcing the model to output its reasoning process during training actually decreased accuracy to 61.1%. This suggests that explicit chain-of-thought output can be detrimental during training, while the real-time feedback mechanism of reinforcement learning is more conducive to the model locking onto the distribution of high-quality answers. Despite the strong accuracy, a gap remains relative to the 82.23% human-expert level.
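The "real-time feedback" here is typically a rule-based accuracy reward: the model's final answer to a multiple-choice audio question is checked directly against the ground truth, with no learned reward model in the loop. A minimal sketch, assuming an `<answer>` tag convention for the final choice (the tag format and function name are assumptions for illustration):

```python
import re

def accuracy_reward(model_output: str, correct_choice: str) -> float:
    """Rule-based reward for multiple-choice audio QA:
    1.0 if the extracted answer letter matches the ground
    truth, else 0.0. No reward model is trained."""
    m = re.search(r"<answer>\s*([A-D])\s*</answer>", model_output)
    if m is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if m.group(1) == correct_choice else 0.0

print(accuracy_reward("<answer>B</answer>", "B"))  # 1.0
```

Because the reward depends only on the final answer, the model is free to allocate its capacity to getting the answer right rather than to producing a well-formed reasoning trace.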
The Xiaomi large model team's experimental results not only demonstrate the unique advantages of reinforcement learning in audio reasoning but also provide new avenues for future research. They have also open-sourced the training code, model parameters, and technical report to facilitate further research and exchange within academia and industry.
Training Code: https://github.com/xiaomi-research/r1-aqa
Model Parameters: https://huggingface.co/mispeech/r1-aqa
Technical Report: https://arxiv.org/abs/2503.11197
Interactive Demo: https://120.48.108.147:7860/
Key Highlights:
🔍 Xiaomi's large model team achieved a breakthrough in audio reasoning using reinforcement learning algorithms, reaching 64.5% accuracy.
📈 The MMAU benchmark is a crucial standard for audio reasoning capabilities; current human expert accuracy is 82.23%.
💡 The results indicate that reinforcement learning's real-time feedback is more effective for training than forcing explicit reasoning output; further research is needed to close the gap to human experts.