Xiaomi's large-model team recently achieved a breakthrough in audio reasoning, successfully applying reinforcement learning to multi-modal audio understanding. Their model reached an accuracy of 64.5%, securing first place on the prestigious MMAU audio understanding benchmark.

The MMAU (Massive Multi-Task Audio Understanding and Reasoning) benchmark is a crucial standard for measuring audio reasoning capabilities. It tests model performance on complex reasoning tasks by analyzing diverse audio samples, including speech, environmental sounds, and music. Human expert accuracy is 82.23%, while the previous top-performing model, OpenAI's GPT-4o, achieved 57.3%. Xiaomi's achievement is particularly noteworthy in this context.


The team employed Group Relative Policy Optimization (GRPO), the reinforcement learning method used by DeepSeek-R1. Its "trial-and-error plus reward" mechanism lets the model improve autonomously, exhibiting human-like reflection and reasoning. Notably, with only 38,000 training samples, reinforcement learning pushed Xiaomi's model to 64.5% accuracy on the MMAU benchmark, surpassing the previous leader by more than 7 percentage points.
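At its core, GRPO replaces the learned value critic of classic policy-gradient methods with a group-relative baseline: for each question, the model samples a group of candidate answers, each answer is scored by a rule-based reward (e.g. 1 for a correct choice, 0 otherwise), and each answer's advantage is its reward normalized against the mean and standard deviation of its own group. A minimal sketch of that normalization step (the function name and the 0/1 correctness reward are illustrative assumptions, not taken from the team's released code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled answer against the
    mean and std of its own group, so no separate value network (critic)
    is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one audio question; two are correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
# Correct answers receive positive advantage, incorrect ones negative,
# which is what steers the policy update toward the better samples.
```

Because the baseline comes from the group itself, correct and incorrect answers to the same question are pushed apart even when the absolute reward signal is very sparse, which is one reason the approach can work with relatively few training samples.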

Furthermore, experiments showed that forcing the model to write out an explicit chain of thought actually lowered accuracy, suggesting that reasoning learned implicitly during training is more effective here. Despite this significant result, the Xiaomi team acknowledges that there is still a gap to human expert performance, and they plan to keep refining their reinforcement learning strategy to improve reasoning further.
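To make the contrast concrete, the two output formats being compared might look like the following. These templates are hypothetical stand-ins for illustration; the exact prompts and tags are described in the team's technical report, not reproduced here.

```python
# Hypothetical output formats for an audio multiple-choice question.

# Explicit chain-of-thought: the model must verbalize its reasoning
# before committing to an answer.
explicit_format = (
    "<think>The recording contains birdsong and wind, so it is an "
    "outdoor scene...</think>\n"
    "<answer>B</answer>"
)

# Implicit reasoning: the model answers directly; any reasoning stays
# internal to its activations rather than being written out.
implicit_format = "<answer>B</answer>"

def extract_answer(output: str) -> str:
    """Pull the final choice out of either format for reward scoring."""
    start = output.index("<answer>") + len("<answer>")
    return output[start:output.index("</answer>")]
```

The finding reported above is that training and evaluating with the second, answer-only style scored higher on MMAU than requiring the verbalized reasoning of the first.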

This research success not only demonstrates the potential of reinforcement learning in audio understanding but also paves the way for a future intelligent auditory era. As machines become capable of not only "hearing" sounds but also "understanding" the underlying causal logic, intelligent audio technology will experience new development opportunities. The Xiaomi team will also open-source the training code and model parameters to facilitate further research and exchange within academia and industry.

Training Code: https://github.com/xiaomi-research/r1-aqa

Model Parameters: https://huggingface.co/mispeech/r1-aqa

Technical Report: https://arxiv.org/abs/2503.11197

Interactive Demo: https://120.48.108.147:7860/