Today, the Alibaba Cloud Tongyi team officially released a new mathematical reasoning process reward model, Qwen2.5-Math-PRM. The model comes in two sizes, 72B and 7B, both of which significantly outperform comparable open-source process reward models, particularly at identifying reasoning errors.
Notably, the 7B version of Qwen2.5-Math-PRM surpasses the widely used GPT-4o, an important milestone for Alibaba Cloud's development of reasoning models. To comprehensively assess model performance in mathematical reasoning, the Tongyi team also open-sourced ProcessBench, the first step-level evaluation benchmark. It covers 3,400 mathematical test cases, including challenging problems at the level of the International Mathematical Olympiad, with each case annotated by human experts to ensure a scientific and comprehensive evaluation.
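A step-level benchmark of this kind typically asks a model to locate the earliest erroneous step in a solution (or declare the solution fully correct), then combines accuracy on the erroneous and correct subsets. The sketch below illustrates one such scoring scheme; the field names and the harmonic-mean aggregation are assumptions for illustration, not ProcessBench's published schema.

```python
# Hypothetical sketch of a step-level evaluation metric: each test case
# carries a 'label' (index of the earliest erroneous step, or -1 if the
# solution is fully correct) and a 'pred' (the model's prediction).
# Accuracies on the two subsets are combined via their harmonic mean.

def step_level_score(cases):
    """cases: list of dicts with 'label' and 'pred' step indices."""
    err = [c for c in cases if c["label"] != -1]   # solutions with an error
    ok = [c for c in cases if c["label"] == -1]    # fully correct solutions
    acc_err = sum(c["pred"] == c["label"] for c in err) / len(err) if err else 0.0
    acc_ok = sum(c["pred"] == c["label"] for c in ok) / len(ok) if ok else 0.0
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)  # F1-style combination

cases = [
    {"label": 2, "pred": 2},    # error at step 2, correctly located
    {"label": 1, "pred": 3},    # error at step 1, missed
    {"label": -1, "pred": -1},  # correct solution, recognized as such
    {"label": -1, "pred": 0},   # correct solution, false alarm
]
print(step_level_score(cases))  # 0.5: both subset accuracies are 0.5
```

Reporting a combined score prevents a trivial strategy (always flagging an error, or never flagging one) from scoring well on only one subset.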
Evaluating Qwen2.5-Math-PRM on ProcessBench, the research team found that both the 72B and 7B models performed exceptionally well. Notably, the 7B version not only surpassed other open-source models of the same size but even exceeded the closed-source GPT-4o-0806 in certain respects. This demonstrates the strong potential of process reward models (PRMs) for improving reasoning reliability, and offers new insights for the future development of reasoning process supervision techniques.
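One common way a PRM improves reasoning reliability at inference time is best-of-N selection: score every step of several candidate solutions and keep the candidate whose weakest step scores highest. The sketch below illustrates that idea; the "min" aggregation and the dummy scores are illustrative assumptions, not Qwen2.5-Math-PRM's documented interface.

```python
# Illustrative best-of-N selection with a process reward model (PRM).
# step_scores[i][j] stands in for the PRM's score (in [0, 1]) of step j
# of candidate solution i; real scores would come from the model itself.

def select_best(candidates, step_scores):
    """Return the candidate whose minimum step score is largest,
    i.e. the solution with the most trustworthy weakest step."""
    best_i = max(range(len(candidates)),
                 key=lambda i: min(step_scores[i]))
    return candidates[best_i]

candidates = [
    ["step A1", "step A2"],  # solution A
    ["step B1", "step B2"],  # solution B
]
step_scores = [[0.9, 0.2], [0.7, 0.8]]  # dummy PRM scores

# Solution B wins: its weakest step (0.7) beats A's weakest step (0.2).
print(select_best(candidates, step_scores))
```

Min-aggregation reflects the intuition that a single flawed step can invalidate an otherwise plausible chain of reasoning; averaging or product aggregation are common alternatives.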
The innovative work of the Alibaba Cloud Tongyi team not only advances artificial intelligence reasoning technology but also provides a valuable reference for other developers in the industry. By open-sourcing these models and benchmarks, the Tongyi team hopes to share its experience with more researchers and promote technological progress across the industry.