Recently, researchers from the Alibaba Qwen team released a new benchmark called "PROCESSBENCH," designed to measure how well language models can identify erroneous steps in mathematical reasoning. Although language models have made significant progress on complex reasoning tasks, they still struggle with certain difficult problems, which makes effective supervision of the reasoning process especially important.
Current evaluation benchmarks for language models have two main shortcomings. On one hand, some problem sets have become too easy for advanced models; on the other hand, existing evaluations typically give only a binary correct/incorrect verdict on the final answer, without annotating where the reasoning goes wrong. This highlights the need for a more comprehensive evaluation framework that probes the reasoning process of capable language models in greater depth.
To fill this gap, the researchers designed "PROCESSBENCH," which focuses on identifying erroneous steps in mathematical reasoning. Its design rests on three principles: problem difficulty, solution diversity, and comprehensive evaluation. The benchmark targets competition- and Olympiad-level math problems and uses multiple open-source language models to generate solutions that reflect different problem-solving approaches. PROCESSBENCH contains 3,400 test cases, each carefully annotated by multiple human experts to ensure data quality and the reliability of the evaluation.
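For concreteness, the core task can be sketched as follows: a judge model is shown a problem and its step-by-step solution and must return the index of the earliest incorrect step, or signal that every step is correct. The prompt below is an illustrative sketch only; it is not the exact template used by the benchmark, and the step-tagging format is an assumption.

```python
# Illustrative sketch of the PROCESSBENCH task format (not the benchmark's exact prompt).
# The judge is asked for the index of the earliest incorrect step, or -1 if all steps are correct.

def build_critique_prompt(problem: str, steps: list[str]) -> str:
    # Tag each solution step with an index so the judge can refer to it.
    tagged = "\n".join(f"<step_{i}>{s}</step_{i}>" for i, s in enumerate(steps))
    return (
        "Below is a math problem and a step-by-step solution.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Solution:\n{tagged}\n\n"
        "Identify the earliest step that contains an error. "
        "Answer with that step's index, or -1 if all steps are correct."
    )
```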
During development, the research team collected math problems from four well-known datasets (GSM8K, MATH, OlympiadBench, and Omni-MATH) to cover difficulties ranging from elementary to competition level. They generated up to 12 different solutions per problem with open-source models to increase solution diversity, and reformatted the solutions into clearly separated, logically complete steps to standardize their format.
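The released data can be inspected with standard tooling. The sketch below assumes the benchmark is published on the Hugging Face Hub under the name "Qwen/ProcessBench", with one split per source dataset and fields such as "problem", "steps", and "label"; these identifiers are assumptions for illustration and are not confirmed by the article.

```python
# Hedged sketch: inspect PROCESSBENCH with the Hugging Face `datasets` library.
# The dataset name, split names, and field names below are assumptions, not confirmed here.
from datasets import load_dataset

for subset in ["gsm8k", "math", "olympiadbench", "omnimath"]:
    ds = load_dataset("Qwen/ProcessBench", split=subset)  # assumed split layout
    print(subset, len(ds), "test cases")

example = ds[0]
print(example["problem"])  # problem statement
print(example["steps"])    # reformatted step-by-step solution
print(example["label"])    # index of the earliest wrong step, or -1 if all steps are correct
```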
The findings show that existing process reward models perform poorly on the more difficult problem sets: beyond the relatively simple problems they were trained on, they are outperformed by critic models, that is, general language models prompted to judge each step. The study also reveals a key limitation of existing models in evaluating mathematical reasoning: when a solution reaches the correct final answer through flawed intermediate steps, they find it difficult to detect the error.
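To make such comparisons concrete, the sketch below computes an F1-style score over the two kinds of test cases (solutions containing an error versus fully correct ones), in the spirit of the benchmark's evaluation. The function name and the convention of using -1 for fully correct solutions are assumptions for illustration.

```python
# Hedged sketch of an F1-style PROCESSBENCH score: accuracy on erroneous solutions,
# accuracy on fully correct solutions, and their harmonic mean.
def processbench_score(predictions: list[int], labels: list[int]) -> dict:
    """predictions/labels: earliest wrong step index, or -1 if all steps are correct."""
    err = [(p, l) for p, l in zip(predictions, labels) if l != -1]   # erroneous solutions
    cor = [(p, l) for p, l in zip(predictions, labels) if l == -1]   # fully correct solutions
    acc_err = sum(p == l for p, l in err) / max(len(err), 1)
    acc_cor = sum(p == l for p, l in cor) / max(len(cor), 1)
    f1 = 0.0 if acc_err + acc_cor == 0 else 2 * acc_err * acc_cor / (acc_err + acc_cor)
    return {"error_acc": acc_err, "correct_acc": acc_cor, "f1": f1}
```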
As a pioneering benchmark for assessing language models' ability to identify errors in mathematical reasoning, PROCESSBENCH provides an important framework for future research, advancing the understanding and improvement of AI in the reasoning process.
Paper & code: https://github.com/QwenLM/ProcessBench
Highlights:
🌟 The new benchmark "PROCESSBENCH" launched by the research team aims to assess the ability of language models to identify errors in mathematical reasoning.
📊 PROCESSBENCH includes 3,400 test cases covering a variety of math problems, all carefully annotated by experts.
🔍 The research found that existing process reward models perform poorly on high-difficulty problems, highlighting the need to improve their error identification strategies.