With OpenAI's o1 and DeepSeek's R1 models drawing significant attention, the reasoning capabilities and test-time scaling (TTS) techniques of large language models (LLMs) have become a major focus of research. However, accurately evaluating the quality of each step of a model's response on complex reasoning problems remains a challenge. To address this, Tsinghua University and Shanghai AI Lab have jointly proposed the Generative Process Reward Model (GenPRM), an innovative approach to process-supervised reasoning.
Traditional Process Reward Models (PRMs), while capable of verifying the correctness of reasoning steps, struggle to capture deeper logical errors because they reduce each step to a scalar score. Their discriminative modeling approach also limits how well they scale at test time. GenPRM addresses these limitations by incorporating generative chain-of-thought reasoning and code verification, and by introducing a test-time scaling mechanism, opening up a new research direction.
GenPRM's design mimics the human problem-solving process: the model produces a natural language analysis of each reasoning step, which makes step evaluation more transparent and interpretable. At the same time, GenPRM generates and executes Python code tied to the step, so that correctness is checked by execution rather than guessed. This "explain-then-verify" mechanism not only judges whether a step is right but also provides concrete suggestions for improvement, significantly enhancing the effectiveness of process supervision.
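To make this concrete, here is a minimal sketch of an explain-then-verify loop. The `generate` callable, the prompts, and the PASS/FAIL convention are illustrative assumptions rather than the released GenPRM interface, and model-generated code would need a real sandbox in practice.

```python
# Minimal sketch of an "explain-then-verify" judgment for one reasoning step.
# `generate` is a placeholder for any LLM call; prompts and conventions are assumptions.
import contextlib
import io

def run_verification_code(code: str) -> str:
    """Execute model-generated Python in a throwaway namespace and capture its stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustrative only; real use needs proper isolation
        return buffer.getvalue().strip()
    except Exception as exc:
        return f"ERROR: {exc}"

def judge_step(generate, problem: str, step: str) -> dict:
    """Critique one reasoning step in natural language, then check it with executed code."""
    # 1) Chain-of-thought critique of the step.
    critique = generate(
        f"Problem: {problem}\nStep: {step}\nAnalyze whether this step is correct."
    )
    # 2) Code that re-derives the step's claim so correctness can be executed, not estimated.
    code = generate(
        f"Write Python that checks this step and prints PASS or FAIL.\nStep: {step}"
    )
    verdict = run_verification_code(code)
    # 3) Final judgment combines the written critique with the executed check.
    return {"critique": critique, "code_output": verdict, "correct": verdict.endswith("PASS")}
```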
Remarkably, GenPRM outperformed GPT-4o using only 23K training samples. On mathematical reasoning benchmarks such as ProcessBench, the 1.5B-parameter GenPRM performed exceptionally well when boosted by test-time scaling, and the 7B-parameter version surpassed the 72B-parameter Qwen2.5-Math-PRM, demonstrating strong step-level critique capabilities.
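Test-time scaling for a generative reward model can be as simple as sampling several independent critiques of the same step and aggregating their verdicts. The sketch below uses majority voting over a `judge_once` callable (such as the one above); the aggregation actually used by GenPRM may differ.

```python
# Sketch of test-time scaling for a generative PRM: sample several independent
# judgments of the same step and take a majority vote (illustrative aggregation).
from collections import Counter

def scaled_judgment(judge_once, problem: str, step: str, n_samples: int = 8) -> bool:
    """judge_once is any callable returning True/False for one sampled critique."""
    votes = Counter(judge_once(problem, step) for _ in range(n_samples))
    return votes[True] >= votes[False]  # accept the step if most samples call it correct
```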
GenPRM also benefits from an efficient data synthesis method. Using Relative Progress Estimation (RPE) together with code verification, the authors generated high-quality process supervision data, greatly reducing the need for manually labeled examples. They synthesized candidate data with the QwQ-32B model and applied consensus filtering to retain only high-quality samples, yielding the 23K training set.
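The sketch below shows one plausible way RPE-derived labels and consensus filtering could fit together: a step is labeled by comparing Monte Carlo success estimates before and after it, and synthesized critiques are kept only when their verdict agrees with that label. The threshold, the exact RPE formula, and the sample fields are assumptions for illustration, not the paper's specification.

```python
# Hedged sketch of label synthesis: Relative Progress Estimation (RPE) compares
# Monte Carlo success estimates around a step; consensus filtering keeps samples
# whose generated verdict matches the RPE label. Details here are assumptions.

def rpe_label(mc_value_prev: float, mc_value_curr: float, threshold: float = 0.0) -> bool:
    """A step counts as correct if it does not reduce the estimated success rate."""
    return (mc_value_curr - mc_value_prev) >= threshold

def consensus_filter(samples):
    """Keep synthesized critiques whose verdict agrees with the RPE-derived label."""
    kept = []
    for s in samples:  # each s: {"mc_prev": float, "mc_curr": float, "model_verdict": bool, ...}
        if s["model_verdict"] == rpe_label(s["mc_prev"], s["mc_curr"]):
            kept.append(s)
    return kept
```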
Looking ahead, GenPRM can serve not only as an answer verifier but also as a "coach" that guides the iterative refinement of policy models through its feedback. This "generate-critique-reflect" loop offers a new path toward self-improvement in large language models, and may be extended to areas such as code generation and multimodal reasoning.