Researchers from DeepSeek and Tsinghua University recently published a paper exploring how to scale reward model inference, work seen by some as groundwork for DeepSeek R2. Reinforcement learning is widely used in the large-scale post-training of large language models (LLMs), but obtaining accurate reward signals for LLMs remains a challenge.


The researchers found that point-wise generative reward modeling (GRM) improves model adaptability and scalability at inference time. Building on this, they propose a learning method called Self-Principled Critique Tuning (SPCT) and use it to train the DeepSeek-GRM series of models; DeepSeek-GRM-27B, for example, is post-trained from Gemma-2-27B. Experiments show that SPCT significantly improves the quality and scalability of GRM, outperforming existing methods and models across multiple benchmarks. The researchers also introduce a meta reward model (meta RM) to guide the voting process, further improving scalability.
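The core idea of a point-wise GRM is that the reward model generates free-form text (principles plus a critique) and a numeric score for each response individually, rather than only ranking a fixed pair. A minimal sketch of the score-extraction step, assuming a hypothetical `Response N: X/10` output format (the paper's actual format may differ):

```python
import re

def extract_pointwise_scores(critique: str, num_responses: int) -> list[int]:
    """Parse one numeric score per response from a generated critique.
    The 'Response N: X/10' pattern is an illustrative assumption."""
    scores = [0] * num_responses
    for idx, val in re.findall(r"Response\s+(\d+):\s*(\d+)/10", critique):
        i = int(idx) - 1
        if 0 <= i < num_responses:
            scores[i] = int(val)
    return scores

critique = (
    "Principle: factual accuracy matters most.\n"
    "Response 1: 7/10 - mostly correct.\n"
    "Response 2: 4/10 - contains a factual error."
)
print(extract_pointwise_scores(critique, 2))  # → [7, 4]
```

Because each response receives its own scalar, the same model handles single responses, pairs, or larger candidate sets, which is what makes the approach flexible across input types.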


The SPCT method comprises two stages. The first is rejection-based fine-tuning, a cold-start phase that lets the GRM adapt to different input types and generate principles and critiques in the correct format. Here the researchers use point-wise GRM and introduce hinted sampling during data construction to improve consistency between predicted and ground-truth rewards. The second stage is rule-based online reinforcement learning, which uses rule-based rewards to push the GRM toward better principles and critiques, improving its scalability at inference time.
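The cold-start stage relies on a rejection-style data filter: sampled principle/critique trajectories are kept only when their predicted scores agree with the labeled best response. A sketch of one plausible filtering criterion (the paper's exact rule may differ):

```python
def rejection_filter(sampled_scores: list[list[int]], ground_truth_best: int) -> list[list[int]]:
    """Keep sampled score vectors whose unique argmax matches the labeled
    best response. Illustrative criterion for rejection-based fine-tuning."""
    kept = []
    for scores in sampled_scores:
        predicted_best = max(range(len(scores)), key=scores.__getitem__)
        # require a unique top score that agrees with the ground-truth label
        if predicted_best == ground_truth_best and scores.count(max(scores)) == 1:
            kept.append(scores)
    return kept

samples = [[7, 4], [5, 8], [6, 6]]  # three sampled scorings of two responses
print(rejection_filter(samples, ground_truth_best=0))  # → [[7, 4]]
```

Trajectories that survive the filter become supervised fine-tuning data, so the model starts online RL already producing well-formatted, roughly reward-consistent outputs.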

To further enhance DeepSeek-GRM performance, the team explored inference-time scaling strategies. By sampling multiple reward generations and voting over them, they expand the reward space and improve final reward quality. In parallel, a trained meta reward model guides the voting process by filtering out low-quality samples. Experimental results show strong overall performance from DeepSeek-GRM-27B, further improved by inference-time scaling. Ablation studies highlight the importance of online training for GRM and the crucial role of principle generation in model performance. The research also shows that inference-time scaling of DeepSeek-GRM-27B can outperform simply training a larger model.
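The voting step can be pictured as summing per-response scores across sampled reward generations, with the meta RM selecting which samples participate. A minimal sketch, assuming the meta RM outputs a scalar quality score per sample (names and shapes are illustrative):

```python
def meta_rm_guided_vote(sampled_scores: list[list[int]],
                        meta_scores: list[float],
                        top_k: int) -> list[int]:
    """Sum per-response rewards over the top_k sampled score vectors
    ranked by a (hypothetical) meta-RM quality score."""
    ranked = sorted(range(len(sampled_scores)),
                    key=lambda i: meta_scores[i], reverse=True)
    kept = ranked[:top_k]
    num_responses = len(sampled_scores[0])
    totals = [0] * num_responses
    for i in kept:
        for j in range(num_responses):
            totals[j] += sampled_scores[i][j]
    return totals

samples = [[7, 4], [6, 8], [2, 9]]   # three sampled scorings of two responses
meta = [0.9, 0.8, 0.1]               # meta RM quality score per sample
print(meta_rm_guided_vote(samples, meta, top_k=2))  # → [13, 12]
```

Discarding the low-quality third sample flips the vote: naive summation over all three would favor response 2, while the meta-RM-guided vote favors response 1, which is how filtering can improve the final reward signal.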

Key Highlights:

💡DeepSeek and Tsinghua researchers propose the Self-Principled Critique Tuning (SPCT) method and introduce a meta reward model (meta RM) to improve the scalability of reward model inference, producing the DeepSeek-GRM series of models.

🧪SPCT, consisting of rejection-based fine-tuning and rule-based online reinforcement learning, improves GRM quality and scalability, leading to DeepSeek-GRM-27B's superior performance in benchmark tests.

📈The research team explores inference-time scaling strategies, enhancing performance through generated reward voting and meta reward model-guided voting, demonstrating the effectiveness of inference-time scaling for DeepSeek-GRM-27B over simply increasing model size.

Paper Link:

https://arxiv.org/abs/2504.02495