Researchers from DeepSeek and Tsinghua University recently published a paper exploring how to scale reward model inference, work seen by some as groundwork for DeepSeek R2. Reinforcement learning is widely used in the large-scale post-training of large language models (LLMs), but obtaining accurate reward signals for LLMs remains a challenge.


The researchers found that point-wise generative reward modeling (GRM) improves model adaptability and scalability at inference time. Building on this, they propose a learning method called Self-Principled Critique Tuning (SPCT) and use it to train the DeepSeek-GRM series of models; DeepSeek-GRM-27B, for example, is post-trained from Gemma-2-27B. Experiments show that SPCT significantly improves the quality and scalability of GRM, outperforming existing methods and models across multiple benchmarks. The researchers also introduce a meta reward model (meta RM) to guide the voting process, further improving scalability.
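The core idea of a point-wise GRM is that the reward model generates free-form text (principles plus a critique) and a numeric score for each response individually, rather than only ranking a fixed pair. A minimal sketch of the score-extraction step, assuming a hypothetical `Response N: X/10` output format (the paper's actual format may differ):

```python
import re

def extract_pointwise_scores(critique: str, num_responses: int) -> list[int]:
    """Parse one numeric score per response from a generated critique.
    The 'Response N: X/10' pattern is an illustrative assumption."""
    scores = [0] * num_responses
    for idx, val in re.findall(r"Response\s+(\d+):\s*(\d+)/10", critique):
        i = int(idx) - 1
        if 0 <= i < num_responses:
            scores[i] = int(val)
    return scores

critique = (
    "Principle: factual accuracy matters most.\n"
    "Response 1: 7/10 - mostly correct.\n"
    "Response 2: 4/10 - contains a factual error."
)
print(extract_pointwise_scores(critique, 2))  # → [7, 4]
```

Because each response receives its own scalar, the same model handles single responses, pairs, or larger candidate sets, which is what makes the approach flexible across input types.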


The SPCT method comprises two stages. The first is rejection-based fine-tuning, a cold-start phase that lets the GRM adapt to different input types and generate principles and critiques in the correct format. Here the researchers use point-wise GRM and introduce hinted sampling during data construction to improve consistency between predicted and ground-truth rewards. The second stage is rule-based online reinforcement learning, which uses rule-based rewards to push the GRM toward better principles and critiques, improving its scalability at inference time.
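The cold-start stage relies on a rejection-style data filter: sampled principle/critique trajectories are kept only when their predicted scores agree with the labeled best response. A sketch of one plausible filtering criterion (the paper's exact rule may differ):

```python
def rejection_filter(sampled_scores: list[list[int]], ground_truth_best: int) -> list[list[int]]:
    """Keep sampled score vectors whose unique argmax matches the labeled
    best response. Illustrative criterion for rejection-based fine-tuning."""
    kept = []
    for scores in sampled_scores:
        predicted_best = max(range(len(scores)), key=scores.__getitem__)
        # require a unique top score that agrees with the ground-truth label
        if predicted_best == ground_truth_best and scores.count(max(scores)) == 1:
            kept.append(scores)
    return kept

samples = [[7, 4], [5, 8], [6, 6]]  # three sampled scorings of two responses
print(rejection_filter(samples, ground_truth_best=0))  # → [[7, 4]]
```

Trajectories that survive the filter become supervised fine-tuning data, so the model starts online RL already producing well-formatted, roughly reward-consistent outputs.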

To further enhance DeepSeek-GRM performance, the team explored inference-time scaling strategies. By sampling multiple reward generations and voting over them, they expand the reward space and improve final reward quality. In parallel, a trained meta reward model guides the voting process by filtering out low-quality samples. Experimental results show strong overall performance from DeepSeek-GRM-27B, further improved by inference-time scaling. Ablation studies highlight the importance of online training for GRM and the crucial role of principle generation in model performance. The research also shows that inference-time scaling of DeepSeek-GRM-27B can outperform simply training a larger model.
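The voting step can be pictured as summing per-response scores across sampled reward generations, with the meta RM selecting which samples participate. A minimal sketch, assuming the meta RM outputs a scalar quality score per sample (names and shapes are illustrative):

```python
def meta_rm_guided_vote(sampled_scores: list[list[int]],
                        meta_scores: list[float],
                        top_k: int) -> list[int]:
    """Sum per-response rewards over the top_k sampled score vectors
    ranked by a (hypothetical) meta-RM quality score."""
    ranked = sorted(range(len(sampled_scores)),
                    key=lambda i: meta_scores[i], reverse=True)
    kept = ranked[:top_k]
    num_responses = len(sampled_scores[0])
    totals = [0] * num_responses
    for i in kept:
        for j in range(num_responses):
            totals[j] += sampled_scores[i][j]
    return totals

samples = [[7, 4], [6, 8], [2, 9]]   # three sampled scorings of two responses
meta = [0.9, 0.8, 0.1]               # meta RM quality score per sample
print(meta_rm_guided_vote(samples, meta, top_k=2))  # → [13, 12]
```

Discarding the low-quality third sample flips the vote: naive summation over all three would favor response 2, while the meta-RM-guided vote favors response 1, which is how filtering can improve the final reward signal.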

Key Highlights:

💡DeepSeek and Tsinghua researchers propose the Self-Principled Critique Tuning (SPCT) method and introduce a meta reward model (meta RM) to improve the scalability of reward model inference, producing the DeepSeek-GRM series of models.

🧪SPCT, consisting of rejection-based fine-tuning and rule-based online reinforcement learning, improves GRM quality and scalability, leading to DeepSeek-GRM-27B's superior performance in benchmark tests.

📈The research team explores inference-time scaling strategies, enhancing performance through generated reward voting and meta reward model-guided voting, demonstrating the effectiveness of inference-time scaling for DeepSeek-GRM-27B over simply increasing model size.

Paper Link:

https://arxiv.org/abs/2504.02495