【Research Upends Conventional Wisdom】

A recent joint paper from Tsinghua University and Shanghai Jiao Tong University challenges a widely held industry belief: that pure reinforcement learning (RL), in particular reinforcement learning with verifiable rewards (RLVR), can expand the reasoning capabilities of large models. The research found that when many answers were sampled per problem, RL-trained models solved fewer problems than the base models they were trained from.


【Experimental Verification】

The research team ran systematic experiments in three domains: mathematics, coding, and visual reasoning.

  • Mathematical Tasks: On benchmarks such as GSM8K and MATH500, RL-trained models achieved higher accuracy at small sampling budgets (small k in pass@k), but at large k they covered significantly fewer problems than the base models.
  • Coding Tasks: The RLVR-trained model improved single-sample pass@1 on benchmarks such as HumanEval+, but its coverage at a large sampling budget (k=128) fell below that of the base model.
  • Visual Reasoning: For Qwen-2.5-VL-7B, results on multimodal tasks were consistent with the findings above: RL did not alter the model's fundamental problem-solving strategies.
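The results above all hinge on pass@k: the probability that at least one of k sampled answers to a problem is correct. A common way to compute it is the unbiased estimator introduced with the HumanEval benchmark: sample n ≥ k answers per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The Python sketch below assumes that estimator; the function names and the sample counts are hypothetical illustrations, not figures from the study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of answers sampled for the problem
    c: how many of those answers were verified correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Any k of the n samples must include at least one correct answer.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement are all wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical problem: 2 of 128 sampled answers are correct.
print(pass_at_k(128, 2, 1))    # 0.015625: rarely solved on a single attempt
print(pass_at_k(128, 2, 128))  # 1.0: counted as covered at k = 128
```

This is why a model can score lower at pass@1 yet cover more problems at large k: any problem with at least one correct sample counts toward coverage once k is large enough.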


【Academic Controversy】

The research results have sparked heated debate in academia:

  • Supporters of the findings argue that RL improves sampling efficiency but limits the development of reasoning capabilities.
  • Critics suggest that the problem may lie in flawed reward structures rather than in RL itself.
  • Others take a middle position, proposing that RL be combined with other methods, such as distillation, to enhance reasoning.

【Essential Considerations】

The research team proposes a key distinction:

  • Capability: the range of problems the model can solve at all, and the reasoning chains it is able to produce.
  • Efficiency: how quickly and reliably the model reaches correct answers within that capability.

Reinforcement learning acts more like a "capability regulator" than a "capability creator": it helps models perform reliably on problems they can already solve, but it struggles to develop new reasoning pathways.
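The capability/efficiency distinction can be made concrete with a toy model. Assume each problem has a fixed per-sample success probability p and that samples are independent, so pass@k = 1 - (1 - p)^k. The sketch below compares a hypothetical base model that solves hard problems only occasionally with a hypothetical RL-tuned model whose probabilities have been sharpened on the easier problems while the rarely solved ones dropped to zero. All numbers are invented for illustration and do not come from the study.

```python
def expected_pass_at_k(success_probs: list[float], k: int) -> float:
    """Mean pass@k over problems, assuming k independent samples per problem."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem single-sample success probabilities (toy values).
base_model = [0.30, 0.30, 0.05, 0.02]  # broad but unreliable coverage
rl_model = [0.90, 0.90, 0.00, 0.00]    # sharpened on two problems, lost the other two

for k in (1, 128):
    base = expected_pass_at_k(base_model, k)
    rl = expected_pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-tuned model scores higher at k = 1 (better efficiency) but lower at k = 128 (narrower capability), the same shape as the pattern the researchers report.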

【Industry Implications】

This research serves as a wake-up call amid the overheated enthusiasm for RL training of large models, suggesting that the industry should:

  1. Pay more attention to the representational capacity and knowledge organization of base models.
  2. Clearly distinguish between capability enhancement and efficiency optimization goals.
  3. Establish a more rigorous evaluation system for reasoning capabilities.