Google DeepMind's research team has recently made a significant breakthrough, developing a technique called SCoRe (Self-Correction via Reinforcement Learning) that addresses a longstanding limitation of large language models (LLMs): their inability to identify and fix their own errors without relying on multiple models or external checks.

The core of SCoRe lies in its two-stage approach. The first stage optimizes a model initialization that produces effective corrections on the second attempt while keeping the first response close to the base model's output. The second stage applies multi-turn reinforcement learning to improve both the first and second answers. The method is distinctive in that it uses only self-generated training data: the model creates its own examples by solving problems and then attempting to improve its solutions.
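To make the shape of this two-stage recipe concrete, here is a minimal, purely illustrative Python sketch. Every name in it (the toy "model", `sample_answer`, `policy_gradient_step`, the scalar drift penalty standing in for a KL constraint) is a hypothetical stand-in chosen for readability, not DeepMind's actual implementation.

```python
import random

# Illustrative sketch of SCoRe's two-stage recipe on self-generated data.
# All helpers are hypothetical stand-ins, not DeepMind's actual code.

def sample_answer(model, problem, prior_attempt=None):
    # Stand-in sampler: a real system would decode from the LLM, conditioned
    # on its own earlier attempt when generating the correction turn.
    p_correct = model["skill"] + (0.1 if prior_attempt is not None else 0.0)
    return problem["answer"] if random.random() < p_correct else "wrong"

def reward(answer, problem):
    # Binary correctness reward, e.g. exact match against a reference answer.
    return 1.0 if answer == problem["answer"] else 0.0

def policy_gradient_step(model, objective, lr=0.01):
    # Stand-in update: nudge the toy "policy" toward a higher objective.
    model["skill"] = max(0.0, min(1.0, model["skill"] + lr * objective))

def stage_one(model, base_model, problems, beta=0.1):
    # Stage I: optimize the *second* attempt while a KL-style penalty keeps
    # the *first* attempt close to the base model (a scalar proxy here).
    for problem in problems:
        first = sample_answer(model, problem)
        second = sample_answer(model, problem, prior_attempt=first)
        drift_penalty = beta * abs(model["skill"] - base_model["skill"])
        policy_gradient_step(model, reward(second, problem) - drift_penalty)

def stage_two(model, problems, bonus=0.5):
    # Stage II: multi-turn RL rewards both attempts, with a shaped bonus
    # for actually improving between attempt 1 and attempt 2.
    for problem in problems:
        first = sample_answer(model, problem)
        second = sample_answer(model, problem, prior_attempt=first)
        r1, r2 = reward(first, problem), reward(second, problem)
        policy_gradient_step(model, r1 + r2 + bonus * (r2 - r1))

problems = [{"question": "2+2", "answer": "4"}] * 100  # toy dataset
model, base_model = {"skill": 0.3}, {"skill": 0.3}
stage_one(model, base_model, problems)
stage_two(model, problems)
```

Note how the Stage II objective rewards both attempts but adds a progress bonus for the first-to-second improvement, which is what pushes the model toward genuine correction rather than simply repeating a good first answer.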


In practical tests, SCoRe delivered significant performance gains. Using Google's Gemini 1.0 Pro and 1.5 Flash models, self-correction on mathematical reasoning improved by 15.6 percentage points on the MATH benchmark, and code-generation performance on HumanEval improved by 9.1 percentage points. These results indicate substantial progress in enhancing the self-correction abilities of AI models.
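For context, a self-correction gain expressed in percentage points is conventionally the accuracy after the second attempt minus the accuracy after the first, measured over the same problems. The following small sketch shows that calculation; the function name and sample data are illustrative, not taken from the benchmark.

```python
def self_correction_gain(first_correct, second_correct):
    # Each list holds one boolean per problem: was that attempt correct?
    acc_first = sum(first_correct) / len(first_correct)
    acc_second = sum(second_correct) / len(second_correct)
    return (acc_second - acc_first) * 100  # gain in percentage points

# Toy data: attempt 2 fixes two problems that attempt 1 got wrong.
first = [True, False, False, True, False]
second = [True, True, False, True, True]
print(f"self-correction gain: {self_correction_gain(first, second):.1f} pp")
```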

Researchers emphasize that SCoRe is the first method to achieve meaningful positive self-correction, allowing models to improve answers without external feedback. However, the current version of SCoRe only undergoes one round of self-correction training, and future research may explore the possibility of multiple correction steps.

This research by the DeepMind team reveals an important insight: teaching meta-strategies like self-correction requires going beyond standard language model training methods. Multi-stage reinforcement learning opens up new possibilities in the AI field, potentially driving the development of smarter and more reliable AI systems.

This breakthrough not only demonstrates the potential for AI self-improvement but also offers a new angle on the reliability and accuracy problems of large language models, with potentially profound implications for the future development of AI applications.