Research teams from Stanford University and the University of Washington recently released a groundbreaking AI training method called s1. Its core idea is to substantially improve a language model's reasoning ability with a minimal test-time scaling technique. Unlike earlier approaches that relied on massive compute or complex algorithms, s1 achieves its performance gains simply by controlling how much computation the model spends at test time.

The researchers first curated a small dataset named s1K, containing 1,000 high-quality reasoning questions. The selection criteria were strict: every question had to satisfy three conditions, namely high difficulty, broad diversity, and high quality. Detailed ablation experiments confirmed that all three criteria matter; random selection, or focusing on only one criterion, led to a significant drop in performance. Notably, even training on a superset of 59,000 samples did not match the results of the carefully selected 1,000, underscoring how critical data selection is.
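For intuition, the selection process can be pictured as a three-stage filter. The sketch below only illustrates that idea: the field names, the difficulty proxy, and the sampling scheme are assumptions made for the example, not the authors' actual pipeline.

```python
# A minimal sketch of a three-stage selection filter (quality -> difficulty -> diversity).
# Field names ("well_formatted", "trace_length", "domain", ...) and the difficulty proxy
# are illustrative assumptions, not the authors' exact pipeline.
from collections import defaultdict
import random

def select_reasoning_subset(candidates, target_size=1000):
    # 1) Quality: drop malformed samples (bad formatting, missing answers, ...).
    pool = [s for s in candidates if s["well_formatted"] and s["has_answer"]]
    # 2) Difficulty: keep questions a baseline model still gets wrong, preferring
    #    those with longer reasoning traces as a rough proxy for hardness.
    pool = [s for s in pool if not s["solved_by_baseline"]]
    pool.sort(key=lambda s: s["trace_length"], reverse=True)
    # 3) Diversity: spread the final picks across topic domains so that no
    #    single subject dominates the selected questions.
    by_domain = defaultdict(list)
    for s in pool:
        by_domain[s["domain"]].append(s)
    selected, domains = [], list(by_domain)
    while len(selected) < target_size and domains:
        d = random.choice(domains)
        if by_domain[d]:
            selected.append(by_domain[d].pop(0))
        else:
            domains.remove(d)
    return selected
```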


After training the model, the researchers used a technique called "budget forcing" to control the amount of computation spent at test time. In simple terms, the method either cuts thinking short, by forcing the model to end its reasoning once a token budget is exhausted, or extends it, by appending "Wait" when the model tries to stop too early. The extra thinking time pushes the model to explore and verify more deeply, repeatedly recheck its reasoning steps, and correct its own errors.
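As an illustration, budget forcing can be sketched as a thin wrapper around a decoding call. The `generate` helper, the end-of-thinking delimiter, and the budget parameters below are hypothetical; this is a sketch of the idea, not the authors' implementation.

```python
# A minimal sketch of budget forcing, assuming a hypothetical `generate(prompt, stop,
# max_new_tokens)` decoding helper and an end-of-thinking delimiter string.
END_OF_THINKING = "<|end_of_thinking|>"  # hypothetical delimiter

def budget_forced_reasoning(generate, prompt, min_tokens=512, max_tokens=4096, max_waits=2):
    """Cap or extend a model's reasoning trace to fit a test-time compute budget."""
    trace = generate(prompt, stop=END_OF_THINKING, max_new_tokens=max_tokens)
    waits = 0
    # Extend: if the model stops "too early", suppress the end-of-thinking delimiter
    # and append "Wait", nudging it to keep reasoning and re-check its steps.
    # (len(trace.split()) is a crude word-count proxy for the token budget.)
    while len(trace.split()) < min_tokens and waits < max_waits:
        trace += "\nWait"
        trace += generate(prompt + trace, stop=END_OF_THINKING,
                          max_new_tokens=max_tokens - len(trace.split()))
        waits += 1
    # Cap: once the budget is spent, forcibly terminate thinking so the model
    # moves on to producing its final answer.
    return trace + END_OF_THINKING
```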

Experimental results show that after fine-tuning on s1K and applying budget forcing, the s1-32B model outperformed OpenAI's o1-preview by up to 27% on competition-level math problems. Even more strikingly, scaling with budget forcing let s1-32B extrapolate beyond its performance without test-time intervention, lifting its AIME24 score from 50% to 57%.


The core contribution of this research is a simple, efficient recipe for building a dataset that teaches strong reasoning and for scaling performance at test time. On this basis, the team built s1-32B, a model that rivals or even surpasses closed-source models while remaining open source and highly sample-efficient. The code, model, and data have been released on GitHub.

The researchers also ran in-depth ablations on the data and on test-time scaling. On the data side, they found that difficulty, diversity, and quality must be considered together. On the scaling side, budget forcing showed excellent controllability and consistent performance gains. The work also compares two ways of scaling test-time compute, parallel and sequential, and examines more advanced techniques such as REBASE, offering useful pointers for future research.
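To make the parallel-versus-sequential distinction concrete, parallel scaling can be sketched as majority voting over independent samples, while sequential scaling extends a single trace as in the budget forcing sketch above. The `generate` and `extract_answer` helpers here are assumed placeholders, and REBASE is not reproduced.

```python
# A rough contrast between the two scaling modes, using assumed `generate` and
# `extract_answer` helpers; the paper's exact setups (and REBASE) are not shown.
from collections import Counter

def parallel_scaling(generate, extract_answer, prompt, n_samples=8):
    """Parallel scaling: draw several independent solutions and majority-vote.

    Sequential scaling, by contrast, extends one trace step by step (as in the
    budget forcing sketch above) so later computation builds on earlier results.
    """
    answers = [extract_answer(generate(prompt, temperature=0.7))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```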

This research not only offers a low-cost, high-return approach to AI training but also lays a solid foundation for broader AI applications.