In the fiercely competitive field of artificial intelligence, a multi-million dollar large-scale experiment is quietly reshaping how large language models are trained. The StepStar research team recently released groundbreaking research revealing a universal scaling law dubbed "Step Law," derived by training 3,700 models of varying sizes from scratch, consuming nearly 1 million NVIDIA H800 GPU hours and processing a staggering 100 trillion tokens. The result is a new guideline for efficient large language model training.

This research goes beyond hyperparameter optimization alone; it is the first comprehensive study of how stable the optimal hyperparameters remain across different model shapes, sparsity levels, and data distributions. The results demonstrate that Step Law is remarkably robust regardless of model architecture, training data language, or domain, which significantly enhances its practical value.

The team trained 3,700 models encompassing diverse scales, hyperparameter combinations, shapes, data ratios, and sparsity levels, including both MoE and Dense architectures. Through these extensive experiments, they discovered that the optimal learning rate exhibits a power-law relationship with model and data size, while the optimal batch size is primarily correlated with data size. This finding challenges conventional wisdom regarding hyperparameter settings.
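
To make these relationships concrete, here is a minimal Python sketch of the functional forms involved. It is illustrative only: the function names, constants, and exponents are placeholders, not the paper's fitted values.

```python
# Hypothetical power-law forms in the spirit of Step Law. All constants
# and exponents below are illustrative placeholders, not fitted values.
def optimal_lr(n_params: float, n_tokens: float,
               c: float = 1.0, alpha: float = -0.5, beta: float = 0.1) -> float:
    """Optimal learning rate as a power law of model size N and data size D."""
    return c * (n_params ** alpha) * (n_tokens ** beta)

def optimal_bs(n_tokens: float, k: float = 1.0, gamma: float = 0.3) -> float:
    """Optimal batch size (in tokens), depending primarily on data size D."""
    return k * (n_tokens ** gamma)
```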


Experimental data shows that, for a fixed model and data size, the hyperparameter optimization landscape exhibits a clearly convex shape, implying a stable and easily identifiable region of optimal hyperparameters. To verify this, the team constructed a 3D visualization that intuitively shows how learning rate and batch size affect training loss. The results clearly reveal a "valley" shape with a relatively flat floor, providing valuable support for practical hyperparameter tuning.
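
To give a feel for what such a landscape looks like, the sketch below plots a synthetic convex "valley" over learning rate and batch size. The surface is fabricated purely for visualization and is not the team's measured data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic convex loss surface over (log LR, log BS), mimicking the
# reported "valley" shape; coefficients are arbitrary for illustration.
log_lr = np.linspace(-5, -2, 60)   # log10 of learning rate
log_bs = np.linspace(4, 8, 60)     # log10 of batch size (tokens)
LR, BS = np.meshgrid(log_lr, log_bs)
loss = 2.0 + 0.15 * (LR + 3.5) ** 2 + 0.05 * (BS - 6.0) ** 2  # convex bowl

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(LR, BS, loss, cmap="viridis")
ax.set_xlabel("log10(learning rate)")
ax.set_ylabel("log10(batch size)")
ax.set_zlabel("training loss")
plt.show()
```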

To benefit the entire AI community, the team developed and released a general-purpose optimal hyperparameter estimation tool. The tool's predictions differ from the globally optimal hyperparameters obtained through exhaustive search by only 0.09%. This means researchers and engineers can bypass costly grid searches and directly obtain near-optimal hyperparameter configurations.
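
Assuming fitted functions like the sketch above, using such an estimator in place of a grid search would look something like this; the call signatures and numbers are hypothetical, not the released tool's actual interface.

```python
# Hypothetical usage of the power-law sketch above in place of a grid search.
n_params = 7e9    # a 7B-parameter model
n_tokens = 1e12   # 1T training tokens
lr = optimal_lr(n_params, n_tokens)
bs = optimal_bs(n_tokens)
print(f"suggested learning rate = {lr:.2e}, batch size ≈ {bs:,.0f} tokens")
```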

Even more impressive is the universality of the Step Law. The team verified its applicability from three perspectives: first, regardless of model shape—whether width-biased, depth-biased, or balanced—the Step Law accurately predicts the optimal hyperparameter region; second, this law applies not only to Dense models but also extends well to MoE models with varying sparsity; and third, the Step Law demonstrates remarkable stability regardless of whether the training data is predominantly English, bilingual English-Chinese, a mix of code and English, or primarily code-based.

The research also reveals directions for improving learning rate scheduling. Rather than the traditional approach of setting the minimum learning rate to one-tenth of the maximum, the team proposes fixing it at a constant value (1e-5). This change keeps parameter update steps at a more reasonable scale in the later stages of training, effectively avoiding persistent oscillation of the loss function during convergence.
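
A minimal sketch of such a schedule follows, assuming a cosine decay shape (the decay curve is an assumption here; the point taken from the research is the fixed 1e-5 floor rather than a max_lr / 10 floor).

```python
import math

def cosine_lr(step: int, total_steps: int, max_lr: float,
              min_lr: float = 1e-5) -> float:
    """Cosine decay from max_lr down to a fixed floor min_lr.

    The fixed floor (1e-5) replaces the common heuristic min_lr = max_lr / 10,
    keeping late-stage update steps small enough to avoid loss oscillation.
    """
    progress = step / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```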

Furthermore, the study found that the optimal hyperparameters identified from smoothed training loss closely match those identified from validation loss. This discovery provides a more economical method for hyperparameter selection: researchers can guide tuning by monitoring the smoothed training loss, without frequent evaluation on a validation set.
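
A common way to smooth a training-loss curve is an exponential moving average, as in the sketch below; the smoothing constant is an assumption for illustration, not a value from the paper.

```python
def ema_smooth(losses, beta: float = 0.99):
    """Exponentially smoothed training loss: a cheap proxy for validation
    loss when comparing hyperparameter settings (beta is illustrative)."""
    smoothed, avg = [], None
    for x in losses:
        avg = x if avg is None else beta * avg + (1.0 - beta) * x
        smoothed.append(avg)
    return smoothed
```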

Despite the significant achievements, the StepStar research team acknowledges this is just the beginning. They plan to gradually open-source the details of the experiments, including the final checkpoints of nearly 4,000 models, for more in-depth analysis and theoretical interpretation by the community. Future research directions include exploring the convexity of the Loss-BS-LR 3D space, improving the fitting method for optimal hyperparameters, explaining changes in the optimal regions under different configurations, and delving deeper into training dynamics under different settings.

Subsequent work in the Predictable Scale series may further address performance prediction for ultra-large models, the scaling properties of Code & Math, and the scaling characteristics of different attention types. This line of research is poised to provide more comprehensive theoretical guidance and practical tools for efficient large language model training, driving AI technology toward greater efficiency and controllability.