In the fiercely competitive field of artificial intelligence, a multi-million dollar large-scale experiment is quietly reshaping how large language models are trained. The StepStar research team recently released groundbreaking research revealing a universal scaling law dubbed "Step Law," derived by training 3,700 models of varying sizes from scratch, consuming nearly 1 million NVIDIA H800 GPU hours and processing a staggering 100 trillion tokens. The result is a new guideline for efficient large language model training.

This research goes beyond hyperparameter optimization alone; it is the first comprehensive study of how stable the optimal hyperparameters remain across different model shapes, sparsity levels, and data distributions. The results demonstrate that Step Law is remarkably robust regardless of model architecture, training data language, or domain, which significantly enhances its practical value.

The team trained 3,700 models encompassing diverse scales, hyperparameter combinations, shapes, data ratios, and sparsity levels, including both MoE and Dense architectures. Through these extensive experiments, they discovered that the optimal learning rate exhibits a power-law relationship with model and data size, while the optimal batch size is primarily correlated with data size. This finding challenges conventional wisdom regarding hyperparameter settings.
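
To make these relationships concrete, here is a minimal Python sketch of the functional forms involved. It is illustrative only: the function names, constants, and exponents are placeholders, not the paper's fitted values.

```python
# Hypothetical power-law forms in the spirit of Step Law. All constants
# and exponents below are illustrative placeholders, not fitted values.
def optimal_lr(n_params: float, n_tokens: float,
               c: float = 1.0, alpha: float = -0.5, beta: float = 0.1) -> float:
    """Optimal learning rate as a power law of model size N and data size D."""
    return c * (n_params ** alpha) * (n_tokens ** beta)

def optimal_bs(n_tokens: float, k: float = 1.0, gamma: float = 0.3) -> float:
    """Optimal batch size (in tokens), depending primarily on data size D."""
    return k * (n_tokens ** gamma)
```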


Experimental data shows that, for a fixed model and data size, the hyperparameter optimization landscape exhibits a clearly convex shape, implying a stable and easily identifiable region of optimal hyperparameters. To verify this, the team constructed a 3D visualization that intuitively shows how learning rate and batch size affect training loss. The results clearly reveal a "valley" shape with a relatively flat floor, providing valuable support for practical hyperparameter tuning.
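
To give a feel for what such a landscape looks like, the sketch below plots a synthetic convex "valley" over learning rate and batch size. The surface is fabricated purely for visualization and is not the team's measured data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic convex loss surface over (log LR, log BS), mimicking the
# reported "valley" shape; coefficients are arbitrary for illustration.
log_lr = np.linspace(-5, -2, 60)   # log10 of learning rate
log_bs = np.linspace(4, 8, 60)     # log10 of batch size (tokens)
LR, BS = np.meshgrid(log_lr, log_bs)
loss = 2.0 + 0.15 * (LR + 3.5) ** 2 + 0.05 * (BS - 6.0) ** 2  # convex bowl

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(LR, BS, loss, cmap="viridis")
ax.set_xlabel("log10(learning rate)")
ax.set_ylabel("log10(batch size)")
ax.set_zlabel("training loss")
plt.show()
```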

To benefit the entire AI community, the team developed and released a general-purpose optimal hyperparameter estimation tool. The tool's predictions differ from the globally optimal hyperparameters obtained through exhaustive search by only 0.09%. This means researchers and engineers can bypass costly grid searches and directly obtain near-optimal hyperparameter configurations.
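
Assuming fitted functions like the sketch above, using such an estimator in place of a grid search would look something like this; the call signatures and numbers are hypothetical, not the released tool's actual interface.

```python
# Hypothetical usage of the power-law sketch above in place of a grid search.
n_params = 7e9    # a 7B-parameter model
n_tokens = 1e12   # 1T training tokens
lr = optimal_lr(n_params, n_tokens)
bs = optimal_bs(n_tokens)
print(f"suggested learning rate = {lr:.2e}, batch size ≈ {bs:,.0f} tokens")
```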

Even more impressive is the universality of the Step Law. The team verified its applicability from three perspectives: first, regardless of model shape—whether width-biased, depth-biased, or balanced—the Step Law accurately predicts the optimal hyperparameter region; second, this law applies not only to Dense models but also extends well to MoE models with varying sparsity; and third, the Step Law demonstrates remarkable stability regardless of whether the training data is predominantly English, bilingual English-Chinese, a mix of code and English, or primarily code-based.

The research also reveals directions for improving learning rate scheduling. Rather than the traditional approach of setting the minimum learning rate to one-tenth of the maximum, the team proposes fixing it at a constant value (1e-5). This change keeps parameter update steps at a more reasonable scale in the later stages of training, effectively avoiding persistent oscillation of the loss function during convergence.
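
A minimal sketch of such a schedule follows, assuming a cosine decay shape (the decay curve is an assumption here; the point taken from the research is the fixed 1e-5 floor rather than a max_lr / 10 floor).

```python
import math

def cosine_lr(step: int, total_steps: int, max_lr: float,
              min_lr: float = 1e-5) -> float:
    """Cosine decay from max_lr down to a fixed floor min_lr.

    The fixed floor (1e-5) replaces the common heuristic min_lr = max_lr / 10,
    keeping late-stage update steps small enough to avoid loss oscillation.
    """
    progress = step / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```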

Furthermore, the study found that the optimal hyperparameters identified from smoothed training loss closely match those identified from validation loss. This discovery provides a more economical method for hyperparameter selection: researchers can guide tuning by monitoring the smoothed training loss, without frequent evaluation on a validation set.
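
A common way to smooth a training-loss curve is an exponential moving average, as in the sketch below; the smoothing constant is an assumption for illustration, not a value from the paper.

```python
def ema_smooth(losses, beta: float = 0.99):
    """Exponentially smoothed training loss: a cheap proxy for validation
    loss when comparing hyperparameter settings (beta is illustrative)."""
    smoothed, avg = [], None
    for x in losses:
        avg = x if avg is None else beta * avg + (1.0 - beta) * x
        smoothed.append(avg)
    return smoothed
```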

Despite the significant achievements, the StepStar research team acknowledges this is just the beginning. They plan to gradually open-source the details of the experiments, including the final checkpoints of nearly 4,000 models, for more in-depth analysis and theoretical interpretation by the community. Future research directions include exploring the convexity of the Loss-BS-LR 3D space, improving the fitting method for optimal hyperparameters, explaining changes in the optimal regions under different configurations, and delving deeper into training dynamics under different settings.

Subsequent work in the Predictable Scale series may further address performance prediction for ultra-large models, the scaling properties of Code & Math, and the scaling characteristics of different attention types. This line of research is poised to provide more comprehensive theoretical guidance and practical tools for efficient large language model training, driving AI technology toward greater efficiency and controllability.