In the field of artificial intelligence, training large language models (LLMs) has become an important direction for driving technological advancement. However, as models and datasets continue to grow in scale, traditional optimization methods, especially AdamW, are revealing their limitations. Researchers face a series of challenges, including high computational costs, unstable training, vanishing or exploding gradients, inconsistent updates across parameter matrices, and heavy resource demands in distributed environments. There is therefore an urgent need for more efficient and stable optimization techniques to address these complexities.

To tackle these challenges, Moonshot AI has collaborated with the University of California, Los Angeles (UCLA) to develop Moonlight, a Mixture-of-Experts (MoE) model trained with the Muon optimizer. Moonlight has 16 billion total parameters, of which 3 billion are activated per token, and was trained on 5.7 trillion tokens. The innovation of the Muon optimizer lies in its use of the Newton-Schulz iteration for matrix orthogonalization, which keeps gradient updates uniform across the model's parameter space. This makes it a promising alternative to traditional AdamW, improving both training efficiency and stability.
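
For context on how this orthogonalization works, below is a minimal PyTorch sketch of a Newton-Schulz iteration that approximately orthogonalizes a gradient (or momentum) matrix. The quintic coefficients, step count, and function name are illustrative assumptions drawn from publicly shared Muon implementations, not values taken from the Moonlight paper.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D matrix G to the nearest semi-orthogonal matrix.

    The quintic coefficients and step count follow commonly shared Muon
    implementations and are illustrative, not Moonlight's exact values.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + 1e-7)        # keep the spectral norm <= 1 so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                   # iterate on the "wide" orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X
```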

On a technical level, the Moonlight team made two key adjustments to the Muon optimizer. First, they introduced weight decay to keep weights from growing unchecked when training large models on large token counts. Second, they calibrated the update magnitude for each parameter matrix, scaling it by the square root of the matrix's larger dimension so that updates remain consistent across matrices of different shapes.
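
The following is a minimal sketch of what the resulting update step could look like, reusing the `newton_schulz_orthogonalize` helper above. The 0.2 matching constant, the hyperparameter defaults, and the function name are assumptions for illustration; the article only states that updates are scaled by the square root of the matrix's larger dimension and regularized with weight decay.

```python
import math
import torch

def muon_style_update(weight: torch.Tensor,
                      grad: torch.Tensor,
                      momentum_buf: torch.Tensor,
                      lr: float = 2e-2,
                      momentum: float = 0.95,
                      weight_decay: float = 0.1) -> None:
    """One illustrative Muon-style step for a single 2D weight matrix.

    Hypothetical helper; the hyperparameters are placeholders, not
    Moonlight's published settings.
    """
    # Accumulate momentum, then orthogonalize the resulting search direction
    # with the Newton-Schulz routine sketched earlier.
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf).to(weight.dtype)

    # Scale by sqrt(max dimension) so the per-element update magnitude stays
    # consistent across matrices of different shapes; the 0.2 constant that
    # roughly matches AdamW's update RMS is an assumption, not from the article.
    scale = 0.2 * math.sqrt(max(weight.size(0), weight.size(1)))

    # Decoupled weight decay, applied the same way AdamW does.
    weight.mul_(1 - lr * weight_decay)
    weight.add_(update, alpha=-lr * scale)
```

In practice, Muon-style updates of this kind are usually applied only to 2D weight matrices, with embeddings, biases, and other non-matrix parameters handled by AdamW; this reflects common Muon usage rather than details given in the article.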

In empirical evaluations of Moonlight, researchers found that even at intermediate checkpoints it outperformed models trained with traditional AdamW. For instance, in language understanding tasks, Moonlight achieved higher scores on the MMLU benchmark. In code generation tasks, the improvement was even more pronounced, indicating that Muon's optimization mechanism contributes positively to downstream task performance.

The success of the Moonlight project is poised to set new standards for training large language models. The open-source release of the Muon optimizer, together with the pre-trained models and intermediate checkpoints, is expected to spur further research into scalable optimization techniques.

GitHub: https://github.com/MoonshotAI/Moonlight?tab=readme-ov-file

Hugging Face: https://huggingface.co/moonshotai/Moonlight-16B-A3B

Paper: https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf

Key Points:  

🌟 The Moonlight model is a Mixture-of-Experts model jointly developed by Moonshot AI and UCLA, with 16 billion total parameters of which 3 billion are activated per token, trained on 5.7 trillion tokens.  

⚙️ The Muon optimizer significantly improves the efficiency and stability of training large models through the Newton-Schulz iteration method and weight decay techniques.  

📈 Empirical results show that Moonlight outperforms models trained with traditional AdamW across multiple tasks, demonstrating stronger language understanding and code generation capabilities.