In the field of artificial intelligence, training large language models (LLMs) has become an important direction for driving technological advancement. However, as models and datasets continue to grow in scale, traditional optimization methods, especially AdamW, are revealing their limitations. Researchers face a series of challenges, including high computational costs, unstable training, vanishing or exploding gradients, inconsistent updates across parameter matrices, and heavy resource demands in distributed environments. There is therefore an urgent need for more efficient and stable optimization techniques to address these complexities.

To tackle these challenges, Moonshot AI has collaborated with the University of California, Los Angeles (UCLA) to develop Moonlight, a Mixture-of-Experts (MoE) model trained with the Muon optimizer. Moonlight has 16 billion total parameters, of which 3 billion are activated per token, and was trained on 5.7 trillion tokens. The innovation of the Muon optimizer lies in its use of the Newton-Schulz iteration for matrix orthogonalization, which keeps gradient updates uniform across the model's parameter space. This makes it a promising alternative to traditional AdamW, improving both training efficiency and stability.
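
For context on how this orthogonalization works, below is a minimal PyTorch sketch of a Newton-Schulz iteration that approximately orthogonalizes a gradient (or momentum) matrix. The quintic coefficients, step count, and function name are illustrative assumptions drawn from publicly shared Muon implementations, not values taken from the Moonlight paper.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D matrix G to the nearest semi-orthogonal matrix.

    The quintic coefficients and step count follow commonly shared Muon
    implementations and are illustrative, not Moonlight's exact values.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + 1e-7)        # keep the spectral norm <= 1 so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                   # iterate on the "wide" orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X
```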

On a technical level, the Moonlight team made two key adjustments to the Muon optimizer. First, they introduced weight decay to keep weights from growing unchecked when training large models on large token counts. Second, they calibrated the update magnitude for each parameter matrix, scaling it by the square root of the matrix's larger dimension so that updates remain consistent across matrices of different shapes.
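
The following is a minimal sketch of what the resulting update step could look like, reusing the `newton_schulz_orthogonalize` helper above. The 0.2 matching constant, the hyperparameter defaults, and the function name are assumptions for illustration; the article only states that updates are scaled by the square root of the matrix's larger dimension and regularized with weight decay.

```python
import math
import torch

def muon_style_update(weight: torch.Tensor,
                      grad: torch.Tensor,
                      momentum_buf: torch.Tensor,
                      lr: float = 2e-2,
                      momentum: float = 0.95,
                      weight_decay: float = 0.1) -> None:
    """One illustrative Muon-style step for a single 2D weight matrix.

    Hypothetical helper; the hyperparameters are placeholders, not
    Moonlight's published settings.
    """
    # Accumulate momentum, then orthogonalize the resulting search direction
    # with the Newton-Schulz routine sketched earlier.
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf).to(weight.dtype)

    # Scale by sqrt(max dimension) so the per-element update magnitude stays
    # consistent across matrices of different shapes; the 0.2 constant that
    # roughly matches AdamW's update RMS is an assumption, not from the article.
    scale = 0.2 * math.sqrt(max(weight.size(0), weight.size(1)))

    # Decoupled weight decay, applied the same way AdamW does.
    weight.mul_(1 - lr * weight_decay)
    weight.add_(update, alpha=-lr * scale)
```

In practice, Muon-style updates of this kind are usually applied only to 2D weight matrices, with embeddings, biases, and other non-matrix parameters handled by AdamW; this reflects common Muon usage rather than details given in the article.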

In empirical evaluations of Moonlight, researchers found that even at intermediate checkpoints it outperformed models trained with traditional AdamW. For instance, in language understanding tasks, Moonlight achieved higher scores on the MMLU benchmark. In code generation tasks, the improvement was even more pronounced, indicating that Muon's optimization mechanism contributes positively to downstream task performance.

The success of the Moonlight project is poised to set new standards for training large language models. The open-source release of the Muon optimizer, together with the pre-trained models and intermediate checkpoints, is expected to spur further research into scalable optimization techniques.

GitHub: https://github.com/MoonshotAI/Moonlight?tab=readme-ov-file

Hugging Face: https://huggingface.co/moonshotai/Moonlight-16B-A3B

Paper: https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf

Key Points:  

🌟 The Moonlight model is a Mixture-of-Experts model jointly developed by Moonshot AI and UCLA, with 16 billion total parameters of which 3 billion are activated per token, trained on 5.7 trillion tokens.  

⚙️ The Muon optimizer significantly improves the efficiency and stability of training large models through the Newton-Schulz iteration method and weight decay techniques.  

📈 Empirical results show that Moonlight outperforms models trained with traditional AdamW across multiple tasks, demonstrating stronger language understanding and code generation capabilities.