Moonlight is a 16B-parameter Mixture-of-Experts (MoE) model trained with the Muon optimizer, demonstrating that Muon scales effectively to large-scale training. Two modifications to Muon, adding weight decay and adjusting the per-parameter update scale, markedly improve training efficiency and stability. Moonlight outperforms comparable models on a range of benchmarks while requiring substantially less training compute. The open-source implementation and pre-trained checkpoints give researchers and developers a practical foundation for natural language processing tasks such as text generation and code generation.
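
To make the two optimizer tweaks mentioned above concrete, here is a minimal sketch of a Muon-style update for a single 2D weight matrix, with decoupled weight decay and a shape-dependent scale applied to the orthogonalized update. The Newton-Schulz coefficients follow commonly published open-source Muon implementations, and the `0.2` scaling constant (chosen to roughly match a typical AdamW update RMS) is an illustrative assumption, not necessarily Moonlight's exact code.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix via Newton-Schulz iteration.

    Coefficients follow the quintic variant used in public Muon implementations.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=2e-2, beta=0.95, weight_decay=0.1):
    """One Muon-style update for a 2D weight matrix, with the two tweaks described
    above: decoupled weight decay and a shape-dependent update scale (assumed form)."""
    momentum_buf.mul_(beta).add_(grad)                   # momentum accumulation
    update = newton_schulz(grad + beta * momentum_buf)   # Nesterov-style blend, then orthogonalize
    # Shape-dependent scale so the update RMS is comparable across differently
    # sized matrices; the 0.2 constant is an assumption matching AdamW-like RMS.
    scale = 0.2 * max(param.size(0), param.size(1)) ** 0.5
    param.mul_(1.0 - lr * weight_decay)                  # decoupled weight decay
    param.add_(update, alpha=-lr * scale)

# Illustrative usage on a toy weight matrix.
W = torch.randn(256, 512) * 0.02
buf = torch.zeros_like(W)
grad = torch.randn_like(W)
muon_step(W, grad, buf)
```

The intuition behind the two changes: weight decay keeps weight norms from growing unchecked over long training runs, while the shape-dependent scale lets a single learning rate behave consistently across matrices of different sizes.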