Ant Group's Ling team recently published a technical paper, "Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs," on the preprint platform arXiv. The paper introduces two new Mixture-of-Experts (MoE) large language models, Ling-Lite and Ling-Plus, whose design incorporates several innovations that enable efficient training on lower-performance hardware and significantly reduce costs.

Ling-Lite has 16.8 billion total parameters, of which 2.75 billion are activated per token. The larger model, Ling-Plus, has 290 billion total parameters and 28.8 billion activated parameters. Both models achieve performance on par with industry leaders; notably, the roughly 300-billion-parameter Ling-Plus MoE model, trained on lower-performance, domestically produced GPUs, performs comparably to models trained on high-end Nvidia chips.
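
To see how an MoE model can carry hundreds of billions of total parameters while only a fraction of them are active for any given token, the rough sketch below does the parameter accounting for a generic sparse MoE transformer. The layer sizes, expert counts, and top-k value are illustrative placeholders, not the published Ling-Plus configuration.

```python
# Rough MoE parameter accounting. All sizes below are hypothetical placeholders,
# chosen only to illustrate the total-vs-activated gap, not Ling-Plus internals.
def moe_param_counts(n_layers, d_model, d_ff, n_experts, top_k, shared_params):
    """Return (total, activated) parameter counts for a sparse MoE transformer.

    Each expert is a feed-forward block with roughly 2 * d_model * d_ff weights.
    Every layer stores all n_experts experts, but the router activates only
    top_k of them per token, so far fewer parameters are used per forward pass.
    """
    expert_params = 2 * d_model * d_ff
    total = shared_params + n_layers * n_experts * expert_params
    activated = shared_params + n_layers * top_k * expert_params
    return total, activated

total, activated = moe_param_counts(
    n_layers=64, d_model=8192, d_ff=4096,
    n_experts=64, top_k=4, shared_params=10e9,   # shared: attention, embeddings, router
)
print(f"total ≈ {total / 1e9:.0f}B, activated ≈ {activated / 1e9:.0f}B")
```

With these placeholder numbers the script prints a total of roughly 285B parameters against roughly 27B activated, the same order-of-magnitude gap reported for Ling-Plus.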

Training MoE models has traditionally required expensive high-performance GPUs such as Nvidia's H100 and H800. This is not only costly but also constrained by chip shortages, limiting such models' use in resource-constrained environments. To address this, Ant Group's Ling team set themselves the goal of scaling models without high-end GPUs, working around resource and budget limitations. Their training strategies include dynamic parameter allocation, mixed-precision scheduling, and an upgraded mechanism for handling training exceptions. Together, these strategies shorten the response time to interruptions, streamline model evaluation, and compress the verification cycle by more than 50%.
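
The paper's exact mechanisms are not detailed in this article, but the general idea behind mixed-precision training combined with automated exception handling can be sketched in a few lines. The following is a minimal sketch assuming a PyTorch-style setup; the checkpoint path, the assumption that the model returns its own loss, and the retry logic are hypothetical illustrations, not Ant's implementation.

```python
# Minimal sketch: mixed-precision training loop with automatic recovery from
# runtime failures. Assumes a PyTorch setup; NOT Ant's actual implementation.
import os
import torch

CKPT_PATH = "latest.ckpt"   # hypothetical checkpoint location

def save_ckpt(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_ckpt(model, optimizer):
    """Restore the last saved state; start from step 0 if no checkpoint exists."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]

def train(model, optimizer, data_loader, total_steps, ckpt_every=1000):
    scaler = torch.cuda.amp.GradScaler()     # loss scaling for mixed precision
    step = load_ckpt(model, optimizer)       # resume from the last good state
    data_iter = iter(data_loader)
    while step < total_steps:
        try:
            batch = next(data_iter)
            with torch.cuda.amp.autocast():  # forward pass in reduced precision
                loss = model(batch)          # assumes the model returns its loss
            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            step += 1
            if step % ckpt_every == 0:
                save_ckpt(model, optimizer, step)
        except RuntimeError as err:          # e.g. a device fault on one worker
            print(f"step {step} failed ({err}); restarting from last checkpoint")
            step = load_ckpt(model, optimizer)
            data_iter = iter(data_loader)
```

In a large multi-GPU run, how quickly training resumes after a fault directly affects both cost and throughput, which is why shortening the interruption response time matters.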

In their experiments, the Ling team pre-trained Ling-Plus on 9 trillion tokens. The results show that training on 1 trillion tokens costs approximately 6.35 million RMB on high-performance hardware, whereas Ant's optimized methods on lower-spec hardware bring the cost down to approximately 5.08 million RMB, a saving of nearly 20%. The resulting model performs comparably to Alibaba's Tongyi Qwen2.5-72B-Instruct and DeepSeek-V2.5-1210-Chat.
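
As a quick sanity check, the roughly 20% saving follows directly from the two per-trillion-token cost estimates quoted above:

```python
# Reported per-trillion-token training cost estimates (RMB).
cost_high_end = 6.35e6   # high-performance hardware
cost_low_spec = 5.08e6   # lower-spec hardware with Ant's optimizations

saving = (cost_high_end - cost_low_spec) / cost_high_end
print(f"relative saving ≈ {saving:.1%}")   # prints: relative saving ≈ 20.0%
```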

If widely adopted, this technological achievement would provide a more cost-effective option for domestic large models, reducing reliance on Nvidia chips and opening new paths for future AI development.