A powerful new tool for AI computing! Moore Threads today announced the open-sourcing of two major AI frameworks, MT-MegatronLM and MT-TransformerEngine, a move set to significantly strengthen domestic computing infrastructure. The frameworks deeply integrate FP8 mixed-precision training strategies with high-performance operator libraries, enabling mixed parallel training and inference on domestic general-purpose GPUs and markedly improving the efficiency and stability of large model training.

Moore Threads' open-sourced MT-MegatronLM framework is designed for general-purpose GPUs and supports efficient training of dense, multimodal, and MoE (Mixture of Experts) models, covering the diverse training needs of today's AI field. MT-TransformerEngine focuses on optimizing Transformer training and inference: through operator fusion and parallel acceleration strategies, it unleashes the high-density compute of Moore Threads' general-purpose GPUs and significantly improves the efficiency of memory-bound operators.
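
To make the operator-fusion idea concrete, here is a minimal PyTorch sketch of fusing a memory-bound elementwise chain (bias add followed by GELU) with torch.compile. It illustrates the general technique only and does not use MT-TransformerEngine's actual API.

```python
import torch

# Unfused version: the bias add and the GELU each make a full round trip
# through device memory, so the sequence is memory-bandwidth bound.
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x + bias)

# torch.compile can fuse the elementwise chain into a single kernel,
# roughly halving the global-memory traffic for this pattern.
# (Generic PyTorch illustration, not MT-TransformerEngine code.)
bias_gelu_fused = torch.compile(bias_gelu)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
bias = torch.randn(4096, device=device)
out = bias_gelu_fused(x, bias)
```

The gain comes from cutting the number of memory round trips, which is exactly where memory-bound operators spend their time.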

The technological breakthroughs of these two frameworks stem from the deep synergy between hardware adaptation and algorithmic innovation. First, they support mixed parallel training for a variety of model types, flexibly handling the complex computation patterns of different model architectures. Second, combined with the FP8 mixed-precision training natively supported by Moore Threads GPUs, training efficiency is significantly improved. Third, deep integration with the high-performance operator library muDNN and the communication library MCCL systematically optimizes compute-intensive tasks and the communication overhead of multi-GPU collaboration. In addition, integration with the open-source Simumax library enables automatic parallel-strategy search, maximizing parallel training performance across different models and acceleration environments. Furthermore, a built-in rewind exception-recovery mechanism automatically rolls back to the most recent stable checkpoint and resumes training, significantly improving the stability of large-scale runs. Finally, both frameworks are compatible with the mainstream GPU ecosystem, enabling smooth migration of existing workloads and giving developers a foundation on which to build their own AI technology stacks.
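
As an illustration of the rewind idea, the following is a minimal, generic PyTorch sketch of rolling a training loop back to the most recent stable point after a failure. The helper names are hypothetical and do not reflect MT-MegatronLM's actual implementation.

```python
import torch

# Hypothetical helpers; MT-MegatronLM's real rewind mechanism is not shown here.
def save_stable_point(model, optimizer, step, path="stable_point.pt"):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, path)

def rewind(model, optimizer, path="stable_point.pt"):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
step, max_steps, save_every = 0, 100, 10

while step < max_steps:
    try:
        if step % save_every == 0:
            save_stable_point(model, optimizer, step)
        x = torch.randn(8, 16)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
    except RuntimeError:
        # On a failure (e.g. a crashed kernel or a NaN blow-up), roll back to
        # the most recent stable point and resume instead of restarting the job.
        step = rewind(model, optimizer)
```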

In practical applications, the performance of the two frameworks is impressive. On a general-purpose GPU cluster, a Llama3-8B training task using FP8 reaches an MFU (model FLOPs utilization) of over 90%, with virtually no loss degradation and a 28% increase in training speed compared to previous methods. In addition, Moore Threads has deeply integrated and open-sourced efficient support for DeepSeek's DualPipe parallel algorithm. With MT-DualPipe fully integrated into MT-MegatronLM and MT-TransformerEngine, the DeepSeek V3 training flow can be fully reproduced, with support for MLA (Multi-head Latent Attention), MTP (Multi-Token Prediction), and various expert load-balancing strategies. Through a range of Transformer operator fusion techniques, the frameworks significantly improve memory bandwidth utilization, easing memory-bound bottlenecks and further unleashing the hardware potential of domestic GPUs.
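
For readers who want to sanity-check an MFU figure, the sketch below shows the standard back-of-the-envelope calculation for dense transformer training. All concrete numbers in it are placeholders, not measured Moore Threads results.

```python
# Back-of-the-envelope MFU (model FLOPs utilization) check for dense
# transformer training. All concrete numbers are placeholders, not
# measured Moore Threads figures.

def mfu(n_params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    # Common approximation: ~6 FLOPs per parameter per trained token
    # (forward + backward) for a dense decoder-only model.
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# Hypothetical example: an 8B-parameter model on a 32-GPU cluster.
print(f"MFU: {mfu(8e9, 5.0e5, 32, 9.0e14):.1%}")   # -> MFU: 83.3%
```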

Moore Threads stated that it will continue to optimize the two frameworks and plans to introduce a series of new features: DualPipe/ZeroBubble parallel strategies to further reduce pipeline bubble rate and improve parallel training efficiency; additional FP8 optimization strategies to improve training performance and stability; asynchronous checkpointing to improve fault tolerance and efficiency during training; optimized recomputation strategies to reduce compute and memory overhead and increase training speed; new fault-tolerant training algorithms to further harden long-running jobs; and integration of the Moore Threads FlashMLA and DeepGEMM libraries to further unleash the compute and FP8 capabilities of Moore Threads GPUs, comprehensively improving computing performance and efficiency.
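
As a rough illustration of one of the planned items, the asynchronous checkpoint idea can be sketched in a few lines of generic PyTorch: snapshot the state on the training thread, then write it from a background thread so training is not blocked by storage I/O. The helper below is hypothetical and unrelated to the frameworks' actual implementation.

```python
import copy
import threading
import torch

# Minimal sketch of asynchronous checkpointing: take an in-memory snapshot
# on the training thread, then persist it from a background thread so
# storage I/O does not stall the training loop.
def async_checkpoint(model, optimizer, step, path):
    snapshot = {
        "model": copy.deepcopy(model.state_dict()),
        "optim": copy.deepcopy(optimizer.state_dict()),
        "step": step,
    }
    writer = threading.Thread(target=torch.save, args=(snapshot, path))
    writer.start()
    return writer  # caller should join() before taking the next snapshot

# Usage sketch
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
writer = async_checkpoint(model, optimizer, step=0, path="ckpt_step0.pt")
writer.join()
```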

These technological breakthroughs and open-source initiatives not only demonstrate Moore Threads' strength in the AI computing field but also open up new possibilities for the development of domestic AI infrastructure. We look forward to seeing more breakthroughs in AI model training from them.