A powerful new tool for AI computing! Moore Threads today announced the open-sourcing of two major AI frameworks, MT-MegatronLM and MT-TransformerEngine, a move set to significantly strengthen domestic computing infrastructure. The frameworks deeply integrate FP8 mixed-precision training strategies with high-performance operator libraries, enabling mixed parallel training and inference on domestic general-purpose GPUs and markedly improving the efficiency and stability of large model training.

Moore Threads' open-sourced MT-MegatronLM framework is designed for general-purpose GPUs and supports efficient training of dense, multimodal, and MoE (Mixture of Experts) models, covering the diverse training needs of today's AI field. MT-TransformerEngine focuses on optimizing Transformer training and inference: through operator fusion and parallel acceleration strategies, it unleashes the high-density compute of Moore Threads' general-purpose GPUs and significantly improves the efficiency of memory-bound operators.
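
To make the operator-fusion idea concrete, here is a minimal PyTorch sketch of fusing a memory-bound elementwise chain (bias add followed by GELU) with torch.compile. It illustrates the general technique only and does not use MT-TransformerEngine's actual API.

```python
import torch

# Unfused version: the bias add and the GELU each make a full round trip
# through device memory, so the sequence is memory-bandwidth bound.
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x + bias)

# torch.compile can fuse the elementwise chain into a single kernel,
# roughly halving the global-memory traffic for this pattern.
# (Generic PyTorch illustration, not MT-TransformerEngine code.)
bias_gelu_fused = torch.compile(bias_gelu)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
bias = torch.randn(4096, device=device)
out = bias_gelu_fused(x, bias)
```

The gain comes from cutting the number of memory round trips, which is exactly where memory-bound operators spend their time.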

The technological breakthroughs of these two frameworks stem from the deep synergy between hardware adaptation and algorithmic innovation. First, they support mixed parallel training for a variety of model types, flexibly handling the complex computation patterns of different model architectures. Second, combined with the FP8 mixed-precision training natively supported by Moore Threads GPUs, training efficiency is significantly improved. Third, deep integration with the high-performance operator library muDNN and the communication library MCCL systematically optimizes compute-intensive tasks and the communication overhead of multi-GPU collaboration. In addition, integration with the open-source Simumax library enables automatic parallel-strategy search, maximizing parallel training performance across different models and acceleration environments. Furthermore, a built-in rewind exception-recovery mechanism automatically rolls back to the most recent stable checkpoint and resumes training, significantly improving the stability of large-scale runs. Finally, both frameworks are compatible with the mainstream GPU ecosystem, enabling smooth migration of existing workloads and giving developers a foundation on which to build their own AI technology stacks.
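
As an illustration of the rewind idea, the following is a minimal, generic PyTorch sketch of rolling a training loop back to the most recent stable point after a failure. The helper names are hypothetical and do not reflect MT-MegatronLM's actual implementation.

```python
import torch

# Hypothetical helpers; MT-MegatronLM's real rewind mechanism is not shown here.
def save_stable_point(model, optimizer, step, path="stable_point.pt"):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, path)

def rewind(model, optimizer, path="stable_point.pt"):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
step, max_steps, save_every = 0, 100, 10

while step < max_steps:
    try:
        if step % save_every == 0:
            save_stable_point(model, optimizer, step)
        x = torch.randn(8, 16)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
    except RuntimeError:
        # On a failure (e.g. a crashed kernel or a NaN blow-up), roll back to
        # the most recent stable point and resume instead of restarting the job.
        step = rewind(model, optimizer)
```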

In practical applications, the performance of the two frameworks is impressive. On a general-purpose GPU cluster, a Llama3-8B training task using FP8 reaches an MFU (model FLOPs utilization) of over 90%, with virtually no loss degradation and a 28% increase in training speed compared to previous methods. In addition, Moore Threads has deeply integrated and open-sourced efficient support for DeepSeek's DualPipe parallel algorithm. With MT-DualPipe fully integrated into MT-MegatronLM and MT-TransformerEngine, the DeepSeek V3 training flow can be fully reproduced, with support for MLA (Multi-head Latent Attention), MTP (Multi-Token Prediction), and various expert load-balancing strategies. Through a range of Transformer operator fusion techniques, the frameworks significantly improve memory bandwidth utilization, easing memory-bound bottlenecks and further unleashing the hardware potential of domestic GPUs.
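
For readers who want to sanity-check an MFU figure, the sketch below shows the standard back-of-the-envelope calculation for dense transformer training. All concrete numbers in it are placeholders, not measured Moore Threads results.

```python
# Back-of-the-envelope MFU (model FLOPs utilization) check for dense
# transformer training. All concrete numbers are placeholders, not
# measured Moore Threads figures.

def mfu(n_params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    # Common approximation: ~6 FLOPs per parameter per trained token
    # (forward + backward) for a dense decoder-only model.
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# Hypothetical example: an 8B-parameter model on a 32-GPU cluster.
print(f"MFU: {mfu(8e9, 5.0e5, 32, 9.0e14):.1%}")   # -> MFU: 83.3%
```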

Moore Threads stated that it will continue to optimize the two frameworks and plans to introduce a series of new features: DualPipe/ZeroBubble parallel strategies to further reduce pipeline bubble rate and improve parallel training efficiency; additional FP8 optimization strategies to improve training performance and stability; asynchronous checkpointing to improve fault tolerance and efficiency during training; optimized recomputation strategies to reduce compute and memory overhead and increase training speed; new fault-tolerant training algorithms to further harden long-running jobs; and integration of the Moore Threads FlashMLA and DeepGEMM libraries to further unleash the compute and FP8 capabilities of Moore Threads GPUs, comprehensively improving computing performance and efficiency.
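
As a rough illustration of one of the planned items, the asynchronous checkpoint idea can be sketched in a few lines of generic PyTorch: snapshot the state on the training thread, then write it from a background thread so training is not blocked by storage I/O. The helper below is hypothetical and unrelated to the frameworks' actual implementation.

```python
import copy
import threading
import torch

# Minimal sketch of asynchronous checkpointing: take an in-memory snapshot
# on the training thread, then persist it from a background thread so
# storage I/O does not stall the training loop.
def async_checkpoint(model, optimizer, step, path):
    snapshot = {
        "model": copy.deepcopy(model.state_dict()),
        "optim": copy.deepcopy(optimizer.state_dict()),
        "step": step,
    }
    writer = threading.Thread(target=torch.save, args=(snapshot, path))
    writer.start()
    return writer  # caller should join() before taking the next snapshot

# Usage sketch
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
writer = async_checkpoint(model, optimizer, step=0, path="ckpt_step0.pt")
writer.join()
```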

These technological breakthroughs and open-source initiatives not only demonstrate Moore Threads' strength in the AI computing field but also open up new possibilities for the development of domestic AI infrastructure. We look forward to seeing more breakthroughs in AI model training from them.