Tencent Cloud recently launched Xing Mai Network 2.0, an upgraded version of its high-performance network aimed at improving the efficiency of large-scale model training. In the previous version, the time spent synchronizing computation results between GPUs could account for over 50% of total training time, leaving expensive compute idle. Xing Mai Network 2.0 brings upgrades in several areas:
1. A single cluster now supports networking of up to 100,000 GPUs, double the previous scale, with a 60% improvement in network communication efficiency and a 20% increase in large model training efficiency. Fault localization time drops from days to minutes.
2. Self-developed switches, optical modules, network interface cards, and other network devices have been upgraded, making the infrastructure more reliable and able to support single clusters of more than 100,000 GPUs.
3. The new communication protocol TiTa 2.0 is deployed on the network interface cards, and its congestion control algorithm is upgraded to active congestion control, which adjusts sending rates before congestion builds up rather than reacting after packets are lost (see the first sketch after this list). This yields a 30% improvement in communication efficiency and a 10% increase in large model training efficiency.
4. The high-performance collective communication library TCCL 2.0 uses NVLINK+NET heterogeneous parallel communication to move data over intra-node NVLink and the inter-node network simultaneously (see the second sketch after this list), and adds the Auto-Tune Network Expert adaptive algorithm. Together these enhance communication performance by 30% and large model training efficiency by 10%.
5. A new Tencent-exclusive technology, the Virtual Reality Simulation Platform, enables comprehensive monitoring of the cluster network and precise identification of faulty GPU nodes, reducing fault localization time for 10K-GPU training jobs from days to minutes (see the third sketch after this list).
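Tencent has not published TiTa 2.0's internals, but the core idea of active congestion control can be illustrated with an RTT-gradient rate controller in the spirit of published algorithms such as TIMELY: the sender backs off as soon as round-trip times start rising, before queues overflow and packets drop. The class name, parameters, and constants below are illustrative assumptions, not TiTa's actual design.

```python
# A minimal sketch of proactive, rate-based congestion control driven by
# the RTT gradient. This is NOT TiTa 2.0's real algorithm (which is not
# public); it only illustrates reacting *before* packet loss occurs.

class ProactiveRateController:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps        # current sending rate (Gbps)
        self.max_rate = line_rate_gbps
        self.prev_rtt_us = None           # previous RTT sample (microseconds)
        self.additive_step = 0.5          # Gbps to add when the path looks clear
        self.beta = 0.8                   # multiplicative backoff factor

    def on_rtt_sample(self, rtt_us: float) -> float:
        """Update the sending rate from a new RTT sample and return it."""
        if self.prev_rtt_us is not None:
            gradient = rtt_us - self.prev_rtt_us
            if gradient > 0:
                # RTT rising: queues are building, so back off pre-emptively.
                self.rate *= self.beta
            else:
                # RTT flat or falling: additively probe for more bandwidth.
                self.rate = min(self.rate + self.additive_step, self.max_rate)
        self.prev_rtt_us = rtt_us
        return self.rate

ctrl = ProactiveRateController(line_rate_gbps=400.0)
for rtt in (10.0, 10.1, 10.4, 10.2, 10.0):    # microseconds
    print(round(ctrl.on_rtt_sample(rtt), 1))  # 400.0, 320.0, 256.0, 256.5, 257.0
```

Because a controller like this runs on the NIC and keys off latency rather than drops, it can throttle flows a round-trip earlier than loss-based schemes, which is the behavior the "active" label describes.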
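The NVLINK+NET heterogeneous parallelism described in item 4 amounts to shipping different shards of the same payload over intra-node NVLink and the inter-node NIC at the same time, instead of leaving one channel idle while the other works. The sketch below mimics that overlap with two threads; the channel functions and the 70/30 split ratio are hypothetical placeholders, since TCCL 2.0's actual API is not public.

```python
# A minimal sketch of heterogeneous parallel transfer: one payload is split
# and pushed over two channels concurrently. The channel functions stand in
# for NVLink copies and RDMA sends; they are assumptions, not TCCL 2.0 APIs.
from concurrent.futures import ThreadPoolExecutor

def send_via_nvlink(shard: bytes) -> int:
    # Placeholder for an intra-node NVLink copy (e.g., a peer-to-peer memcpy).
    return len(shard)

def send_via_net(shard: bytes) -> int:
    # Placeholder for an RDMA send over the network interface card.
    return len(shard)

def parallel_transfer(payload: bytes, nvlink_fraction: float = 0.7) -> int:
    """Ship one payload over both channels at once; return bytes moved."""
    cut = int(len(payload) * nvlink_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        over_nvlink = pool.submit(send_via_nvlink, payload[:cut])
        over_net = pool.submit(send_via_net, payload[cut:])
        return over_nvlink.result() + over_net.result()

assert parallel_transfer(b"x" * 1000) == 1000
```

In a real library the split ratio would be tuned to each channel's measured bandwidth; presumably that is the kind of knob the Auto-Tune Network Expert algorithm adjusts automatically.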
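The announcement does not explain how the simulation platform pinpoints faulty GPU nodes, but minute-level localization at 10K-GPU scale implies automatically correlating per-node telemetry against a fleet-wide baseline rather than bisecting the job by hand. A toy version of that idea, with an invented metric and an invented threshold:

```python
# A toy fault localizer: flag nodes whose error counters deviate sharply
# from the cluster median. The metric and threshold are illustrative
# assumptions; the real platform's data model is not public.
from statistics import median

def locate_suspect_nodes(telemetry: dict[str, int], factor: float = 10.0) -> list[str]:
    """telemetry maps node id -> NIC error count over the last interval."""
    baseline = max(median(telemetry.values()), 1)  # avoid an all-zero baseline
    return [node for node, errs in telemetry.items() if errs > factor * baseline]

print(locate_suspect_nodes({"gpu-001": 2, "gpu-002": 3, "gpu-003": 480}))
# -> ['gpu-003']
```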
With these upgrades, Xing Mai Network's communication efficiency has increased by 60% and large model training efficiency by 20%, while fault localization has become faster and more accurate. Together these improvements raise the efficiency and performance of large-scale model training, keeping expensive GPU resources more fully utilized.