Tencent recently released version 2.0 of its self-developed Xingmai Network, a major upgrade that delivers a substantial performance boost for large-scale AI model training. The new version achieves breakthroughs across network scale, hardware performance, communication protocols, and fault diagnosis.
On network scale, Xingmai Network 2.0 reportedly supports clusters of up to 100,000 GPUs on a single network, providing strong infrastructure support for large-scale AI training and laying the groundwork for even larger models in the future.
Image source note: the image was generated by AI and is licensed from Midjourney.
On the hardware side, the capacity of Tencent's self-developed switches has doubled from 25.6T to 51.2T, and its self-developed silicon photonics modules have been upgraded from 200G to 400G, likewise doubling in speed. The new version also adds a self-developed compute network card, bringing per-server communication bandwidth to an industry-leading 3.2T. Together, these hardware upgrades provide a solid foundation for the jump in network performance.
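As a quick back-of-the-envelope check, the 3.2T per-server figure is consistent with eight 400G ports per server; note that the eight-port count is our assumption, not a published Tencent specification.

```python
# Hypothetical sanity check; the 8-port figure is an assumption,
# not a published Tencent specification.
module_speed_gbps = 400   # upgraded silicon photonics module speed
ports_per_server = 8      # assumed port count per server
total_gbps = module_speed_gbps * ports_per_server
print(f"{total_gbps / 1000:.1f}T per server")  # -> 3.2T per server
```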
On communication protocols, Tencent has introduced the new TiTa2.0 protocol, whose deployment has moved from the switches to the network cards. At the same time, the congestion control algorithm has been upgraded to an active (proactive) congestion control scheme. These optimizations improve communication efficiency by 30% and large-model training efficiency by 10%.
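Tencent has not published TiTa2.0's internals, so as a rough illustration of what "active" congestion control can look like on a network card, here is a minimal delay-based rate-control sketch in Python, loosely in the spirit of proactive schemes such as TIMELY/Swift. Every name and constant here (ActiveRateController, the RTT thresholds, the step sizes) is hypothetical, not TiTa2.0's actual algorithm.

```python
import random

class ActiveRateController:
    """Illustrative proactive rate controller: it reacts to rising RTT
    before packet loss occurs, instead of waiting for drops."""

    def __init__(self, base_rtt_us=10.0, line_rate_gbps=400.0):
        self.base_rtt_us = base_rtt_us        # uncongested fabric RTT
        self.line_rate_gbps = line_rate_gbps
        self.rate_gbps = line_rate_gbps / 2   # start at half line rate

    def on_rtt_sample(self, rtt_us):
        # Queueing delay = how much the measured RTT exceeds the baseline.
        queueing = rtt_us - self.base_rtt_us
        if queueing > 2.0:
            # Congestion building: back off multiplicatively, scaled by delay.
            self.rate_gbps *= max(0.5, 1.0 - 0.01 * queueing)
        else:
            # Path looks idle: probe upward additively toward line rate.
            self.rate_gbps = min(self.line_rate_gbps, self.rate_gbps + 5.0)
        return self.rate_gbps

if __name__ == "__main__":
    ctl = ActiveRateController()
    for step in range(10):
        rtt = 10.0 + random.uniform(0.0, 6.0)  # simulated RTT samples
        rate = ctl.on_rtt_sample(rtt)
        print(f"step {step}: rtt={rtt:5.1f}us rate={rate:6.1f} Gbps")
```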
Additionally, Tencent has introduced TCCL2.0, a new high-performance collective communication library. It uses NVLINK+NET heterogeneous parallel communication to transmit data over NVLink and the network simultaneously. Coupled with the Auto-Tune Network Expert adaptive algorithm, the system automatically adjusts its parameters to the machine type, network scale, and model algorithm. This upgrade improves communication performance by a further 30% and large-model training efficiency by another 10%.
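The core idea of heterogeneous parallel transmission is to split a payload across the fast intra-node link (NVLink) and the slower inter-node network in proportion to their bandwidths, so both channels finish at roughly the same time and neither sits idle. The sketch below is a minimal illustration of that splitting logic, not TCCL2.0's actual API; the bandwidth figures and all function names are assumed for the example, and in a real system the split ratio would itself be one of the parameters an auto-tuner adjusts.

```python
import concurrent.futures

# Hypothetical per-channel bandwidths (GB/s); real values depend on hardware.
CHANNEL_BW = {"nvlink": 400.0, "net": 40.0}

def split_payload(total_bytes, bw):
    """Split a payload across channels in proportion to bandwidth, so all
    channels finish together (the idea behind NVLINK+NET parallelism)."""
    total_bw = sum(bw.values())
    return {ch: int(total_bytes * b / total_bw) for ch, b in bw.items()}

def send(channel, nbytes):
    # Stand-in for a real transfer; returns the modeled transfer time in ms.
    return nbytes / (CHANNEL_BW[channel] * 1e6)

def heterogeneous_send(total_bytes):
    chunks = split_payload(total_bytes, CHANNEL_BW)
    # Drive both channels concurrently, as parallel transmission would.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {ch: pool.submit(send, ch, n) for ch, n in chunks.items()}
        return {ch: f.result() for ch, f in futures.items()}

if __name__ == "__main__":
    times = heterogeneous_send(1 << 30)  # 1 GiB payload
    print({ch: f"{t:.2f} ms" for ch, t in times.items()})
    # Both channels report the same time: the split keeps them balanced.
```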
Notably, the combined TiTa and TCCL upgrades give Xingmai Network a 60% overall increase in communication efficiency and a 20% overall increase in large-model training efficiency (the reported totals combine the two 30% and two 10% gains additively). This performance boost will significantly accelerate AI model training, giving researchers and developers a more efficient working environment.