The ByteDance Doubao Large Model team announced today that it has developed a new sparse model architecture called UltraMem. The architecture tackles the high memory-access cost that MoE (Mixture of Experts) models incur during inference, delivering inference speeds 2-6 times faster than MoE and cutting inference costs by up to 83%. The advance opens a new path toward efficient inference for large models.


The UltraMem architecture resolves the memory-access bottleneck of MoE inference while preserving model quality. Experimental results show that, with the same parameter count and the same number of activated parameters, UltraMem not only outperforms MoE in model quality but also runs inference 2-6 times faster. Moreover, at common batch sizes, UltraMem's memory-access cost is close to that of a dense model with the same compute, which substantially lowers inference costs.


The research team trained an UltraMem model with 20 million values. Experiments show that, at equivalent compute, this model achieves both industry-leading inference speed and model quality. The result validates the strong scaling properties of the UltraMem architecture and lays the technical groundwork for building models with billions of values or experts.

As large models continue to grow, inference cost and speed have become critical constraints on their deployment. Although the MoE architecture decouples computation from parameters, its heavy memory access during inference drives up latency. UltraMem addresses this problem and offers a new technical option for deploying models at scale, as the sketch below illustrates.
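To make the memory-access issue concrete, here is a minimal, hypothetical sketch of top-k expert routing in a MoE layer. It is not ByteDance's implementation and does not describe UltraMem itself; the class name, dimensions, and routing logic are illustrative assumptions chosen only to show why sparse activation keeps FLOPs low while still forcing expert weights to be fetched from memory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative top-k MoE layer (hypothetical, for explanation only).

    Only `top_k` experts are computed per token, so arithmetic stays cheap,
    but every expert that is selected by at least one token still has its
    full weight matrices read from memory. At small batch sizes this memory
    traffic, rather than the FLOPs, tends to dominate inference latency.
    """

    def __init__(self, d_model=512, d_ff=2048, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: [n_tokens, d_model]
        scores = self.router(x)                  # [n_tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens routed to expert e. Touching this expert forces its
            # weights (~2 * d_model * d_ff parameters) to be loaded from
            # memory even if only a single token selected it.
            mask = (idx == e).any(dim=-1)
            if mask.any():
                gate = weights[mask][idx[mask] == e].unsqueeze(-1)
                out[mask] += gate * expert(x[mask])
        return out
```

Under these assumptions, the compute per token scales with `top_k`, but the weights touched per forward pass scale with the number of distinct experts hit across the batch, which is the memory-access pattern the article identifies as the latency bottleneck that UltraMem is designed to reduce.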