Qafind Labs proudly announces its latest development: the ChatDLM model, a release that has drawn significant attention in the field of artificial intelligence. ChatDLM deeply integrates "Block Diffusion" and "Mixture of Experts (MoE)," achieving an inference speed of 2,800 tokens/s on a GPU and supporting an ultra-large context window of 131,072 tokens, enabling document-level generation and real-time interaction.

ChatDLM's key feature lies in its technical architecture. The 7B-parameter model uses block diffusion to split the input into blocks, combining spatial diffusion with inter-block attention to significantly improve processing speed. At the same time, ChatDLM integrates MoE technology, deploying 32-64 experts and routing each input to two experts at a time, further optimizing model performance.
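
To make the top-2 expert routing concrete, here is a minimal, illustrative sketch of a Mixture-of-Experts layer in PyTorch. This is not Qafind Labs' implementation (which has not been published); the class name, expert count default, and feed-forward sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Illustrative MoE layer: each token is routed to its two highest-scoring experts."""

    def __init__(self, d_model: int, num_experts: int = 32, d_ff: int = 2048):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Score every expert, keep the top two per token.
        scores = self.router(x)                          # (tokens, num_experts)
        top2_vals, top2_idx = scores.topk(2, dim=-1)     # two experts per token
        weights = F.softmax(top2_vals, dim=-1)           # normalize the two gate scores
        out = torch.zeros_like(x)
        for slot in range(2):                            # first and second choice
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, slot] == e            # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run on each token, which is how MoE designs keep per-token compute low while scaling total parameter count.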

To handle ultra-large contexts, ChatDLM adopts RoPE optimization and hierarchical caching, significantly enhancing the model's long-context capacity. On the inference side, dynamic early stopping and BF16 mixed precision reduce computation, while ZeRO sharding allows the model to scale easily across multiple GPUs, further improving efficiency and scalability.
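
The sketch below illustrates two of the inference-side ideas mentioned above: BF16 mixed precision via torch.autocast and a dynamic early-stopping loop over iterative refinement steps. The `model` call, the step limit, and the convergence tolerance are hypothetical placeholders, not details from ChatDLM itself.

```python
import torch

@torch.no_grad()
def iterative_decode(model, block: torch.Tensor, max_steps: int = 25, tol: float = 1e-3):
    """Refine a block iteratively, stopping early once it stops changing."""
    prev = block
    for step in range(max_steps):
        # BF16 mixed precision cuts memory and bandwidth on A100-class GPUs.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            curr = model(prev)  # one refinement pass (hypothetical model API)
        # Dynamic early stopping: exit when the relative change falls below tol.
        if torch.norm(curr - prev) / torch.norm(prev) < tol:
            return curr, step + 1
        prev = curr
    return prev, max_steps
```

Early stopping of this kind is consistent with the reported 12-25 average iteration steps: easy blocks converge in a few passes, while harder ones use the full budget.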

Performance tests on A100 GPUs demonstrated ChatDLM's strong performance: a throughput of 2,800 tokens/s, a context length of 131,072 tokens, and an average of 12-25 iteration steps. It achieved 92.0% accuracy on HumanEval (0-shot), 84.2% on the Fill-in-the-Middle test, and 83.9% on ARC-E (0-shot), fully demonstrating its capabilities.

Looking ahead, Qafind Labs plans to integrate advanced technologies such as Adaptive Iteration, Graph-Attention, and Multimodal Diffusion into ChatDLM to further enhance its accuracy and applicability.

Experience it here: https://www.chatdlm.cn