On the first day of its open-source week, DeepSeek officially released FlashMLA, an efficient Multi-head Latent Attention (MLA) decoding kernel designed specifically for NVIDIA's Hopper-architecture GPUs. The kernel is optimized for variable-length sequence serving and significantly improves the inference performance of large models.


Key technical features of FlashMLA include full BF16 support and a paged key-value (KV) cache with a block size of 64, which allows cache memory to be managed at a finer granularity. In terms of performance, on CUDA 12.6 FlashMLA achieves impressive results on the H800 SXM5 GPU: up to 3000 GB/s of memory bandwidth in memory-bound configurations and up to 580 TFLOPS of compute throughput in compute-bound configurations.
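For readers unfamiliar with paged KV caches, the idea mirrors virtual memory: the cache is split into fixed-size physical blocks, and a per-sequence block table maps logical token positions to those blocks, so variable-length sequences can grow without reserving large contiguous buffers. The PyTorch sketch below illustrates the concept only; the shapes, names, and `gather_kv` helper are hypothetical and are not FlashMLA's actual API.

```python
import torch

# Illustration of the paged-KV-cache idea: the cache is stored as
# fixed-size blocks (block size 64, matching FlashMLA's announced value),
# and a per-sequence block table maps logical positions to physical blocks.
BLOCK_SIZE = 64

def gather_kv(kv_cache: torch.Tensor,     # [num_blocks, BLOCK_SIZE, head_dim]
              block_table: torch.Tensor,  # [blocks_per_seq] physical block ids
              seq_len: int) -> torch.Tensor:
    """Reassemble one sequence's contiguous KV tensor from paged blocks."""
    num_blocks = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
    # Look up the physical blocks used by this sequence, then flatten
    # them into a contiguous [seq_len, head_dim] view.
    blocks = kv_cache[block_table[:num_blocks].long()]
    return blocks.reshape(-1, kv_cache.shape[-1])[:seq_len]

# Toy example: 8 physical blocks; a 100-token sequence happens to live
# in non-contiguous blocks 5 and 2.
kv_cache = torch.randn(8, BLOCK_SIZE, 128, dtype=torch.bfloat16)
block_table = torch.tensor([5, 2], dtype=torch.int32)
kv = gather_kv(kv_cache, block_table, seq_len=100)
print(kv.shape)  # torch.Size([100, 128])
```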

The project has already been validated in production environments and has demonstrated excellent stability. The development team states that FlashMLA's design draws on best practices from FlashAttention 2 and 3 as well as NVIDIA's CUTLASS project, building on that foundation with its own innovations.

Developers can deploy FlashMLA quickly: executing "python setup.py install" completes the installation, after which the test script "python tests/test_flash_mla.py" can be run to benchmark its performance, as shown below.
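The two commands from the announcement, in order (run from a clone of the repository; a Hopper GPU and CUDA 12.6 are assumed):

```bash
# Clone the repository (URL in the link below), then build and install.
git clone https://github.com/deepseek-ai/FlashMLA
cd FlashMLA
python setup.py install

# Run the bundled test script to verify the install and measure performance.
python tests/test_flash_mla.py
```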

Open-source repository: https://github.com/deepseek-ai/FlashMLA