Researchers at NVIDIA recently unveiled an architecture-optimization technique called "FFN Fusion." The technique aims to significantly improve the inference efficiency of large language models (LLMs) by attacking a serial-computation bottleneck inherent in the Transformer architecture, paving the way for wider deployment of high-performance AI applications.
In recent years, large language models have demonstrated remarkable capabilities in natural language processing, scientific research, and conversational agents. However, as model size and complexity increase, the computational resources required for inference grow as well, creating efficiency bottlenecks. The Transformer architecture underlying LLMs alternates attention mechanisms and feed-forward network (FFN) layers, each processing the output of the one before it. As models scale across multiple GPUs, this inherently serial structure drives up both computation and inter-GPU communication costs, reducing efficiency and raising deployment costs. The problem is particularly pronounced in scenarios that demand rapid generation of many tokens, such as real-time AI assistants.
To address this challenge, NVIDIA researchers proposed FFN Fusion. The core idea is to merge consecutive, loosely dependent FFN layers into a single, wider FFN. The researchers observed that once some attention layers are removed (for example, by pruning), long runs of consecutive FFN layers often remain in LLMs. By analyzing these runs, they found that the dependency between the FFN layers is minimal, allowing them to be executed in parallel.
The mathematical foundation of FFN Fusion lies in concatenating the weights of multiple serially connected FFNs to create a single, equivalent module that can be computed in parallel. For instance, if three FFNs are stacked sequentially, with each FFN's output serving as the input to the next, FFN Fusion removes this dependency: all three FFNs process the same input concurrently, and their outputs are summed. Theoretical analysis shows that the fused FFN retains the same representational capacity as the original FFNs.
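The weight-concatenation idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not NVIDIA's implementation: it assumes gated (SwiGLU-style) FFNs as used in Llama models, and the names `FFN` and `fuse_ffns` are illustrative. The fused module's output on a shared input equals the sum of the individual FFN outputs, which is exactly the parallel approximation that replaces the sequential residual updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """A gated (SwiGLU-style) feed-forward block, as in Llama models."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def fuse_ffns(ffns: list[FFN], d_model: int) -> FFN:
    """Concatenate the weights of several FFNs into one wider FFN.

    The gate/up projections are stacked along the hidden dimension, and
    the down projection along its input dimension, so the fused FFN's
    output equals the sum of the individual FFN outputs on the same input.
    """
    d_hidden_total = sum(f.gate.out_features for f in ffns)
    fused = FFN(d_model, d_hidden_total)
    with torch.no_grad():
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused
```

One wide FFN launches a single pair of large matrix multiplications instead of several small sequential ones, which is what makes the fused layer friendlier to GPU hardware.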
Ultra-253B-Base: Enhanced Performance and Efficiency
NVIDIA researchers applied FFN Fusion to Meta's Llama-3.1-405B-Instruct model, creating a new model called Ultra-253B-Base through pruning and reconstruction. Experimental results demonstrate significant improvements in inference speed and resource efficiency: at a batch size of 32, the model achieved a 1.71x speedup in inference latency and a 35x reduction in per-token computation cost.
Even more impressive, this efficiency gain did not come at the cost of model capability. Ultra-253B-Base scored 85.17% on MMLU, 72.25% on MMLU-Pro, 86.58% on HumanEval, 84.92% on Arena Hard, and 9.19 on MT-Bench. These results match or even exceed those of the original 405-billion-parameter model, while Ultra-253B-Base contains only 253 billion parameters. Furthermore, the model's memory usage was halved, thanks to KV-cache optimization.
The researchers used cosine-distance analysis of hidden states between FFN layers to identify regions with low interdependency, which make the best candidates for fusion. FFN Fusion has been validated on models of different sizes (including 49 billion, 70 billion, and 253 billion parameters), demonstrating good generalizability.
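The dependency analysis can be sketched as follows. This is a simplified illustration of the idea rather than the paper's exact procedure: it assumes we have captured the hidden state entering each consecutive FFN block, and flags blocks whose output barely rotates their input (low cosine distance), since downstream blocks then see nearly the same input whether or not the block has run, i.e., the dependency is weak. The function name `ffn_block_distances` is illustrative.

```python
import torch
import torch.nn.functional as F

def ffn_block_distances(hidden_states: list[torch.Tensor]) -> list[float]:
    """Per-block cosine distances between consecutive hidden states.

    hidden_states: activations h_0 ... h_L entering each consecutive FFN
    block, each of shape [tokens, d_model]. Returns one mean cosine
    distance per block; runs of blocks with small distances are
    candidates for fusion into a single parallel FFN.
    """
    dists = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        # Cosine distance = 1 - cosine similarity, averaged over tokens.
        cos = F.cosine_similarity(h_in, h_out, dim=-1)
        dists.append((1.0 - cos).mean().item())
    return dists
```

A distance near 0 means a block leaves the direction of the residual stream almost unchanged; a distance near 2 means it flips it entirely, signaling a strong dependency that fusion would disturb.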
This research shows that through in-depth analysis and clever architectural design, the efficiency of LLMs can be significantly improved. FFN Fusion lays the foundation for designing more parallelizable and hardware-friendly LLMs. While the parallelization of the entire Transformer module faces more challenges due to stronger inter-layer dependencies, the success of FFN Fusion undoubtedly points to a crucial direction for future LLM efficiency optimization.