Researchers at NVIDIA recently unveiled an architecture-optimization technique called "FFN Fusion." The technique aims to significantly improve the inference efficiency of large language models (LLMs) by attacking a serial-computation bottleneck inherent in the Transformer architecture, paving the way for wider deployment of high-performance AI applications.
In recent years, large language models have demonstrated remarkable capabilities in natural language processing, scientific research, and conversational agents. However, as model size and complexity increase, the computational resources required for inference grow as well, creating efficiency bottlenecks. The Transformer architecture underlying LLMs alternates attention mechanisms and feed-forward network (FFN) layers, each processing the output of the one before it. As models scale across multiple GPUs, this inherently serial structure drives up both computation and inter-GPU communication costs, reducing efficiency and raising deployment costs. The problem is particularly pronounced in scenarios that demand rapid generation of many tokens, such as real-time AI assistants.
To address this challenge, NVIDIA researchers proposed FFN Fusion. The core idea is to merge consecutive, loosely dependent FFN layers into a single, wider FFN. The researchers observed that once some attention layers are removed (for example, by pruning), long runs of consecutive FFN layers often remain in LLMs. By analyzing these runs, they found that the dependency between the FFN layers is minimal, allowing them to be executed in parallel.
The mathematical foundation of FFN Fusion lies in concatenating the weights of multiple serially connected FFNs to create a single, equivalent module that can be computed in parallel. For instance, if three FFNs are stacked sequentially, with each FFN's output serving as the input to the next, FFN Fusion removes this dependency: all three FFNs process the same input concurrently, and their outputs are summed. Theoretical analysis shows that the fused FFN retains the same representational capacity as the original FFNs.
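The weight-concatenation idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not NVIDIA's implementation: it assumes gated (SwiGLU-style) FFNs as used in Llama models, and the names `FFN` and `fuse_ffns` are illustrative. The fused module's output on a shared input equals the sum of the individual FFN outputs, which is exactly the parallel approximation that replaces the sequential residual updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """A gated (SwiGLU-style) feed-forward block, as in Llama models."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def fuse_ffns(ffns: list[FFN], d_model: int) -> FFN:
    """Concatenate the weights of several FFNs into one wider FFN.

    The gate/up projections are stacked along the hidden dimension, and
    the down projection along its input dimension, so the fused FFN's
    output equals the sum of the individual FFN outputs on the same input.
    """
    d_hidden_total = sum(f.gate.out_features for f in ffns)
    fused = FFN(d_model, d_hidden_total)
    with torch.no_grad():
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused
```

One wide FFN launches a single pair of large matrix multiplications instead of several small sequential ones, which is what makes the fused layer friendlier to GPU hardware.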
Ultra-253B-Base: Enhanced Performance and Efficiency
NVIDIA researchers applied FFN Fusion to Meta's Llama-3.1-405B-Instruct model, creating a new model called Ultra-253B-Base through pruning and reconstruction. Experimental results demonstrate significant improvements in inference speed and resource efficiency: at a batch size of 32, the model achieved a 1.71x speedup in inference latency and a 35x reduction in per-token computation cost.
Even more impressive, this efficiency gain did not come at the cost of model capability. Ultra-253B-Base scored 85.17% on MMLU, 72.25% on MMLU-Pro, 86.58% on HumanEval, 84.92% on Arena Hard, and 9.19 on MT-Bench. These results match or even exceed those of the original 405-billion-parameter model, while Ultra-253B-Base contains only 253 billion parameters. Furthermore, the model's memory usage was halved, thanks to KV-cache optimization.
The researchers used cosine-distance analysis of hidden states between FFN layers to identify regions with low interdependency, which make the best candidates for fusion. FFN Fusion has been validated on models of different sizes (including 49 billion, 70 billion, and 253 billion parameters), demonstrating good generalizability.
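The dependency analysis can be sketched as follows. This is a simplified illustration of the idea rather than the paper's exact procedure: it assumes we have captured the hidden state entering each consecutive FFN block, and flags blocks whose output barely rotates their input (low cosine distance), since downstream blocks then see nearly the same input whether or not the block has run, i.e., the dependency is weak. The function name `ffn_block_distances` is illustrative.

```python
import torch
import torch.nn.functional as F

def ffn_block_distances(hidden_states: list[torch.Tensor]) -> list[float]:
    """Per-block cosine distances between consecutive hidden states.

    hidden_states: activations h_0 ... h_L entering each consecutive FFN
    block, each of shape [tokens, d_model]. Returns one mean cosine
    distance per block; runs of blocks with small distances are
    candidates for fusion into a single parallel FFN.
    """
    dists = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        # Cosine distance = 1 - cosine similarity, averaged over tokens.
        cos = F.cosine_similarity(h_in, h_out, dim=-1)
        dists.append((1.0 - cos).mean().item())
    return dists
```

A distance near 0 means a block leaves the direction of the residual stream almost unchanged; a distance near 2 means it flips it entirely, signaling a strong dependency that fusion would disturb.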
This research shows that through in-depth analysis and clever architectural design, the efficiency of LLMs can be significantly improved. FFN Fusion lays the foundation for designing more parallelizable and hardware-friendly LLMs. While the parallelization of the entire Transformer module faces more challenges due to stronger inter-layer dependencies, the success of FFN Fusion undoubtedly points to a crucial direction for future LLM efficiency optimization.