Frustrated by the snail's pace of large language models (LLMs) processing long texts? Don't worry! Tsinghua University has unveiled a groundbreaking technology—the APB parallel inference framework—essentially giving LLMs a turbocharged engine! Tests show this technology processes ultra-long texts up to 10 times faster than Flash Attention! Yes, you read that right, 10 times faster!

With the rise of ChatGPT and other LLMs, AI reading comprehension has improved dramatically, easily handling texts exceeding 100,000 characters. However, traditional LLMs still struggle when the input gets truly massive. The Transformer architecture is powerful, but its core attention mechanism, akin to a "super scanner," has a cost that grows quadratically with text length, drastically slowing processing as documents get longer.

To overcome this challenge, scientists from Tsinghua University, collaborating with various research institutions and tech giants, developed the APB framework. Its core innovation lies in the clever combination of "sequence parallelism" and "sparse attention."

Simply put, the APB framework works like a highly efficient collaborative team. It divides a long text into smaller chunks and assigns them to multiple GPU "team members" for parallel processing. Each "member" is also equipped with local KV cache compression and streamlined communication: while working on its own chunk, it shares only the most crucial information with the others, so the team can jointly resolve the complex semantic dependencies that span the full text.

Surprisingly, APB doesn't sacrifice performance for speed. In 128K ultra-long text tests, APB not only boasts incredible speed but also surpasses traditional Flash Attention in performance! It even outperforms Nvidia's Star Attention, achieving a 1.6x speed improvement—a true all-around ace.

This breakthrough directly cuts the time to first token when an LLM processes a long prompt. In practice, future LLMs that incorporate APB should be able to understand and respond to lengthy user instructions almost instantly, eliminating the frustrating "loading..." wait.

So, how does APB achieve such remarkable speed improvements?

APB addresses the core issue in long text processing: computational complexity. Traditional attention mechanisms' computational cost is proportional to the square of the text length, making long texts computationally expensive. To overcome this, APB employs two key strategies:
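To make the quadratic scaling concrete, here is a rough back-of-the-envelope calculation (illustrative numbers only, not figures from the paper):

```python
# Rough FLOP count for one attention layer with a single head, showing how
# the cost grows with the square of the sequence length. Illustrative only.
def attention_flops(seq_len: int, head_dim: int = 128) -> int:
    # QK^T score matrix: seq_len x seq_len x head_dim multiply-adds,
    # plus roughly the same again to multiply the scores by V.
    return 2 * seq_len * seq_len * head_dim

for n in (8_000, 32_000, 128_000):
    print(f"{n:>8} tokens -> {attention_flops(n):.2e} FLOPs")
# Quadrupling the length (32K -> 128K) makes attention about 16x more expensive.
```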

Strategy 1: Enhanced Parallelism – Many Hands Make Light Work

APB leverages distributed computing, distributing tasks across multiple GPUs, significantly increasing efficiency. Its sequence parallelism exhibits strong scalability, handling texts of any length regardless of model architecture.
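As a minimal sketch of the splitting idea (the helper below is hypothetical and not part of APB's API; the real partitioning logic lives in the project repository):

```python
# Minimal sketch of sequence parallelism: split one long prompt into
# near-equal chunks, one per GPU. Hypothetical helper, not APB's actual API.
def split_across_gpus(token_ids: list[int], num_gpus: int) -> list[list[int]]:
    chunk = -(-len(token_ids) // num_gpus)  # ceiling division
    return [token_ids[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]

prompt = list(range(128_000))               # stand-in for a 128K-token context
chunks = split_across_gpus(prompt, num_gpus=8)
print([len(c) for c in chunks])             # 8 chunks of ~16K tokens each
```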

Strategy 2: Reduced Inefficient Computations – Using Resources Wisely

APB incorporates a sparse attention mechanism, selectively computing attention. It focuses on key information, ignoring irrelevant parts, drastically reducing computational cost.
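The toy example below shows the general idea of sparse attention: each token attends only to an anchor prefix plus a small local window instead of the whole sequence. It is a numpy sketch for intuition, not APB's optimized kernel or its actual sparsity pattern.

```python
import numpy as np

# Toy sparse attention: each query attends only to an "anchor" prefix and a
# causal local window, instead of the full sequence. Illustrative only.
def sparse_attention(q, k, v, anchor_len=4, window=8):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    for i in range(n):
        mask[i, :anchor_len] = 0.0                    # always see the anchor block
        mask[i, max(0, i - window + 1):i + 1] = 0.0   # plus recent local tokens
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((32, 16))
print(sparse_attention(q, k, v).shape)                # (32, 16)
```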

However, combining parallelism with sparsity is harder than it sounds. Efficiently implementing sparse attention inside a sequence-parallel framework is APB's true innovation.

In a sequence-parallel setting, each GPU holds only part of the text, which makes global sparse attention difficult to achieve. Previous methods such as Star Attention and APE either compromise performance or apply only in limited scenarios.

APB avoids the pitfall of large-scale communication, creating a low-communication sparse attention mechanism for sequence parallel scenarios. Key components include:

Smaller Anchor blocks: These act as "navigators," guiding the attention mechanism to key information. APB innovatively reduces Anchor block size, enhancing flexibility and lowering computational overhead.

Novel Passing blocks: These are APB's core components, cleverly solving long-range semantic dependency issues. They "compress and package" key information from preceding GPUs, passing it to subsequent GPUs, allowing each "team member" to understand the overall context.

Query-aware context compression: APB introduces a "query-aware" mechanism, enabling the context compressor to understand the query, more accurately selecting and retaining relevant information, further improving efficiency and accuracy.
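A rough sketch of how a Passing block might be assembled is shown below. The relevance scoring here is a plain query-key dot product standing in for Locret's retaining heads, and every name in the snippet is hypothetical.

```python
import numpy as np

# Rough sketch of query-aware KV cache compression for a Passing block.
# The real system scores entries with Locret's retaining heads; a simple
# query-key dot product stands in here. All names are hypothetical.
def build_passing_block(keys, values, query, keep: int):
    scores = keys @ query                      # relevance of each cached token
    top = np.argsort(scores)[-keep:]           # keep the `keep` most relevant
    top.sort()                                 # preserve original token order
    return keys[top], values[top]

rng = np.random.default_rng(1)
keys = rng.standard_normal((16_000, 64))       # one GPU's local KV cache
values = rng.standard_normal((16_000, 64))
query = rng.standard_normal(64)                # pooled query representation
pk, pv = build_passing_block(keys, values, query, keep=512)
print(pk.shape, pv.shape)                      # (512, 64) (512, 64) sent onward
```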

APB's streamlined inference process:

Context splitting: Evenly distributes the long text across GPUs and prepends an Anchor block, which embeds the query, to each local chunk.

Context compression: Uses Locret's retaining heads to "intelligently compress" each chunk's KV cache.

Efficient communication: Uses the AllGather operator to "transmit" compressed KV caches to subsequent GPUs, creating Passing blocks.

High-speed computation: Employs specialized Flash Attention Kernels and optimized attention masks for efficient computation. Passing blocks "retire" after computation, not participating further.
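Putting the four steps together, here is a single-process simulation of the prefill flow under heavy simplifications: "GPUs" are just list entries, the AllGather is a plain Python loop, and compression is a crude truncation standing in for Locret-based KV cache compression. It mirrors the shape of the pipeline, not APB's implementation.

```python
# Single-process sketch of APB's four prefill steps. Everything is simplified:
# "GPUs" are list entries, AllGather is a plain loop, and compression just
# keeps the first `keep` entries of each chunk (stand-in for Locret).
def apb_prefill_sketch(tokens, query_tokens, num_gpus=4, keep=8):
    anchor = query_tokens + tokens[:16]              # 1. anchor block embedding the query
    step = -(-len(tokens) // num_gpus)
    chunks = [anchor + tokens[i * step:(i + 1) * step] for i in range(num_gpus)]

    compressed = [chunk[:keep] for chunk in chunks]  # 2. compress each local KV cache

    passing = [sum(compressed[:i], []) for i in range(num_gpus)]
                                                     # 3. "AllGather": GPU i receives the
                                                     #    compressed caches of GPUs 0..i-1

    return [passing[i] + chunks[i] for i in range(num_gpus)]
                                                     # 4. each GPU attends over
                                                     #    [passing block | anchor | local chunk]

local_inputs = apb_prefill_sketch(list(range(1_000)), query_tokens=[-1, -2])
print([len(x) for x in local_inputs])                # first GPU has no passing block
```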

Experimental results demonstrate APB's superior performance. Across various models (Llama-3.1-8B-instruct, Qwen-2.5-14B-instruct, Yi-34B-200K) and benchmarks (InfiniteBench, RULER), APB consistently outperforms others, achieving optimal balance between performance and speed.

Importantly, APB's speed advantage becomes more pronounced with increasing text length, achieving a "faster with longer text" effect. This is because APB's computational cost is significantly lower than other methods, and the gap widens with longer texts.
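A bit of illustrative arithmetic (with made-up block sizes, not measurements from the paper) shows why the gap keeps widening:

```python
# Illustrative arithmetic with made-up block sizes, not numbers from the paper.
# Full attention touches all n^2 query-key pairs; an APB-style GPU only touches
# its local chunk plus comparatively small anchor and passing blocks.
def full_attention_pairs(n):
    return n * n

def apb_style_pairs(n, num_gpus=8, anchor=2_000, passing=4_000):
    local = n // num_gpus
    return local * (local + anchor + passing)      # per-GPU work

for n in (32_000, 128_000, 512_000):
    ratio = full_attention_pairs(n) / apb_style_pairs(n)
    print(f"{n:>7} tokens: full attention is ~{ratio:.0f}x the per-GPU work")
# The ratio grows as the text gets longer, matching the "faster with longer text" effect.
```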

Further pre-filling time breakdown analysis shows that sequence parallelism significantly reduces attention and FFN (feed-forward network) computation time. APB's sparse attention further minimizes attention computation time. Compared to Star Attention, APB cleverly uses Passing blocks to handle long-range semantic dependencies, significantly reducing Anchor block size and FFN overhead, achieving a win-win scenario.

Excitingly, APB exhibits excellent compatibility, adapting to various distributed environments and model scales, maintaining high performance and efficiency under various conditions.

With APB, the bottleneck of LLM long-text inference is broken, expanding the possibilities of AI applications. Whether in intelligent customer service, financial analysis, scientific research, or content creation, we're entering a faster, stronger, and more intelligent AI era!

Project address: https://github.com/thunlp/APB

Paper address: https://arxiv.org/pdf/2502.12085