Frustrated by the snail's pace of large language models (LLMs) processing long texts? Don't worry! Tsinghua University has unveiled a groundbreaking technology—the APB parallel inference framework—essentially giving LLMs a turbocharged engine! Tests show this technology processes ultra-long texts up to 10 times faster than Flash Attention! Yes, you read that right, 10 times faster!

With the rise of ChatGPT and other LLMs, AI reading comprehension has improved dramatically, easily handling texts exceeding 100,000 characters. However, traditional LLMs still struggle when the input gets truly massive. The Transformer architecture is powerful, but its core attention mechanism, akin to a "super scanner," has a cost that grows quadratically with text length, drastically slowing processing as documents get longer.

To overcome this challenge, scientists from Tsinghua University, collaborating with various research institutions and tech giants, developed the APB framework. Its core innovation lies in the clever combination of "sequence parallelism" and "sparse attention."

Simply put, the APB framework works like a highly efficient collaborative team. It divides a long text into smaller chunks and assigns them to multiple GPU "team members" for parallel processing. Each "member" is also equipped with local KV cache compression and streamlined communication: while working on its own chunk, it shares only the most crucial information with the others, so the team can jointly resolve the complex semantic dependencies that span the full text.

Surprisingly, APB doesn't sacrifice performance for speed. In 128K ultra-long text tests, APB not only boasts incredible speed but also surpasses traditional Flash Attention in performance! It even outperforms Nvidia's Star Attention, achieving a 1.6x speed improvement—a true all-around ace.

This breakthrough directly cuts the time to first token when an LLM processes a long prompt. In practice, future LLMs that incorporate APB should be able to understand and respond to lengthy user instructions almost instantly, eliminating the frustrating "loading..." wait.

So, how does APB achieve such remarkable speed improvements?

APB addresses the core issue in long text processing: computational complexity. Traditional attention mechanisms' computational cost is proportional to the square of the text length, making long texts computationally expensive. To overcome this, APB employs two key strategies:
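To make the quadratic scaling concrete, here is a rough back-of-the-envelope calculation (illustrative numbers only, not figures from the paper):

```python
# Rough FLOP count for one attention layer with a single head, showing how
# the cost grows with the square of the sequence length. Illustrative only.
def attention_flops(seq_len: int, head_dim: int = 128) -> int:
    # QK^T score matrix: seq_len x seq_len x head_dim multiply-adds,
    # plus roughly the same again to multiply the scores by V.
    return 2 * seq_len * seq_len * head_dim

for n in (8_000, 32_000, 128_000):
    print(f"{n:>8} tokens -> {attention_flops(n):.2e} FLOPs")
# Quadrupling the length (32K -> 128K) makes attention about 16x more expensive.
```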

Strategy 1: Enhanced Parallelism – Many Hands Make Light Work

APB leverages distributed computing, distributing tasks across multiple GPUs, significantly increasing efficiency. Its sequence parallelism exhibits strong scalability, handling texts of any length regardless of model architecture.
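As a minimal sketch of the splitting idea (the helper below is hypothetical and not part of APB's API; the real partitioning logic lives in the project repository):

```python
# Minimal sketch of sequence parallelism: split one long prompt into
# near-equal chunks, one per GPU. Hypothetical helper, not APB's actual API.
def split_across_gpus(token_ids: list[int], num_gpus: int) -> list[list[int]]:
    chunk = -(-len(token_ids) // num_gpus)  # ceiling division
    return [token_ids[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]

prompt = list(range(128_000))               # stand-in for a 128K-token context
chunks = split_across_gpus(prompt, num_gpus=8)
print([len(c) for c in chunks])             # 8 chunks of ~16K tokens each
```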

Strategy 2: Reduced Inefficient Computations – Using Resources Wisely

APB incorporates a sparse attention mechanism, selectively computing attention. It focuses on key information, ignoring irrelevant parts, drastically reducing computational cost.
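The toy example below shows the general idea of sparse attention: each token attends only to an anchor prefix plus a small local window instead of the whole sequence. It is a numpy sketch for intuition, not APB's optimized kernel or its actual sparsity pattern.

```python
import numpy as np

# Toy sparse attention: each query attends only to an "anchor" prefix and a
# causal local window, instead of the full sequence. Illustrative only.
def sparse_attention(q, k, v, anchor_len=4, window=8):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    for i in range(n):
        mask[i, :anchor_len] = 0.0                    # always see the anchor block
        mask[i, max(0, i - window + 1):i + 1] = 0.0   # plus recent local tokens
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((32, 16))
print(sparse_attention(q, k, v).shape)                # (32, 16)
```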

However, combining parallelism with sparsity is harder than it sounds. Efficiently implementing sparse attention inside a sequence-parallel framework is APB's true innovation.

In a sequence-parallel setting, each GPU holds only part of the text, which makes global sparse attention difficult to achieve. Previous methods such as Star Attention and APE either compromise performance or apply only in limited scenarios.

APB avoids the pitfall of large-scale communication, creating a low-communication sparse attention mechanism for sequence parallel scenarios. Key components include:

Smaller Anchor blocks: These act as "navigators," guiding the attention mechanism to key information. APB innovatively reduces Anchor block size, enhancing flexibility and lowering computational overhead.

Novel Passing blocks: These are APB's core components, cleverly solving long-range semantic dependency issues. They "compress and package" key information from preceding GPUs, passing it to subsequent GPUs, allowing each "team member" to understand the overall context.

Query-aware context compression: APB introduces a "query-aware" mechanism, enabling the context compressor to understand the query, more accurately selecting and retaining relevant information, further improving efficiency and accuracy.
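A rough sketch of how a Passing block might be assembled is shown below. The relevance scoring here is a plain query-key dot product standing in for Locret's retaining heads, and every name in the snippet is hypothetical.

```python
import numpy as np

# Rough sketch of query-aware KV cache compression for a Passing block.
# The real system scores entries with Locret's retaining heads; a simple
# query-key dot product stands in here. All names are hypothetical.
def build_passing_block(keys, values, query, keep: int):
    scores = keys @ query                      # relevance of each cached token
    top = np.argsort(scores)[-keep:]           # keep the `keep` most relevant
    top.sort()                                 # preserve original token order
    return keys[top], values[top]

rng = np.random.default_rng(1)
keys = rng.standard_normal((16_000, 64))       # one GPU's local KV cache
values = rng.standard_normal((16_000, 64))
query = rng.standard_normal(64)                # pooled query representation
pk, pv = build_passing_block(keys, values, query, keep=512)
print(pk.shape, pv.shape)                      # (512, 64) (512, 64) sent onward
```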

APB's streamlined inference process:

Context splitting: Evenly distributes the long text across GPUs and prepends an Anchor block, which embeds the query, to each local chunk.

Context compression: Uses Locret's retaining heads to "intelligently compress" each chunk's KV cache.

Efficient communication: Uses the AllGather operator to "transmit" compressed KV caches to subsequent GPUs, creating Passing blocks.

High-speed computation: Employs specialized Flash Attention Kernels and optimized attention masks for efficient computation. Passing blocks "retire" after computation, not participating further.
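Putting the four steps together, here is a single-process simulation of the prefill flow under heavy simplifications: "GPUs" are just list entries, the AllGather is a plain Python loop, and compression is a crude truncation standing in for Locret-based KV cache compression. It mirrors the shape of the pipeline, not APB's implementation.

```python
# Single-process sketch of APB's four prefill steps. Everything is simplified:
# "GPUs" are list entries, AllGather is a plain loop, and compression just
# keeps the first `keep` entries of each chunk (stand-in for Locret).
def apb_prefill_sketch(tokens, query_tokens, num_gpus=4, keep=8):
    anchor = query_tokens + tokens[:16]              # 1. anchor block embedding the query
    step = -(-len(tokens) // num_gpus)
    chunks = [anchor + tokens[i * step:(i + 1) * step] for i in range(num_gpus)]

    compressed = [chunk[:keep] for chunk in chunks]  # 2. compress each local KV cache

    passing = [sum(compressed[:i], []) for i in range(num_gpus)]
                                                     # 3. "AllGather": GPU i receives the
                                                     #    compressed caches of GPUs 0..i-1

    return [passing[i] + chunks[i] for i in range(num_gpus)]
                                                     # 4. each GPU attends over
                                                     #    [passing block | anchor | local chunk]

local_inputs = apb_prefill_sketch(list(range(1_000)), query_tokens=[-1, -2])
print([len(x) for x in local_inputs])                # first GPU has no passing block
```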

Experimental results demonstrate APB's superior performance. Across various models (Llama-3.1-8B-instruct, Qwen-2.5-14B-instruct, Yi-34B-200K) and benchmarks (InfiniteBench, RULER), APB consistently outperforms others, achieving optimal balance between performance and speed.

Importantly, APB's speed advantage becomes more pronounced with increasing text length, achieving a "faster with longer text" effect. This is because APB's computational cost is significantly lower than other methods, and the gap widens with longer texts.
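A bit of illustrative arithmetic (with made-up block sizes, not measurements from the paper) shows why the gap keeps widening:

```python
# Illustrative arithmetic with made-up block sizes, not numbers from the paper.
# Full attention touches all n^2 query-key pairs; an APB-style GPU only touches
# its local chunk plus comparatively small anchor and passing blocks.
def full_attention_pairs(n):
    return n * n

def apb_style_pairs(n, num_gpus=8, anchor=2_000, passing=4_000):
    local = n // num_gpus
    return local * (local + anchor + passing)      # per-GPU work

for n in (32_000, 128_000, 512_000):
    ratio = full_attention_pairs(n) / apb_style_pairs(n)
    print(f"{n:>7} tokens: full attention is ~{ratio:.0f}x the per-GPU work")
# The ratio grows as the text gets longer, matching the "faster with longer text" effect.
```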

Further pre-filling time breakdown analysis shows that sequence parallelism significantly reduces attention and FFN (feed-forward network) computation time. APB's sparse attention further minimizes attention computation time. Compared to Star Attention, APB cleverly uses Passing blocks to handle long-range semantic dependencies, significantly reducing Anchor block size and FFN overhead, achieving a win-win scenario.

Excitingly, APB exhibits excellent compatibility, adapting to various distributed environments and model scales, maintaining high performance and efficiency under various conditions.

With APB, the bottleneck of LLM long-text inference is broken, expanding the possibilities of AI applications. Whether in intelligent customer service, financial analysis, scientific research, or content creation, we're entering a faster, stronger, and more intelligent AI era!

Project address: https://github.com/thunlp/APB

Paper address: https://arxiv.org/pdf/2502.12085