Star-Attention

Efficient Inference Technology for Long-Sequence Large Language Models

Common Product, Programming, NVIDIA, Large Language Models
Star-Attention is a block-sparse attention mechanism proposed by NVIDIA to make long-sequence inference more efficient for Transformer-based large language models (LLMs). It works in two phases: the long context is first processed block by block with local attention, and the query and generated tokens then attend to the entire cached context. This speeds up inference significantly while preserving 95-100% of the original model's accuracy. It is compatible with most Transformer-based LLMs and can be used directly, without additional training or fine-tuning. It can also be combined with other optimization methods, such as Flash Attention and KV cache compression, to further improve performance.
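The two-phase idea can be illustrated with a minimal single-head, single-layer NumPy sketch. This is not NVIDIA's implementation: the function names (`softmax_attention`, `phase1_context`, `phase2_query`), the use of the first block as an "anchor" that every later block also sees, and the log-sum-exp merge of per-block results are simplifying assumptions made here for illustration. In a real multi-layer model the blocks would live on different devices and the cached keys and values would themselves depend on the block-local attention of earlier layers.

```python
import numpy as np

def softmax_attention(q, k, v, mask=None):
    """Scaled dot-product attention; also returns the log-sum-exp of each score row."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    row_max = scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=-1, keepdims=True)
    out = (weights / denom) @ v
    lse = (row_max + np.log(denom)).squeeze(-1)
    return out, lse

def phase1_context(q, k, v, block_size):
    """Phase 1 (assumed form): causal attention restricted to each block,
    with every block after the first also attending to the first (anchor) block."""
    n = q.shape[0]
    outputs = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        if start == 0:
            k_loc, v_loc, offset = k[:end], v[:end], 0
        else:
            k_loc = np.concatenate([k[:block_size], k[start:end]])
            v_loc = np.concatenate([v[:block_size], v[start:end]])
            offset = block_size
        rows, cols = end - start, k_loc.shape[0]
        # Row i sees every anchor position plus local positions up to and including i.
        mask = np.arange(cols)[None, :] <= (np.arange(rows)[:, None] + offset)
        out, _ = softmax_attention(q[start:end], k_loc, v_loc, mask)
        outputs.append(out)
    return np.concatenate(outputs)

def phase2_query(q, k, v, block_size):
    """Phase 2 (assumed form): the query attends to each cached block separately,
    and the partial results are merged with log-sum-exp weights, which
    reproduces attention over the whole cache."""
    outs, lses = [], []
    for start in range(0, k.shape[0], block_size):
        out, lse = softmax_attention(q, k[start:start + block_size], v[start:start + block_size])
        outs.append(out)
        lses.append(lse)
    outs, lses = np.stack(outs), np.stack(lses)      # (blocks, n_q, d), (blocks, n_q)
    merge = np.exp(lses - lses.max(axis=0))
    merge /= merge.sum(axis=0)
    return (merge[..., None] * outs).sum(axis=0)

# Toy data: 16 context tokens in blocks of 4, and 2 query tokens.
rng = np.random.default_rng(0)
d, n_ctx, n_q, block = 8, 16, 2, 4
q_ctx, k_ctx, v_ctx = (rng.normal(size=(n_ctx, d)) for _ in range(3))
q_new = rng.normal(size=(n_q, d))

ctx_out = phase1_context(q_ctx, k_ctx, v_ctx, block)
query_out = phase2_query(q_new, k_ctx, v_ctx, block)
full_out, _ = softmax_attention(q_new, k_ctx, v_ctx)
print(np.allclose(query_out, full_out))
```

In this sketch the saving comes from phase 1, where each context token attends to at most two blocks instead of the whole prefix. Phase 2 stays exact with respect to the cache: merging the per-block results gives the same output as one global softmax over all cached keys and values.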