In this era of information explosion, artificial intelligence shines like a constellation of brilliant stars, illuminating the night sky of human wisdom. Among these stars, the Transformer architecture is the most dazzling, leading natural language processing into a new era with the self-attention mechanism at its core.

However, even the brightest stars have their unreachable corners. For Transformer models dealing with long contexts, the high resource consumption of self-attention calculations poses a significant challenge. Imagine trying to make an AI understand an article of tens of thousands of words, where every word must be compared with every other word in the text—the computational load is immense.

To address this issue, a group of scientists from Zyphra and EleutherAI have proposed a novel method called Tree Attention.

Self-attention, the core of the Transformer model, has a computational complexity that grows quadratically with the sequence length. This becomes a significant hurdle, especially for large language models (LLMs) when processing long texts.
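To make the quadratic scaling concrete, here is a minimal NumPy sketch of single-head self-attention (the function name, shapes, and sizes are illustrative choices, not taken from any particular codebase). The score matrix holds one entry for every pair of positions, so both the memory it occupies and the work to fill it grow with the square of the sequence length.

```python
import numpy as np

def naive_self_attention(Q, K, V):
    """Single-head self-attention: O(n^2) time and memory in the sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) matrix: every query against every key
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (n, d) output

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_self_attention(Q, K, V)
print(out.shape)  # (4096, 64); doubling n quadruples the (n, n) score matrix
```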

The advent of Tree Attention brings order to this computational forest. It decomposes the self-attention calculation into many parallel tasks through a tree-reduction approach: each task computes a partial result over its own chunk of the sequence, forming a leaf of the tree, and these partial results are merged level by level until the complete attention output emerges at the root, as the sketch below illustrates.
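Such a decomposition is possible because exact softmax attention can be assembled from per-chunk partial results: the running maximum of the scores, the sum of exponentials, and the exp-weighted sum of values can all be merged associatively, so the order of combination does not change the answer. The sketch below is a single-process NumPy illustration of that idea (the chunk layout and helper names such as `chunk_partial`, `combine`, and `tree_reduce` are invented for this example, and no real GPU communication takes place); it merges per-chunk partials pairwise in a tree and recovers exactly the same result as ordinary attention.

```python
import numpy as np

def chunk_partial(q, K_chunk, V_chunk):
    """Partial state for one chunk of keys/values: (max score, sum of exps, exp-weighted value sum)."""
    s = K_chunk @ q / np.sqrt(q.shape[0])
    m = s.max()
    e = np.exp(s - m)
    return m, e.sum(), e @ V_chunk

def combine(a, b):
    """Associative merge of two partial states; any combination order gives the same result."""
    (ma, da, na), (mb, db, nb) = a, b
    m = max(ma, mb)
    return m, da * np.exp(ma - m) + db * np.exp(mb - m), na * np.exp(ma - m) + nb * np.exp(mb - m)

def tree_reduce(partials):
    """Pairwise (tree-shaped) reduction: about log2(#chunks) levels instead of a sequential chain."""
    while len(partials) > 1:
        partials = [combine(partials[i], partials[i + 1]) if i + 1 < len(partials) else partials[i]
                    for i in range(0, len(partials), 2)]
    return partials[0]

rng = np.random.default_rng(0)
n, d, n_chunks = 1024, 64, 8                       # imagine each chunk living on its own GPU
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
parts = [chunk_partial(q, Kc, Vc) for Kc, Vc in zip(np.split(K, n_chunks), np.split(V, n_chunks))]
m, denom, numer = tree_reduce(parts)
tree_out = numer / denom

# Reference: ordinary softmax attention over the full sequence.
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max())
ref = (w @ V) / w.sum()
print(np.allclose(tree_out, ref))                  # True: the tree reduction is exact
```

With P chunks, a pairwise reduction of this kind needs on the order of log2(P) combination steps, which is the shape of communication Tree Attention maps onto a reduction across devices.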

More remarkably, the authors of Tree Attention derive an energy function for self-attention, giving it a Bayesian interpretation and linking it closely to energy-based models such as Hopfield networks.
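The article does not reproduce the derivation, but the flavor of the connection can be seen in a standard identity: the softmax-weighted sum that attention computes is the gradient of a log-sum-exp ("free energy") term with respect to an auxiliary source coupled to the values. The short numerical check below illustrates only that identity, with made-up names and shapes; it is a sketch of the kind of energy function involved, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def energy(zeta, q, K, V):
    """Log-sum-exp 'energy' with a source term zeta coupled to the values."""
    return np.log(np.sum(np.exp(K @ q + V @ zeta)))

# Attention output the usual way: softmax over the scores, then a weighted sum of values.
s = K @ q
w = np.exp(s - s.max())
w /= w.sum()
attn = w @ V

# A finite-difference gradient of the energy at zeta = 0 recovers the same vector.
eps = 1e-6
grad = np.array([(energy(eps * np.eye(d)[i], q, K, V) - energy(-eps * np.eye(d)[i], q, K, V)) / (2 * eps)
                 for i in range(d)])
print(np.allclose(grad, attn, atol=1e-5))   # True: attention = gradient of logsumexp at zeta = 0
```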

Tree Attention also takes the network topology of modern GPU clusters into account: it exploits the high-bandwidth links within each node and reduces the amount of traffic that must cross the slower inter-node network, thereby improving overall efficiency.
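As a toy illustration of this two-level idea (a single-process simulation with an invented 4-node-by-8-GPU layout, not the library's actual communication code): partial results are first merged over the fast links inside each node, and only one reduced value per node then has to cross the slower inter-node network. The example reduces just the attention denominator, a log-sum-exp, since that is enough to show the traffic pattern.

```python
import numpy as np

def lse_merge(a, b):
    """Merge two (max, sum-of-exps) pairs; the building block of the attention-denominator reduction."""
    (ma, sa), (mb, sb) = a, b
    m = max(ma, mb)
    return m, sa * np.exp(ma - m) + sb * np.exp(mb - m)

rng = np.random.default_rng(0)
scores = rng.standard_normal(4096)
# Toy layout: 4 nodes x 8 GPUs, each GPU holding one slice of the attention scores.
per_gpu = np.split(scores, 32)
partials = [(s.max(), np.exp(s - s.max()).sum()) for s in per_gpu]

# Level 1: reduce over the fast intra-node links (8 GPUs per node -> 1 partial per node).
nodes = []
for n_id in range(4):
    acc = partials[n_id * 8]
    for p in partials[n_id * 8 + 1 : (n_id + 1) * 8]:
        acc = lse_merge(acc, p)
    nodes.append(acc)

# Level 2: only 4 values ever cross the slower inter-node network.
total = nodes[0]
for p in nodes[1:]:
    total = lse_merge(total, p)

m, s = total
print(np.isclose(m + np.log(s), np.log(np.exp(scores).sum())))  # True: same logsumexp, far less cross-node traffic
```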

In a series of experiments, the authors evaluated Tree Attention across different sequence lengths and GPU counts. The results show that it decodes up to 8 times faster than the existing Ring Attention method on multiple GPUs, while significantly reducing communication volume and peak memory usage.

The proposal of Tree Attention not only offers an efficient way to compute attention over long contexts but also provides new insights into the internal mechanisms of Transformer models. As AI technology continues to advance, we have reason to believe that Tree Attention will play a significant role in future AI research and applications.

Paper link: https://mp.weixin.qq.com/s/U9FaE6d-HJGsUs7u9EKKuQ