The Transformer model is powerful, but its decoding efficiency has long been a headache. This time, researchers from the Korea Advanced Institute of Science and Technology (KAIST), LG, and DeepMind have delivered a surprise: a new architecture called Block Transformer that boosts decoding speed by 10 to 20 times!

How did they achieve this? The key is that they split the Transformer's attention into blocks, doing away with the original Transformer's inefficient pattern of reading the entire global KV cache every time a token is generated.


The researchers analyzed the bottleneck of the original Transformer during decoding: effective GPU utilization is only about 1%, with the remaining 99% of the time spent on memory access. That is clearly wasteful, which motivated the Block Transformer. By decomposing global attention into coarse block-level attention and fine-grained attention within each block, the new architecture directly raises the model's inference throughput.
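As a rough illustration of why this decomposition cuts memory traffic, the snippet below compares the number of KV-cache entries read per decoding step (per layer and head). The context length and the block length of 4 are illustrative assumptions, not measurements from the paper.

```python
# Rough per-decoding-step comparison of KV-cache entries read, per layer/head.
# T and block_len are illustrative assumptions, not figures from the paper.
T, block_len = 8192, 4

vanilla_reads = T                      # vanilla decoding: attend over all T cached tokens
block_level_reads = T // block_len     # block decoder: attend over T/4 block embeddings
within_block_reads = block_len         # token decoder: attend only inside the current block

print(vanilla_reads, block_level_reads + within_block_reads)   # 8192 vs 2052
```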

Specifically, the Block Transformer works as follows: the input sequence is first split into fixed-size blocks, and the Embedder converts each block into a single block embedding. The Block Decoder then processes these block embeddings to capture global dependencies between blocks, while the Token Decoder models local dependencies between tokens within a block to generate the token sequence.
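To make this division of labor concrete, here is a minimal, self-contained PyTorch sketch of the global-to-local structure. All names (`BlockTransformerSketch`, `block_len`, layer counts) and the way the context embedding is injected into the token decoder are simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def causal_mask(n):
    # 0 where attention is allowed, -inf above the diagonal (future positions)
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)


class BlockTransformerSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, block_len=4,
                 n_heads=8, n_block_layers=2, n_token_layers=2):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)

        # Embedder: fold the block_len token embeddings of a block into one block embedding.
        self.embedder = nn.Linear(block_len * d_model, d_model)

        # Block decoder: causal self-attention over block embeddings (coarse, global).
        self.block_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_block_layers)

        # Token decoder: causal self-attention over tokens *within* one block,
        # conditioned on the context produced by the block decoder.
        self.token_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_token_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T), T divisible by block_len
        B, T = tokens.shape
        L, n_blocks = self.block_len, T // self.block_len
        x = self.tok_emb(tokens)                      # (B, T, D)

        # 1) Embedder: one embedding per block of L tokens.
        blocks = self.embedder(x.view(B, n_blocks, L * x.size(-1)))   # (B, n_blocks, D)

        # 2) Block decoder: global causal attention between blocks only.
        ctx = self.block_decoder(blocks, mask=causal_mask(n_blocks))  # (B, n_blocks, D)

        # Shift the context right by one block so tokens in block i only see blocks < i.
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)

        # 3) Token decoder: local causal attention inside each block, with the
        #    block's context added to every token position.
        h = x.view(B * n_blocks, L, -1) + ctx.reshape(B * n_blocks, 1, -1)
        h = self.token_decoder(h, mask=causal_mask(L))                # (B*n_blocks, L, D)
        return self.lm_head(h).view(B, T, -1)                         # next-token logits


logits = BlockTransformerSketch()(torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

During decoding, only the small block-level cache is global; the token decoder's cache covers just the current block, which is where the memory-access savings come from.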


The approach not only speeds up inference but also significantly reduces memory consumption. Some commenters noted that they had tried similar ideas before but could not maintain model quality; this method, by contrast, appears to shrink the KV cache effectively.

Moreover, Block Transformer achieves accuracy on multiple zero-shot tasks that is comparable to, and sometimes slightly higher than, a vanilla Transformer of the same size, showing that the efficiency gains do not come at the cost of quality.

The significance of the work does not stop there: it also lowers training cost, cutting the quadratic memory-access overhead of global attention by 16 times and raising GPU utilization from 1% to 44%.

Paper address: https://arxiv.org/abs/2406.02657