Large language models (LLMs) have developed rapidly in recent years, with the Transformer architecture at their core. The heart of the Transformer is the attention mechanism, which acts as an information filter that lets the model focus on the most important parts of a sentence. Even a powerful Transformer, however, can be distracted by irrelevant content, much like searching for one book in a library and being buried under a pile of unrelated volumes, which makes the search inefficient.
The attention that the model wastes on irrelevant context is referred to in the literature as attention noise. Imagine trying to locate one key piece of information in a long document while the model's attention is scattered across irrelevant passages, like a nearsighted reader who cannot bring the important part into focus.
To address this issue, the paper proposes the Differential Transformer (DIFF Transformer). Although the name sounds sophisticated, the principle is simple and works much like noise-canceling headphones, which cancel noise by subtracting one signal from another.
The core of the Differential Transformer is the differential attention mechanism. It splits the query and key vectors into two groups, computes two separate softmax attention maps, and then subtracts one map from the other to obtain the final attention scores. The process is like photographing the same scene with two cameras and subtracting one photo from the other, so that only the differences stand out.
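To make the idea concrete, here is a minimal single-head sketch of differential attention in PyTorch. The function name `diff_attention`, the fixed scalar `lam`, and the toy random weights are illustrative assumptions; in the paper, λ is a learnable re-parameterized scalar and the mechanism is used in a multi-head setting.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, W_q, W_k, W_v, lam=0.8):
    """Single-head differential attention sketch.

    x:        (seq_len, d_model) input
    W_q, W_k: (d_model, 2 * d_head) projections; the output is split
              into two groups of queries / keys
    W_v:      (d_model, d_head) value projection
    lam:      weight on the second attention map (learnable and
              re-parameterized in the paper; a fixed value here)
    """
    d_head = W_v.shape[1]

    q = x @ W_q            # (seq_len, 2 * d_head)
    k = x @ W_k            # (seq_len, 2 * d_head)
    v = x @ W_v            # (seq_len, d_head)

    # Split queries and keys into two groups.
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)

    scale = d_head ** -0.5
    a1 = F.softmax(q1 @ k1.T * scale, dim=-1)   # first attention map
    a2 = F.softmax(q2 @ k2.T * scale, dim=-1)   # second attention map

    # Differential attention: subtract the two maps so that attention
    # assigned to irrelevant positions by both maps cancels out.
    return (a1 - lam * a2) @ v


# Toy usage with random weights.
seq_len, d_model, d_head = 8, 32, 16
x = torch.randn(seq_len, d_model)
W_q = torch.randn(d_model, 2 * d_head) / d_model ** 0.5
W_k = torch.randn(d_model, 2 * d_head) / d_model ** 0.5
W_v = torch.randn(d_model, d_head) / d_model ** 0.5
print(diff_attention(x, W_q, W_k, W_v).shape)  # torch.Size([8, 16])
```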
Through this subtraction, the Differential Transformer can effectively cancel attention noise, allowing the model to concentrate on the key information. It is like putting on noise-canceling headphones: the surrounding noise disappears, and the sound you actually want comes through more clearly.
The paper presents a series of experiments to demonstrate the advantages of the Differential Transformer. First, it performs strongly in language modeling, matching the Transformer's performance with only about 65% of the model size or training tokens.
Second, the Differential Transformer excels at long-context modeling, making more effective use of the extended context.
More importantly, the Differential Transformer shows clear advantages in key information retrieval, hallucination mitigation, and in-context learning.
In key information retrieval, the Differential Transformer behaves like a precise search engine, locating the target content within a large amount of context and maintaining high accuracy even in very complex retrieval scenarios.
In mitigating hallucinations, the Differential Transformer keeps the model from "making things up," producing more accurate and reliable summaries and question-answering outputs.
In in-context learning, the Differential Transformer is like a top student, quickly picking up new tasks from a handful of examples, and its results are more stable and less sensitive to the ordering of those examples than the Transformer's.
Additionally, the Differential Transformer reduces outliers in the model's activation values, which makes it friendlier to quantization: activations can be quantized to lower bit widths, improving model efficiency.
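As a small, self-contained illustration (not taken from the paper) of why activation outliers make low-bit quantization harder: with symmetric int8 quantization, the scale is set by the largest absolute activation, so a single outlier stretches the quantization grid and increases the error on all the ordinary values.

```python
import torch

def int8_quantize_dequantize(x):
    """Symmetric per-tensor int8 quantization followed by dequantization.
    The scale is determined by the largest absolute value in the tensor."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale

torch.manual_seed(0)
acts = torch.randn(1024)            # well-behaved activations
acts_outlier = acts.clone()
acts_outlier[0] = 100.0             # one large outlier

# Measure the reconstruction error on the ordinary activations only.
err_clean = (acts - int8_quantize_dequantize(acts)).abs().mean()
err_outlier = (acts_outlier[1:] - int8_quantize_dequantize(acts_outlier)[1:]).abs().mean()
print(f"mean abs error without outlier: {err_clean:.4f}")
print(f"mean abs error with outlier:    {err_outlier:.4f}")
```

Fewer outliers mean a smaller scale, a finer quantization grid, and therefore less accuracy lost at low bit widths.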
In summary, the Differential Transformer uses the differential attention mechanism to address the Transformer's attention-noise problem and achieves clear improvements across multiple tasks. It offers a new direction for the development of large language models and may prove useful in a wider range of applications.