Large language models (LLMs) have developed rapidly in recent years, with the Transformer architecture at their core. The heart of the Transformer is the attention mechanism, which acts as an information filter that lets the model focus on the most important parts of a sentence. Even a powerful Transformer, however, can be distracted by irrelevant content, much like searching for one book in a library and being buried under a pile of unrelated volumes, which makes the search inefficient.
The attention that the model wastes on irrelevant context is referred to in the literature as attention noise. Imagine trying to locate one key piece of information in a long document while the model's attention is scattered across irrelevant passages, like a nearsighted reader who cannot bring the important part into focus.
To address this issue, the paper proposes the Differential Transformer (DIFF Transformer). Although the name sounds sophisticated, the principle is simple and works much like noise-canceling headphones, which cancel noise by subtracting one signal from another.
The core of the Differential Transformer is the differential attention mechanism. It splits the query and key vectors into two groups, computes two separate softmax attention maps, and then subtracts one map from the other to obtain the final attention scores. The process is like photographing the same scene with two cameras and subtracting one photo from the other, so that only the differences stand out.
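To make the idea concrete, here is a minimal single-head sketch of differential attention in PyTorch. The function name `diff_attention`, the fixed scalar `lam`, and the toy random weights are illustrative assumptions; in the paper, λ is a learnable re-parameterized scalar and the mechanism is used in a multi-head setting.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, W_q, W_k, W_v, lam=0.8):
    """Single-head differential attention sketch.

    x:        (seq_len, d_model) input
    W_q, W_k: (d_model, 2 * d_head) projections; the output is split
              into two groups of queries / keys
    W_v:      (d_model, d_head) value projection
    lam:      weight on the second attention map (learnable and
              re-parameterized in the paper; a fixed value here)
    """
    d_head = W_v.shape[1]

    q = x @ W_q            # (seq_len, 2 * d_head)
    k = x @ W_k            # (seq_len, 2 * d_head)
    v = x @ W_v            # (seq_len, d_head)

    # Split queries and keys into two groups.
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)

    scale = d_head ** -0.5
    a1 = F.softmax(q1 @ k1.T * scale, dim=-1)   # first attention map
    a2 = F.softmax(q2 @ k2.T * scale, dim=-1)   # second attention map

    # Differential attention: subtract the two maps so that attention
    # assigned to irrelevant positions by both maps cancels out.
    return (a1 - lam * a2) @ v


# Toy usage with random weights.
seq_len, d_model, d_head = 8, 32, 16
x = torch.randn(seq_len, d_model)
W_q = torch.randn(d_model, 2 * d_head) / d_model ** 0.5
W_k = torch.randn(d_model, 2 * d_head) / d_model ** 0.5
W_v = torch.randn(d_model, d_head) / d_model ** 0.5
print(diff_attention(x, W_q, W_k, W_v).shape)  # torch.Size([8, 16])
```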
Through this subtraction, the Differential Transformer can effectively cancel attention noise, allowing the model to concentrate on the key information. It is like putting on noise-canceling headphones: the surrounding noise disappears, and the sound you actually want comes through more clearly.
The paper presents a series of experiments to demonstrate the advantages of the Differential Transformer. First, it performs strongly in language modeling, matching the Transformer's performance with only about 65% of the model size or training tokens.
Second, the Differential Transformer excels at long-context modeling, making more effective use of the extended context.
More importantly, the Differential Transformer shows clear advantages in key information retrieval, hallucination mitigation, and in-context learning.
In key information retrieval, the Differential Transformer behaves like a precise search engine, locating the target content within a large amount of context and maintaining high accuracy even in very complex retrieval scenarios.
In mitigating hallucinations, the Differential Transformer keeps the model from "making things up," producing more accurate and reliable summaries and question-answering outputs.
In in-context learning, the Differential Transformer is like a top student, quickly picking up new tasks from a handful of examples, and its results are more stable and less sensitive to the ordering of those examples than the Transformer's.
Additionally, the Differential Transformer reduces outliers in the model's activation values, which makes it friendlier to quantization: activations can be quantized to lower bit widths, improving model efficiency.
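As a small, self-contained illustration (not taken from the paper) of why activation outliers make low-bit quantization harder: with symmetric int8 quantization, the scale is set by the largest absolute activation, so a single outlier stretches the quantization grid and increases the error on all the ordinary values.

```python
import torch

def int8_quantize_dequantize(x):
    """Symmetric per-tensor int8 quantization followed by dequantization.
    The scale is determined by the largest absolute value in the tensor."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale

torch.manual_seed(0)
acts = torch.randn(1024)            # well-behaved activations
acts_outlier = acts.clone()
acts_outlier[0] = 100.0             # one large outlier

# Measure the reconstruction error on the ordinary activations only.
err_clean = (acts - int8_quantize_dequantize(acts)).abs().mean()
err_outlier = (acts_outlier[1:] - int8_quantize_dequantize(acts_outlier)[1:]).abs().mean()
print(f"mean abs error without outlier: {err_clean:.4f}")
print(f"mean abs error with outlier:    {err_outlier:.4f}")
```

Fewer outliers mean a smaller scale, a finer quantization grid, and therefore less accuracy lost at low bit widths.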
In summary, the Differential Transformer uses the differential attention mechanism to address the Transformer's attention-noise problem and achieves clear improvements across multiple tasks. It offers a new direction for the development of large language models and may prove useful in a wider range of applications.