In deep learning, normalization layers have long been considered an indispensable component of modern neural networks. Recently, a paper titled "Transformers without Normalization," led by Zhuang Liu, a research scientist at Meta FAIR, has garnered significant attention. The work introduces a technique called Dynamic Tanh (DyT) and demonstrates that Transformer architectures can be trained and run efficiently without traditional normalization layers.

Normalization layers, especially Layer Normalization (LN), have played a crucial role in optimizing deep learning models over the past decade. LN stabilizes and accelerates convergence by normalizing input activations, which rescales them and squashes extreme values. However, the researchers found that the widespread use of LN layers is not the only option. Their work began by observing how LN layers actually behave, which led to DyT, a new alternative. This element-wise operation mimics the scaling and squashing effect of LN layers while eliminating the need to compute activation statistics such as the mean and variance.
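
To make this concrete, below is a minimal PyTorch sketch of such an element-wise operation, following the paper's description of DyT as a learnable, tanh-based squashing function. The parameter names (`alpha` for the learnable scalar, `gamma` and `beta` for the affine scale and shift) and the default initial value of `alpha` are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise Dynamic Tanh sketch: DyT(x) = gamma * tanh(alpha * x) + beta.

    alpha is a learnable scalar; gamma and beta are per-channel scale and shift.
    Names and the default init value are assumptions for illustration.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(1))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean or variance is computed; the squashing comes entirely from tanh.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```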

In experiments, the research team replaced the traditional normalization layers in several Transformer architectures with DyT. Results showed that models using DyT trained stably and reached final performance comparable to or higher than their normalized counterparts. Even more encouraging, the new method typically requires no hyperparameter re-tuning of the original architecture, which reduces the complexity of model training.
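
As a rough illustration of what such a drop-in replacement looks like in practice, the sketch below swaps `nn.LayerNorm` for the `DyT` module defined earlier inside a minimal pre-norm Transformer block. The block structure, dimensions, and the `use_dyt` flag are illustrative assumptions, not the configuration used in the paper's experiments.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block; DyT (from the sketch above) simply
    takes the place of LayerNorm, with no other architectural changes."""
    def __init__(self, dim: int = 512, heads: int = 8, use_dyt: bool = True):
        super().__init__()
        make_norm = (lambda: DyT(dim)) if use_dyt else (lambda: nn.LayerNorm(dim))
        self.norm1, self.norm2 = make_norm(), make_norm()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # self-attention sub-block
        x = x + self.mlp(self.norm2(x))  # feed-forward sub-block
        return x

block = TransformerBlock()
out = block(torch.randn(2, 16, 512))     # (batch, sequence, dim)
```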

By analyzing the forward pass of three different Transformer models, the researchers found that in early LN layers the mapping from input to output is nearly linear, while in deeper LN layers it bends into an S-shaped curve closely resembling the tanh function. This finding surprised the research team and provided strong empirical support for DyT's effectiveness.
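
One way to reproduce this kind of observation, sketched below under the assumption of a PyTorch model, is to register forward hooks on every LayerNorm module and record its inputs and outputs during a single forward pass; plotting the recorded pairs layer by layer reveals the nearly linear early layers and the tanh-like S-curves deeper in the network. The function name and the placeholder model and data are assumptions for illustration.

```python
import torch
import torch.nn as nn

def collect_ln_io(model: nn.Module, sample: torch.Tensor) -> dict:
    """Record (input, output) value pairs for every LayerNorm in one forward pass."""
    records, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            def hook(mod, inputs, output, name=name):
                records[name] = (inputs[0].detach().flatten(),
                                 output.detach().flatten())
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(sample)
    for h in hooks:
        h.remove()
    # Scatter-plot records[name][0] against records[name][1] for early vs. deep
    # layers to see the linear-to-S-shaped transition described above.
    return records
```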

Zhuang Liu said that this work gave him a deeper understanding of the role normalization layers play, and that he expects DyT to open new possibilities for reducing the cost of model training and inference. Going forward, DyT could become an important candidate in efficiency-driven network design and drive further progress in deep learning.