In deep learning, normalization layers have long been considered an indispensable component of modern neural networks. Recently, a paper titled "Transformers without Normalization," led by Zhuang Liu, a research scientist at Meta FAIR, has garnered significant attention. The work introduces a technique called Dynamic Tanh (DyT) and demonstrates that Transformer architectures can be trained and run efficiently without traditional normalization layers.

Normalization layers, especially Layer Normalization (LN), have played a crucial role in optimizing deep learning models over the past decade. LN stabilizes and accelerates convergence by normalizing input activations, which rescales them and squashes extreme values. However, the researchers found that the widespread use of LN layers is not the only option. Their work began by observing how LN layers actually behave, which led to DyT, a new alternative. This element-wise operation mimics the scaling and squashing effect of LN layers while eliminating the need to compute activation statistics such as the mean and variance.
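
To make this concrete, below is a minimal PyTorch sketch of such an element-wise operation, following the paper's description of DyT as a learnable, tanh-based squashing function. The parameter names (`alpha` for the learnable scalar, `gamma` and `beta` for the affine scale and shift) and the default initial value of `alpha` are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise Dynamic Tanh sketch: DyT(x) = gamma * tanh(alpha * x) + beta.

    alpha is a learnable scalar; gamma and beta are per-channel scale and shift.
    Names and the default init value are assumptions for illustration.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(1))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean or variance is computed; the squashing comes entirely from tanh.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```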

In experiments, the research team replaced the traditional normalization layers in several Transformer architectures with DyT. Results showed that models using DyT trained stably and reached final performance comparable to or higher than their normalized counterparts. Even more encouraging, the new method typically requires no hyperparameter re-tuning of the original architecture, which reduces the complexity of model training.
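
As a rough illustration of what such a drop-in replacement looks like in practice, the sketch below swaps `nn.LayerNorm` for the `DyT` module defined earlier inside a minimal pre-norm Transformer block. The block structure, dimensions, and the `use_dyt` flag are illustrative assumptions, not the configuration used in the paper's experiments.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block; DyT (from the sketch above) simply
    takes the place of LayerNorm, with no other architectural changes."""
    def __init__(self, dim: int = 512, heads: int = 8, use_dyt: bool = True):
        super().__init__()
        make_norm = (lambda: DyT(dim)) if use_dyt else (lambda: nn.LayerNorm(dim))
        self.norm1, self.norm2 = make_norm(), make_norm()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # self-attention sub-block
        x = x + self.mlp(self.norm2(x))  # feed-forward sub-block
        return x

block = TransformerBlock()
out = block(torch.randn(2, 16, 512))     # (batch, sequence, dim)
```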

By analyzing the forward pass of three different Transformer models, the researchers found that in early LN layers the mapping from input to output is nearly linear, while in deeper LN layers it bends into an S-shaped curve closely resembling the tanh function. This finding surprised the research team and provided strong empirical support for DyT's effectiveness.
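
One way to reproduce this kind of observation, sketched below under the assumption of a PyTorch model, is to register forward hooks on every LayerNorm module and record its inputs and outputs during a single forward pass; plotting the recorded pairs layer by layer reveals the nearly linear early layers and the tanh-like S-curves deeper in the network. The function name and the placeholder model and data are assumptions for illustration.

```python
import torch
import torch.nn as nn

def collect_ln_io(model: nn.Module, sample: torch.Tensor) -> dict:
    """Record (input, output) value pairs for every LayerNorm in one forward pass."""
    records, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            def hook(mod, inputs, output, name=name):
                records[name] = (inputs[0].detach().flatten(),
                                 output.detach().flatten())
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(sample)
    for h in hooks:
        h.remove()
    # Scatter-plot records[name][0] against records[name][1] for early vs. deep
    # layers to see the linear-to-S-shaped transition described above.
    return records
```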

Zhuang Liu said that this work gave him a deeper understanding of the role normalization layers play, and that he expects DyT to open new possibilities for reducing the cost of model training and inference. Going forward, DyT could become an important candidate in efficiency-driven network design and drive further progress in deep learning.