In the world of AI, transformation often comes unexpectedly. Recently, a new architecture called TTT, proposed by researchers from Stanford, UCSD, UC Berkeley, and Meta, has emerged. It challenges both the Transformer and Mamba, and could bring far-reaching changes to language models.

TTT, short for Test-Time Training, is a new class of layers that compresses the context through gradient descent, directly replacing the traditional attention mechanism. Because the hidden state is itself a small model trained on the fly, this approach not only improves efficiency but also gives linear-complexity architectures an expressive memory, opening a path toward language models conditioned on millions, or even billions, of tokens of context.
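
To make the idea concrete, here is a minimal sketch of a TTT-Linear-style update in PyTorch, assuming the hidden state is a single weight matrix trained with one gradient step per token on a simple reconstruction loss. The corruption scheme, learning rate, and function names are illustrative assumptions, not the authors' implementation, which adds learned projections, mini-batched updates, and other refinements.

```python
import torch

def ttt_linear_forward(tokens, lr=0.1):
    """Sketch of a TTT-style layer: the hidden state is itself a linear
    model W, updated by one gradient-descent step per token on a
    self-supervised reconstruction loss (illustrative, not the paper's code)."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                    # hidden state: weights of a linear model
    outputs = []
    for x in tokens:                         # x: (d,) token embedding
        x_tilde = x * (torch.rand(d) > 0.2)  # crude corruption for the self-supervised task
        # gradient of ||W @ x_tilde - x||^2 with respect to W
        grad = 2 * torch.outer(W @ x_tilde - x, x_tilde)
        W = W - lr * grad                    # "training" the hidden state at test time
        outputs.append(W @ x)                # output rule: apply the updated model
    return torch.stack(outputs)

# Example: a sequence of 16 random 8-dimensional token embeddings
z = ttt_linear_forward(torch.randn(16, 8))
print(z.shape)  # torch.Size([16, 8])
```

Note that the per-token cost here does not depend on how many tokens came before: all past context lives inside W, which is what gives the layer its linear complexity.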

The TTT layer is based on a sharp insight into existing RNN and Transformer architectures. RNNs are efficient, but compressing the context into a fixed-size hidden state limits their expressiveness; Transformers are highly expressive, but their computational cost grows quadratically with context length. The TTT layer combines the advantages of both, maintaining linear complexity while enhancing expressiveness by making the hidden state itself a model that is updated as the sequence is processed.
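
In equation form, the mechanism can be sketched as follows (a hedged reading of the paper's framing, where η is an inner-loop learning rate, ℓ a self-supervised loss, and x̃_t a corrupted view of the token x_t):

```latex
W_t = W_{t-1} - \eta \, \nabla \ell(W_{t-1}; x_t),
\qquad \ell(W; x_t) = \lVert f(\tilde{x}_t; W) - x_t \rVert^2,
\qquad z_t = f(x_t; W_t)
```

Each incoming token thus performs one step of training on the hidden-state model W, and the layer's output z_t is the prediction of that freshly updated model.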

In experiments, both variants, TTT-Linear (whose hidden state is a linear model) and TTT-MLP (whose hidden state is a two-layer MLP), performed strongly, matching or outperforming Transformer and Mamba in both short and long contexts. The advantage of the TTT layer is especially pronounced in long contexts: like the Transformer, it keeps reducing perplexity as it conditions on more tokens, whereas Mamba stops improving beyond 16k tokens.

The TTT layer is not only a theoretical innovation; it also shows great practical promise. In the future, TTT layers could be applied to long video modeling, where densely sampled frames provide richer information, a burden for Transformers but a blessing for TTT layers.

This research is the fruit of five years of work by the team, beginning in Dr. Yu Sun's postdoctoral period. Their persistence through continued exploration and repeated attempts ultimately produced this breakthrough.

The emergence of the TTT layer brings new vitality and possibilities to the field of AI. It not only changes how we understand language models but also opens new paths for future AI applications. It will be worth watching how TTT layers are applied and developed from here.

For more details, see the paper: https://arxiv.org/abs/2407.04620