Google DeepMind has proposed a "Mixture of A Million Experts" model, a study that takes a significant step forward for the Transformer architecture.

Imagine a model capable of sparse retrieval from a million mini-experts - doesn't that sound a bit like a science fiction plot? Yet this is DeepMind's latest research result. The core of the work is a parameter-efficient expert retrieval (PEER) mechanism that uses product keys to decouple computational cost from parameter count, letting the Transformer grow its capacity while keeping compute under control.
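To make the idea concrete, below is a minimal, single-head sketch of a product-key expert-retrieval layer in PyTorch. The class name `PEERSketch`, the dimensions, and the single-head simplification are assumptions for illustration only; this is not DeepMind's implementation.

```python
# A minimal, single-head sketch of PEER-style product-key expert retrieval.
# All names and sizes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEERSketch(nn.Module):
    def __init__(self, d_model=256, n_side=32, top_k=8):
        super().__init__()
        self.n_side = n_side                  # sqrt of the expert count
        self.num_experts = n_side * n_side    # total experts N = n_side^2
        self.top_k = top_k
        half = d_model // 2

        # Two small sub-key tables; their Cartesian product indexes all N experts.
        self.sub_keys1 = nn.Parameter(torch.randn(n_side, half) * 0.02)
        self.sub_keys2 = nn.Parameter(torch.randn(n_side, half) * 0.02)

        # Each expert is a single-neuron MLP: one down vector and one up vector.
        self.down = nn.Embedding(self.num_experts, d_model)
        self.up = nn.Embedding(self.num_experts, d_model)

        self.query = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (batch, d_model)
        q = self.query(x)
        q1, q2 = q.chunk(2, dim=-1)

        # Score against each sub-key table: O(sqrt(N)) dot products per half.
        s1 = q1 @ self.sub_keys1.t()          # (batch, n_side)
        s2 = q2 @ self.sub_keys2.t()          # (batch, n_side)
        v1, i1 = s1.topk(self.top_k, dim=-1)  # best "rows"
        v2, i2 = s2.topk(self.top_k, dim=-1)  # best "columns"

        # Combine the two shortlists into k*k candidate experts, keep the top k.
        cand_scores = v1.unsqueeze(-1) + v2.unsqueeze(-2)            # (batch, k, k)
        cand_ids = i1.unsqueeze(-1) * self.n_side + i2.unsqueeze(-2)
        scores, pos = cand_scores.flatten(1).topk(self.top_k, dim=-1)
        expert_ids = cand_ids.flatten(1).gather(1, pos)              # (batch, k)

        # Run only the k retrieved single-neuron experts and mix their outputs.
        w_down = self.down(expert_ids)        # (batch, k, d_model)
        w_up = self.up(expert_ids)            # (batch, k, d_model)
        h = F.gelu((w_down * x.unsqueeze(1)).sum(-1))                # (batch, k)
        gates = F.softmax(scores, dim=-1) * h
        return (gates.unsqueeze(-1) * w_up).sum(1)                   # (batch, d_model)


if __name__ == "__main__":
    layer = PEERSketch()
    out = layer(torch.randn(4, 256))
    print(out.shape)  # torch.Size([4, 256])
```

Note the design point this sketch is meant to show: the expert tables (`down`, `up`) can be made enormous, but each token only ever touches the `top_k` experts selected by the two cheap sub-key lookups.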


The highlight of this work lies in its exploration of extreme MoE settings and the first demonstration that a learned index structure can effectively route to over a million experts. It's as if, amidst a sea of people, we can quickly find the few experts who can solve the problem, all while keeping the computational cost under control.
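A rough sense of why this scales: with product keys, routing over roughly a million experts only requires scoring two sub-key tables of about a thousand entries each, then merging a small candidate set. The numbers below are illustrative choices, not figures from the paper.

```python
# Back-of-the-envelope routing cost comparison (illustrative numbers only).
import math

N = 1024 * 1024                  # roughly "over a million" experts
sqrt_n = math.isqrt(N)           # side length of the product-key grid
k = 16                           # experts retrieved per token (assumed)

naive_comparisons = N                         # score the query against every expert key
product_key_comparisons = 2 * sqrt_n + k * k  # two sub-key scans plus a k*k merge

print(f"naive routing:       {naive_comparisons:,} key comparisons per query")
print(f"product-key routing: {product_key_comparisons:,} key comparisons per query")
```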

In experiments, the PEER architecture demonstrated an excellent compute-performance tradeoff, outperforming dense feedforward (FFW) layers, coarse-grained MoE, and product key memory (PKM) layers in efficiency. This is not just a theoretical result but a practical gain: empirically, PEER achieves lower perplexity on language modeling tasks, and ablations over the number of experts and the number of active experts show consistent performance improvements.
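One way to see the decoupling these ablations exploit: the layer's total parameter count grows linearly with the number of experts, while per-token compute tracks only the handful of active experts. The sizes below are assumed for illustration and do not reproduce the paper's configurations.

```python
# Illustrative parameter accounting under assumed sizes (not the paper's configs).
d_model = 1024
N = 1024 * 1024        # total single-neuron experts
k = 16                 # active experts per token (assumed)

params_per_expert = 2 * d_model             # one down vector + one up vector
total_expert_params = N * params_per_expert  # grows with N (capacity)
active_params_per_token = k * params_per_expert  # fixed per-token compute

print(f"total expert parameters: {total_expert_params:,}")      # ~2.1B
print(f"active params per token: {active_params_per_token:,}")  # ~33K
```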

The sole author of this study, Xu He (Owen), is a research scientist at Google DeepMind, and this single-author work brings fresh insight to the AI field. Beyond this paper, he has also explored personalized, intelligent methods for improving conversion and user retention, a topic of particular relevance in the AIGC field.

Paper address: https://arxiv.org/abs/2407.04153