In the realm of artificial intelligence, language models are often likened to a mysterious black box: we feed in text and meaning comes out, but what exactly happens in between? Google DeepMind's latest release, Gemma Scope, sheds light on this enigma.
Language model activations are commonly hypothesized to be sparse linear combinations of interpretable feature directions, but which combinations carry which meanings remains elusive. Sparse Autoencoders (SAEs), an unsupervised learning method, are a promising tool for this problem, but the technique is still in its infancy: training is expensive and research progress has been slow.
Google DeepMind has now trained and released Gemma Scope, a suite of Sparse Autoencoders trained on activations of the Gemma 2 models. Each SAE decomposes an activation vector into sparse latent features through an encoder and reconstructs it through a decoder, aiming to surface meaningful, human-interpretable features.
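To make the decompose-and-reconstruct idea concrete, here is a minimal sketch of such an SAE in PyTorch. The names and dimensions (`d_model`, `d_sae`) and the ReLU placeholder are illustrative assumptions, not Gemma Scope's actual architecture or configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: decompose a model activation x into sparse
    latent features f, then reconstruct x from them."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse feature activations; plain ReLU is a placeholder here
        # (Gemma Scope itself uses JumpReLU, shown below).
        return torch.relu(x @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Reconstruct the activation as a linear combination of
        # decoder directions, one per latent feature.
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f
```

Each row of `W_dec` is a candidate feature direction, so an activation is reconstructed as a sparse linear combination of these directions, matching the hypothesis described earlier.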
Gemma Scope employs innovative JumpReLU SAEs, which use a shifted Heaviside step function as a gating mechanism: a latent feature only fires when its pre-activation exceeds a learned threshold, which effectively controls the number of active features. This design not only optimizes reconstruction loss but also directly regularizes the number of active latent features.
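As a rough illustration of this gating mechanism, the sketch below shows a JumpReLU activation alongside a loss that pairs reconstruction error with a direct L0 penalty on the active-feature count. The thresholds are learned (the paper trains them with straight-through gradient estimators, omitted here), and `sparsity_coeff` is an assumed placeholder value:

```python
import torch

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: pass the pre-activation through only where it exceeds
    the per-latent threshold theta; otherwise output exactly zero.
    The Heaviside step (z > theta) is the gate described above."""
    return z * (z > theta).to(z.dtype)

def sae_loss(x, x_hat, z, theta, sparsity_coeff=1e-3):
    """Sketch of the objective: reconstruction error plus a direct
    penalty on how many latents are active (the L0 norm)."""
    recon = (x - x_hat).pow(2).sum(-1).mean()
    l0 = (z > theta).to(x.dtype).sum(-1).mean()  # active-feature count
    return recon + sparsity_coeff * l0
```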
Gemma Scope has been meticulously trained on the activations of the Gemma 2 models. During training, the model's activation vectors are normalized, and SAEs are trained at multiple layers and sites, including attention head outputs, MLP outputs, and the post-MLP residual stream.
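A minimal sketch of that pipeline: capture activations at a chosen site with a forward hook, then rescale them by a fixed scalar so they have unit mean squared norm. The toy module, hook placement, and the way the scalar is estimated here are assumptions for illustration, not Gemma Scope's exact recipe:

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer sub-block; names are illustrative.
block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

captured = {}
def capture_hook(module, inputs, output):
    # Record the activation an SAE would be trained on at this site.
    captured["acts"] = output.detach()

handle = block[0].register_forward_hook(capture_hook)
block(torch.randn(4, 16))  # forward pass fills captured["acts"]
handle.remove()

# Rescale by a fixed scalar so activations have unit mean squared norm.
acts = captured["acts"]
scale = acts.norm(dim=-1).pow(2).mean().sqrt()
normalized = acts / scale
```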
Gemma Scope's performance has been evaluated from multiple angles. Experiments show that residual stream SAEs generally incur higher delta loss, and that sequence length significantly affects SAE performance. Performance also varies across dataset subsets, with Gemma Scope performing best on the DeepMind mathematics subset.
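For reference, delta loss measures the increase in the model's next-token cross-entropy when a layer's activations are replaced by the SAE's reconstruction during the forward pass. A minimal sketch, assuming the spliced logits have already been obtained via a model-specific hook (not shown):

```python
import torch
import torch.nn.functional as F

def delta_loss(logits_clean, logits_spliced, targets):
    """Increase in cross-entropy caused by splicing the SAE's
    reconstruction into the forward pass. logits_clean come from the
    unmodified model, logits_spliced from the run with substitution."""
    ce_clean = F.cross_entropy(logits_clean, targets)
    ce_spliced = F.cross_entropy(logits_spliced, targets)
    return ce_spliced - ce_clean
```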
The release of Gemma Scope offers potential inroads on a range of open questions. It can deepen our understanding of SAEs themselves, improve performance on practical tasks, and enable red-teaming of SAEs to determine whether they have truly identified "real" concepts in the model.
With Gemma Scope in hand, we are poised to make significant strides in AI interpretability and safety. It will help us better understand the internal workings of language models, enhancing their transparency and reliability.
Paper link: https://storage.googleapis.com/gemma-scope/gemma-scope-report.pdf
Try it online: https://www.neuronpedia.org/gemma-scope#main