Japanese artificial intelligence company Sakana AI recently launched Transformer², an innovative approach designed to help language models adapt more efficiently to various tasks. Unlike existing AI systems, Transformer² addresses the limitations language models often face when encountering new tasks through a two-stage learning process, marking progress in the field of continual learning technologies.
Current AI systems are typically trained on many tasks in a single training run; when they later encounter tasks outside that set, they often struggle, which limits the model's adaptability. Transformer²'s design directly targets this issue: by employing expert vectors and Singular Value Fine-tuning (SVF), the model can flexibly respond to new tasks without retraining the entire network.
Transformer² adopts a different training approach from traditional methods. Traditional fine-tuning adjusts the weights of the entire neural network, which is not only costly but can also cause the model to "forget" previously learned knowledge. SVF avoids these issues by learning compact expert vectors that control the importance of each network connection: each vector rescales the singular values of a weight matrix, steering the model toward specific tasks such as mathematical calculation, programming, and logical reasoning.
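The singular-value mechanism can be sketched as follows. This is a minimal illustration rather than Sakana AI's implementation: the matrix shapes, the weight `W`, and the expert vector `z_math` are made-up values for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained weight matrix of one layer (shape is illustrative).
W = rng.standard_normal((8, 6))

# Decompose once: W = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# An "expert vector" z scales each singular value. Only len(s)
# numbers are trained per weight matrix; U, s, and Vt stay frozen.
z_math = np.ones_like(s)
z_math[0] *= 1.2  # hypothetical learned adjustment for a math expert

def adapted_weight(z):
    """Recompose the weight with singular values rescaled by z."""
    return U @ np.diag(s * z) @ Vt

# With z = 1 the original weights are recovered (up to float error),
# so an expert vector is a small, reversible edit to the layer.
assert np.allclose(adapted_weight(np.ones_like(s)), W)
```

Because only the scaling vector is trained, the frozen factors preserve the pre-trained structure, which is one reason this kind of adaptation is less prone to overwriting existing knowledge.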
This method significantly reduces the number of parameters needed for the model to adapt to new tasks. For example, the LoRA method requires 6.82 million parameters, while SVF only needs 160,000 parameters. This not only reduces memory and processing power consumption but also prevents the model from forgetting other knowledge when focusing on a specific task. Most importantly, these expert vectors can work effectively together, enhancing the model's adaptability to diverse tasks.
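Back-of-the-envelope arithmetic shows where the savings come from for a single weight matrix. The layer size `d`, `k` and the LoRA rank `r` below are assumed values, not figures from the article; the article's 6.82 million vs. 160,000 counts are totals over all adapted matrices.

```python
# Hypothetical layer dimensions and LoRA rank.
d, k, r = 4096, 4096, 8

# LoRA trains two low-rank factors A (d x r) and B (r x k).
lora_params = d * r + r * k

# SVF trains one scale per singular value: a vector of length min(d, k).
svf_params = min(d, k)

print(lora_params, svf_params)  # LoRA grows with rank; SVF stays a single vector
```

Under these assumed shapes the per-matrix ratio is 16x in SVF's favor, which is consistent in spirit with the totals the article reports.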
To further improve adaptability, Transformer² introduces reinforcement learning. During training, the model continuously optimizes the expert vectors by proposing task solutions and receiving feedback, thereby improving its performance on new tasks. The team developed three strategies to utilize this expert knowledge: adaptive prompting, task classification, and few-shot adaptation. Notably, the few-shot adaptation strategy enhances the model's flexibility and accuracy by analyzing examples of new tasks and adjusting the expert vectors.
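A rough sketch of the few-shot adaptation idea, under the assumption that adapting to a new task means searching for mixing weights over pre-trained expert vectors. The `experts` values, the scoring function, and the random-search loop are all synthetic stand-ins; the actual system evaluates real task examples and uses a more principled search.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 6  # singular values per weight matrix (illustrative)

# Hypothetical pre-trained expert vectors for three task domains.
experts = {
    "math": rng.uniform(0.8, 1.2, dim),
    "code": rng.uniform(0.8, 1.2, dim),
    "logic": rng.uniform(0.8, 1.2, dim),
}

def combine(weights):
    """Interpolate experts: z = sum_i alpha_i * z_i."""
    return sum(w * experts[name] for name, w in zip(experts, weights))

def few_shot_score(z):
    """Stand-in for accuracy on a handful of examples of the new task."""
    target = 0.7 * experts["math"] + 0.3 * experts["code"]
    return -np.sum((z - target) ** 2)

# Simple random search over mixing weights: propose, score, keep the best.
best_w, best_s = None, -np.inf
for _ in range(500):
    w = rng.dirichlet(np.ones(len(experts)))
    s = few_shot_score(combine(w))
    if s > best_s:
        best_w, best_s = w, s
```

The key point the sketch captures is that adaptation only searches a tiny mixing space over fixed expert vectors, rather than updating any network weights.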
In multiple benchmark tests, Transformer² outperformed traditional methods like LoRA. It achieved a 16% improvement in performance on mathematical tasks while significantly reducing the required parameters. When facing entirely new tasks, Transformer²'s accuracy was 4% higher than the original model, whereas LoRA failed to achieve the expected results.
Transformer² can not only solve complex mathematical problems but also combine programming and logical-reasoning abilities, facilitating cross-domain knowledge sharing. For instance, the team found that smaller models could improve their performance by receiving expert vectors transferred from larger models, opening new possibilities for knowledge sharing between models.
Despite Transformer²'s significant advances in task adaptability, it still faces some limitations. Expert vectors trained with SVF can only draw on capabilities already present in the pre-trained model; they cannot add entirely new skills. True continual learning, in which a model autonomously acquires new skills, remains a goal that will take time to reach. How to scale the technique to models exceeding 70 billion parameters is also an open question.