Recently, the open-source community received exciting news: a team from Shanghai AI Lab has released the LLaMA version of the o1 project, aiming to replicate o1, OpenAI's reasoning model known for solving mathematical Olympiad problems. The project incorporates several advanced techniques, including Monte Carlo Tree Search, self-play reinforcement learning, PPO, and the dual policy paradigm of AlphaGo Zero, and has attracted widespread attention in the developer community.

Before OpenAI released the o1 series, the Shanghai AI Lab team had already begun exploring the use of Monte Carlo Tree Search to enhance the mathematical capabilities of large models. After o1's release, the team further upgraded the algorithm, focusing on mathematical Olympiad problems and developing it into an open-source counterpart to OpenAI's Strawberry project.

To improve the LLaMA model's performance on mathematical Olympiad problems, the team adopted a pairwise optimization strategy: rather than assigning an absolute score to each answer, the model compares the relative merits of two answers. With this approach, they achieved significant progress on AIME 2024, the most difficult of the benchmarks tested. Of its 30 questions, the optimized model answered 8 correctly, while the original LLaMA-3.1-8B-Instruct answered only 2. This result surpasses all other commercial closed-source solutions except o1-preview and o1-mini.
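The report does not spell out the exact objective behind this pairwise comparison, so the following is only a minimal sketch of a Bradley-Terry-style pairwise preference loss, a common way to train on relative judgments between two answers. The function name and toy scores are illustrative and not taken from the project's code.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective: push the preferred answer's score
    above the rejected answer's score, using only their difference."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: scalar scores that a reward/critic head might assign to two candidate answers.
preferred = torch.tensor([1.2, 0.3], requires_grad=True)
rejected = torch.tensor([0.7, 0.9], requires_grad=True)
loss = pairwise_preference_loss(preferred, rejected)
loss.backward()  # gradients widen the preferred-minus-rejected margin
print(loss.item())
```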

At the end of October, the team announced significant progress in replicating OpenAI's o1 based on the AlphaGo Zero architecture: the model acquired advanced reasoning abilities by interacting with the search tree during training, without any manual labeling. Within a week, the project was open-sourced.

Currently, the open-sourced content of the LLaMA version of o1 includes a pre-training dataset, pre-trained models, and reinforcement learning training code. The OpenLongCoT-Pretrain dataset contains over 100,000 long chain-of-thought records. Each record captures a complete mathematical reasoning process, including the thinking content, scoring results, problem description, graph coordinates, calculation steps, and derived conclusions, along with critique and validation content for each reasoning step, providing evaluation and guidance signals for the reasoning process. After continued pre-training on this dataset, the model can read and produce long chain-of-thought outputs in the style of o1.
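As a quick way to inspect what these records look like, here is a minimal sketch that loads the dataset with the Hugging Face datasets library. The repo id SimpleBerry/OpenLongCoT-Pretrain is an assumption based on the account name mentioned below, and the field names should be checked against the actual release.

```python
from datasets import load_dataset

# Repo id is an assumption; check the project's release page for the exact name.
ds = load_dataset("SimpleBerry/OpenLongCoT-Pretrain", split="train")

record = ds[0]
print(record.keys())  # inspect which fields (problem, reasoning steps, critiques, ...) are present
print(record)         # one long chain-of-thought record with its evaluation annotations
```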

Although the project is named LLaMA-O1, the pre-trained model currently provided is based on Google's Gemma 2. On this basis, developers can continue with reinforcement learning training. The training process consists of: using Monte Carlo Tree Search self-play to generate experiences; storing those experiences in a prioritized experience replay buffer; sampling batches from the buffer for training; and updating the model parameters and the experience priorities. The training code also relies on several key techniques: parameter-efficient fine-tuning with LoRA, PPO as the policy optimization method, GAE for computing the advantage function, and prioritized experience replay to improve training efficiency.
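To make the loop described above concrete, here is a minimal, self-contained illustration of two of its building blocks, Generalized Advantage Estimation and prioritized experience replay. The class and function names are mine, not the project's; the real training code additionally wraps the policy in LoRA adapters and applies PPO updates to the sampled batches.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.
    `values` must carry one extra trailing entry as the bootstrap value."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

class PrioritizedReplayBuffer:
    """Replay buffer that samples experiences in proportion to priority**alpha."""

    def __init__(self, capacity=10_000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.items, self.priorities = [], []

    def add(self, experience, priority=1.0):
        if len(self.items) >= self.capacity:  # evict the oldest entry when full
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(experience)
        self.priorities.append(priority)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.items), size=batch_size, p=probs)
        return idx, [self.items[i] for i in idx]

    def update_priorities(self, idx, new_priorities):
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = float(p)

# Shape of the loop: self-play fills the buffer, training samples from it.
buffer = PrioritizedReplayBuffer()
for step in range(16):                      # stand-in for MCTS self-play rollouts
    buffer.add({"trajectory": step}, priority=np.random.rand())
idx, batch = buffer.sample(batch_size=4)    # a PPO update would consume this batch
adv = compute_gae(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6, 0.0])
buffer.update_priorities(idx, np.random.rand(len(idx)))  # refresh priorities after the update
```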

It is worth noting that the LLaMA-O1 code is released under a GitHub account named SimpleBerry, which has no detailed profile and appears rather mysterious. From SimpleBerry's other accounts and related official information, one can only infer that it is a research laboratory; nothing further about its research direction is disclosed.

In addition to LLaMA-O1, another o1 replication project with publicly shared progress is O1-Journey from a Shanghai Jiao Tong University team. The team released its first progress report in early October, introducing the innovative Journey Learning paradigm and the first model to successfully integrate search and learning into mathematical reasoning. The core development team of O1-Journey consists mainly of junior and senior undergraduates at Shanghai Jiao Tong University, along with first-year PhD students from the GAIR Lab (Generative Artificial Intelligence Research Lab); advisors include Associate Professor Liu Pengfei and Li Yuanzhi, a Yao Class alumnus and Sloan Research Fellowship winner.

Paper links: https://arxiv.org/pdf/2410.02884

https://arxiv.org/pdf/2406.07394