OpenAI has introduced a new AI safety method aimed at enhancing the security of AI systems by changing how they process safety rules. This new series of models, known as the o series, no longer solely relies on learning good and bad behaviors through examples, but can understand and actively reason through specific safety guidelines.
In OpenAI's research, an example was given where when a user attempted to obtain instructions for illegal activities through encrypted text, the model successfully decoded the information but refused the request, specifically citing the safety rules that would be violated. This step-by-step reasoning process demonstrates how the model effectively adheres to relevant safety guidelines.
The training process for the o1 model is divided into three phases. First, the model learns how to provide assistance. Next, through supervised learning, the model studies specific safety guidelines. Finally, the model uses reinforcement learning to practice applying these rules, helping it truly understand and internalize these safety guidelines.
In OpenAI's tests, the newly launched o1 model significantly outperformed other mainstream systems, such as GPT-4o, Claude3.5Sonnet, and Gemini1.5Pro, in terms of safety. The tests included how the model refused harmful requests while allowing appropriate ones, and the results showed that the o1 model achieved the highest scores in both accuracy and resistance to jailbreak attempts.
OpenAI co-founder Wojciech Zaremba expressed pride in this "deliberative alignment" work on social media, believing that this reasoning model can align in a new way. He noted that ensuring systems align with human values is a significant challenge, especially in the development of Artificial General Intelligence (AGI).
Despite OpenAI claiming progress, a hacker known as "Liberator Pliny" demonstrated that even the new o1 and o1-Pro models can be manipulated to bypass safety guidelines. Pliny successfully made the model generate adult content and even share instructions for making Molotov cocktails, despite the system initially refusing these requests. These incidents highlight the difficulty of controlling these complex AI systems, as they operate based on probabilities rather than strict rules.
Zaremba stated that OpenAI has about 100 employees dedicated to AI safety and ensuring alignment with human values. He questioned the safety practices of competitors, particularly Elon Musk's xAI company, which prioritizes market growth over safety measures, while Anthropic recently launched an AI agent without proper safeguards, which Zaremba believes could lead to "significant negative feedback" for OpenAI.
Official blog: https://openai.com/index/deliberative-alignment/
Key points:
🌟 OpenAI's new o series models can actively reason through safety rules, enhancing system security.
🛡️ The o1 model outperforms other mainstream AI systems in rejecting harmful requests and accuracy.
🚨 Despite improvements, the new models can still be manipulated, and safety challenges remain severe.