OpenAI officially released its highly anticipated AI language model "o1" on Thursday. The new model, previously code-named "Strawberry," claims significant improvements in "reasoning" and problem-solving over earlier large language models. The o1 series currently comes in two forms, o1-preview and o1-mini, and is available to ChatGPT Plus subscribers and some API users.
One of o1's most notable features is its human-like thinking process. Before answering a question, o1 enters a deliberate thinking mode: it breaks a complex problem into smaller steps, solves them in sequence, and generates a longer internal chain of thought, arriving at more accurate answers.
This technique, which Google DeepMind researchers call "test-time computation," centers on spending extra compute at inference time: searching over candidate solutions with a process-oriented verifier (reward model) and adaptively updating the model's distribution over responses.
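The simplest form of this idea is best-of-n search: sample several candidate answers, score each with a verifier, and keep the highest-scoring one. The sketch below is a toy illustration of that pattern only; the `generate_candidates` and `verifier_score` functions are stand-ins, not OpenAI's or DeepMind's actual components.

```python
import random

def generate_candidates(question, n=8):
    """Stand-in for sampling n candidate solutions from a language model.
    Here we fake it with noisy guesses around a known target answer."""
    true_answer = 42
    return [true_answer + random.choice([-2, -1, 0, 0, 0, 1]) for _ in range(n)]

def verifier_score(question, answer):
    """Stand-in for a verifier / reward model: higher is better.
    Here it simply rewards closeness to the known target."""
    return -abs(answer - 42)

def best_of_n(question, n=8):
    """Test-time computation via best-of-n search: spend more compute at
    inference by sampling many candidates and keeping the one the verifier
    scores highest."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda a: verifier_score(question, a))

random.seed(0)
print(best_of_n("What is 6 * 7?"))  # more samples make hitting 42 more likely
```

Raising `n` trades latency for accuracy, which is exactly the trade-off users observe in o1's longer response times.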
Key Points:
The o1-preview and o1-mini versions take longer to respond, "thinking" slowly before answering, much as humans do;
The o1 series is still in a testing phase and supports text only; other features such as internet access, image generation, and file import are pending;
During the beta, API access is limited to 20 requests per minute;
The API does not yet support function calling, streaming output, or system messages.
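A 20-requests-per-minute quota is easy to hit in a loop, so client-side pacing helps. Below is a minimal throttle sketch; the `Throttle` class is purely illustrative and not part of the OpenAI SDK, and the limit constant reflects the beta figure reported above.

```python
import time

RPM_LIMIT = 20                    # beta rate limit reported for the o1 API
MIN_INTERVAL = 60.0 / RPM_LIMIT   # at most one request every 3 seconds

class Throttle:
    """Client-side pacing so calls never exceed a per-minute quota."""

    def __init__(self, min_interval=MIN_INTERVAL):
        self.min_interval = min_interval
        self._last = None  # timestamp of the previous call, if any

    def wait(self, now=None, sleep=time.sleep, clock=time.monotonic):
        """Block until enough time has passed since the last call,
        then record and return the (possibly advanced) current time.
        `sleep` and `clock` are injectable for testing."""
        now = clock() if now is None else now
        if self._last is not None:
            delay = self._last + self.min_interval - now
            if delay > 0:
                sleep(delay)
                now += delay
        self._last = now
        return now
```

Calling `throttle.wait()` before each API request keeps a burst of calls spaced at least `MIN_INTERVAL` seconds apart.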
OpenAI states that o1-preview has surpassed its predecessor GPT-4o in several benchmark tests, including competitive programming, mathematics, and "scientific reasoning".
In programming, o1-preview ranks in the 89th percentile on competitive programming problems from Codeforces.
On a qualifying exam for the U.S. Math Olympiad, o1 performed on par with the top 500 students in the country. Its mathematical ability is striking: it scored 83% on a qualifying exam for the International Mathematical Olympiad, versus 13% for GPT-4o.
More strikingly, o1 is the first model to surpass PhD-level human accuracy on physics, biology, and chemistry benchmarks, marking a breakthrough in AI's complex-reasoning capabilities.
The advances in o1 are attributed primarily to a new reinforcement learning training method, which teaches the model to spend more time "thinking" before answering, much as "let's think step by step" chain-of-thought prompts do for other large language models. This process lets o1 try different strategies and recognize its own mistakes.
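For comparison, zero-shot chain-of-thought prompting looks like the sketch below. The helper name and prompt wording are illustrative assumptions; the point of o1's training is that it internalizes this behavior rather than relying on such a prompt.

```python
def chain_of_thought_prompt(question: str) -> str:
    """Wrap a question in a zero-shot chain-of-thought prompt.
    Asking the model to reason step by step nudges it to emit
    intermediate reasoning before committing to a final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer "
        "on its own line, prefixed with 'Answer:'."
    )

prompt = chain_of_thought_prompt(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
print(prompt)
```

The resulting string would be sent as the user message to any chat model; with o1, OpenAI says the equivalent deliberation happens internally before the visible answer is produced.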
OpenAI states that it will continue to develop the o1 and GPT series models and plans to add web browsing, image generation, and file upload functionalities to o1-preview.
However, these impressive numbers are not without controversy. User feedback suggests that o1 does not outperform GPT-4o on every metric.
Additionally, the longer response time due to multi-step processing in the background has drawn some criticism. Joanne Jang, OpenAI's product manager, stated on social media: "o1 is the first reasoning model to perform exceptionally well on extremely difficult tasks, and it will only get better. But it is not yet a 'miracle model' that outperforms previous models in all aspects."
It is worth noting that AI benchmarks are often unreliable and easy to game; o1's true capabilities will need to be verified through independent testing by users and researchers. Earlier this year, research from MIT showed that some of OpenAI's benchmark claims about GPT-4 from last year were incorrect or exaggerated.
In addition to performance improvements, o1 has also sparked discussions about AI "reasoning" capabilities. Some in the tech community believe that attributing human characteristics such as "thinking" or "reasoning" to AI models is inappropriate.
Official documentation: https://openai.com/index/introducing-openai-o1-preview/