Recently, OpenAI released its latest o3 and o4-mini AI models, which achieve state-of-the-art performance in many respects. However, the new models have not improved on the problem of "hallucination"; in fact, they hallucinate more severely than several of OpenAI's earlier models.

"Hallucination" refers to AI models incorrectly generating false information, one of the most challenging problems in AI today. Previous generations of models showed improvements in reducing hallucinations, but o3 and o4-mini break this trend. According to OpenAI's internal testing, these AI models, known as reasoning models, exceed the hallucination frequency of the company's previous generations of reasoning models and traditional non-reasoning models like GPT-4o.


OpenAI's technical report shows that o3 hallucinates on 33% of questions in the PersonQA benchmark, roughly double the rate of the earlier o1 and o3-mini models (16% and 14.8%, respectively). The o4-mini model fares even worse, reaching a hallucination rate of 48% on PersonQA.

Third-party testing lab Transluce also found that o3 frequently fabricates actions it claims to have taken when answering questions. In one example, o3 claimed to have run code on a 2021 MacBook Pro and copied the results into its answer, even though it has no ability to do so.

Transluce researchers suggest that the reinforcement learning used for the o-series models may amplify issues that standard post-training pipelines would normally mitigate. These fabrications can materially reduce o3's usefulness in practice: a Stanford University adjunct professor who tested o3 in a programming workflow found that the model generated broken website links, hurting the user experience.

While hallucination can, to some extent, foster creative thinking in models, frequent factual errors pose significant problems in industries demanding high accuracy, such as the legal field.

One promising way to improve model accuracy is to give models web search capabilities. With web search enabled, OpenAI's GPT-4o reaches 90% accuracy on the SimpleQA benchmark, suggesting that search functionality could also alleviate hallucination in reasoning models.
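For readers curious what "granting web search capabilities" looks like in practice, the sketch below shows one way to ask a factual question with search enabled. It assumes the OpenAI Python SDK's Responses API and its web-search tool; the tool type string, parameter names, and supported models here are assumptions and may differ across SDK versions.

```python
# Minimal sketch: asking a model a factual question with web search enabled.
# Assumes the OpenAI Python SDK's Responses API; the "web_search_preview"
# tool type and the output fields are assumptions that may vary by version.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # let the model search before answering
    input="Who is the current CEO of OpenAI, and when did they take the role?",
)

# The answer should now be grounded in retrieved sources rather than recall alone.
print(response.output_text)
```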

However, if the hallucination problem in reasoning models worsens with scaling, the urgency of finding a solution will increase. OpenAI states that ongoing research is focused on improving the accuracy and reliability of all its models.

Over the past year, the AI industry has shifted its focus toward reasoning models, as improvements to traditional AI models have shown diminishing returns. However, the emergence of reasoning models seems to have brought more hallucinations, presenting new challenges for future development.

Key takeaways:

🌟 OpenAI's new reasoning models, o3 and o4-mini, exhibit higher hallucination frequencies than previous models.

🤖 o3 has a 33% hallucination rate in the PersonQA benchmark test, while o4-mini reaches a staggering 48%.

🔍 A potential method to improve model accuracy and reduce hallucinations is to introduce web search functionality.