A recent OpenAI study finds that, despite rapid advances in artificial intelligence, even the most capable language models answer factual questions correctly far less often than many users would expect.

The study used OpenAI's own SimpleQA benchmark, a set of 4,326 short fact-seeking questions spanning fields such as science, politics, and art, each with a single clearly correct answer.
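
To make the setup concrete, here is a minimal sketch of what a SimpleQA-style item and grading step might look like. This is an illustration, not the benchmark's actual schema or grader: SimpleQA reportedly uses a language-model classifier to sort responses into correct, incorrect, and not-attempted categories, whereas the fallback below uses naive substring matching, and the `QAItem` fields and sample question are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str  # short, fact-seeking question
    answer: str    # the single accepted ground-truth answer

def grade(item: QAItem, model_response: str) -> str:
    """Classify a response as 'correct', 'incorrect', or 'not_attempted'.

    Naive substring matching stands in for the benchmark's reported
    LLM-based grader, which can recognize paraphrased answers.
    """
    response = model_response.strip().lower()
    if not response or "i don't know" in response:
        return "not_attempted"
    return "correct" if item.answer.lower() in response else "incorrect"

item = QAItem(
    question="In what year did the first human spaceflight take place?",
    answer="1961",
)
print(grade(item, "That was 1961, with Yuri Gagarin aboard Vostok 1."))  # -> correct
```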

With each answer validated by two independent reviewers, the results showed that OpenAI's best model, o1-preview, answered only 42.7% of the questions correctly, while GPT-4o came in slightly lower at 38.2% and the smaller GPT-4o-mini managed just 8.6%. Anthropic's Claude models fared worse still, with Claude-3.5-sonnet correct only 28.9% of the time.

The crux of the study lies in the test's design, which aims not only to measure AI performance but also to highlight the limits of what these models actually know. The researchers stress that users should treat the models as information-processing tools rather than fully reliable sources of knowledge: the surest way to get accurate answers is to supply the AI with reliable data, not to rely solely on its built-in knowledge.

It is also noteworthy that the models consistently overestimate their own abilities. When asked to rate confidence in their answers, they typically report inflated accuracy figures. And in repeated trials of the same questions, even when a model gave the same answer every time, its actual success rate stayed below its self-assessed accuracy. This matches a long-standing criticism of language models: they can produce nonsensical answers while sounding utterly sure of themselves.
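
The calibration check described above can be expressed as a few lines of code: bucket answers by the model's stated confidence, then compare each bucket's average stated confidence with its empirical accuracy. The sketch below uses made-up records purely for illustration; a well-calibrated model would show the two numbers roughly matching, while an overconfident one shows stated confidence well above actual accuracy.

```python
from collections import defaultdict

# Illustrative records: the model's stated confidence (percent) for an
# answer, and whether that answer was actually correct.
records = [
    {"stated_confidence": 90, "correct": True},
    {"stated_confidence": 90, "correct": False},
    {"stated_confidence": 70, "correct": True},
    {"stated_confidence": 70, "correct": False},
    {"stated_confidence": 70, "correct": False},
]

# Group answers into 10-point confidence buckets.
buckets = defaultdict(list)
for r in records:
    buckets[r["stated_confidence"] // 10 * 10].append(r)

# Compare stated confidence to empirical accuracy per bucket.
for lo in sorted(buckets):
    group = buckets[lo]
    stated = sum(r["stated_confidence"] for r in group) / len(group)
    actual = 100 * sum(r["correct"] for r in group) / len(group)
    print(f"confidence {lo}-{lo + 9}%: stated {stated:.0f}%, actual {actual:.0f}%")
```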

The researchers believe current AI systems have a significant gap in factual accuracy that urgently needs to be closed. They also pose an open question: does performance on short factual questions predict how a model will handle longer, more complex responses? To support the development of more reliable language models, OpenAI has publicly released the SimpleQA benchmark data on GitHub.
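
For readers who want to examine the questions themselves, a first look might resemble the sketch below. Note the assumptions: it presumes the test set has been downloaded as a CSV from OpenAI's simple-evals GitHub repository, and the filename and the `problem`/`answer` column names are guesses for illustration, not confirmed details of the release.

```python
import pandas as pd

# Load a local copy of the published SimpleQA test set.
# Filename and column names are assumptions; check the repository
# (github.com/openai/simple-evals) for the actual layout.
df = pd.read_csv("simple_qa_test_set.csv")

print(len(df), "questions")          # expected: 4,326
print(df.iloc[0]["problem"])         # first question
print(df.iloc[0]["answer"])          # its ground-truth answer
```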

Key Points:

📊 OpenAI's research shows that the most advanced language models have a low success rate in answering factual questions, with the highest being only 42.7%.

🤖 These AI models often overestimate their abilities, with confidence ratings generally being inflated.

🔍 OpenAI has made the SimpleQA benchmark public to support research into more reliable language models.