Recently, researchers from institutions including Cornell University benchmarked several generative AI models, among them GPT-4o, Claude, and Gemini, for hallucinations. The study found that even the most advanced models produced hallucination-free text only about 35% of the time, indicating that AI reliability still has considerable room for improvement.
The researchers fact-checked model outputs against authoritative sources on topics such as law, health, and history, and deliberately designed questions whose answers are not covered by Wikipedia. The results showed that OpenAI's models performed best overall, but their progress over the older GPT-3.5 was limited. Interestingly, model size did not determine how often hallucinations occurred: smaller models such as Claude 3 Haiku performed similarly to larger ones.
Study co-author Zhao Wenting pointed out that even models capable of searching the web struggle with these "non-Wiki" questions, reflecting how deeply Wikipedia shapes the models. She expects the hallucination problem to "persist for a long time," partly because the training data itself can contain errors.
One stopgap is to have models decline to answer more often: Claude 3 Haiku, which answered only 72% of the questions, came out as the most "honest" model. However, this strategy may hurt the user experience.
Zhao suggested that completely eliminating hallucinations may be unrealistic, but the problem can be mitigated through human fact-checking and the provision of references. She called for policies that keep human experts involved in verifying AI-generated information.