Researchers at Cornell University and other institutions ran hallucination benchmarks on generative AI models such as GPT-4o, Claude, and Gemini, and found that these models produce hallucination-free text only about 35% of the time, suggesting AI reliability still has considerable room for improvement. The study fact-checked model responses against authoritative sources on topics including law, health, and history, and the question sets were deliberately designed to include content not covered by Wikipedia. OpenAI's models performed best overall, but the improvement over GPT-3.5 was limited. The study also noted that model size showed no clear relationship to how often a model hallucinates.
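To make the headline metric concrete, the sketch below is a hypothetical illustration (not the study's actual code) of how a "hallucination-free" rate could be computed once each claim in a response has been fact-checked against authoritative sources; the `Response` structure and function names are assumptions for illustration only.

```python
# Illustrative sketch: a response counts as hallucination-free only if every
# fact-checked claim in it is supported by an authoritative source.
from dataclasses import dataclass


@dataclass
class Response:
    model: str
    topic: str                     # e.g. "law", "health", "history"
    claims_supported: list[bool]   # one fact-check verdict per atomic claim


def hallucination_free_rate(responses: list[Response]) -> float:
    """Fraction of responses in which all fact-checked claims were supported."""
    if not responses:
        return 0.0
    clean = sum(all(r.claims_supported) for r in responses)
    return clean / len(responses)


if __name__ == "__main__":
    sample = [
        Response("model-a", "law", [True, True, True]),   # hallucination-free
        Response("model-a", "health", [True, False]),     # contains a hallucination
        Response("model-a", "history", [True, True]),     # hallucination-free
    ]
    print(f"hallucination-free rate: {hallucination_free_rate(sample):.0%}")
```

Under this kind of scoring, a single unsupported claim disqualifies the whole response, which is why even strong models can end up with low hallucination-free rates on long, fact-dense answers.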