Recently, a study led by the Complexity Science Hub (CSH) in Austria found that while large language models (LLMs) perform impressively across a wide range of tasks, they fall short on advanced historical questions. The research team tested three leading models, OpenAI's GPT-4, Meta's Llama, and Google's Gemini, and the results were disappointing.
To evaluate the models' historical knowledge, the researchers developed a benchmark called "Hist-LLM." The benchmark draws on the Seshat Global History Databank and checks the accuracy of AI answers to historical questions. The results, presented at the prominent artificial intelligence conference NeurIPS, showed that the best-performing model, GPT-4 Turbo, achieved an accuracy of only 46%, a result only slightly better than random guessing.
Maria del Rio-Chanona, an associate professor of computer science at University College London, said, "Although large language models are impressive, their depth of understanding of advanced historical knowledge remains insufficient. They excel at handling simple facts but struggle with more complex historical questions." For instance, when asked whether scale armor existed in a specific period of ancient Egypt, GPT-4 Turbo incorrectly answered "Yes," even though the technology actually appeared there 1,500 years later. Similarly, when researchers asked whether ancient Egypt had a professional standing army, GPT-4 again incorrectly answered "Yes," when the correct answer is no.
The study also found that the models performed especially poorly on questions about certain regions, such as sub-Saharan Africa, pointing to potential biases in their training data. The lead researcher, Peter Turchin, noted that these results show LLMs still cannot replace humans in certain domains.
Key Points:
- 📉 GPT-4 Turbo scored only 46% accuracy on the advanced-history benchmark, barely better than random guessing.
- 📚 The study shows that large language models still lack understanding of complex historical knowledge.
- 🌍 The research team hopes to enhance the application potential of models in historical research by improving testing tools.