Recent research shows that while artificial intelligence excels at tasks like programming and content creation, it still falls short on complex historical questions. A study presented at the NeurIPS conference found that even the most advanced large language models (LLMs) struggle to achieve satisfactory results on historical knowledge tests.
The research team developed a benchmark called Hist-LLM to evaluate three top language models: OpenAI's GPT-4, Meta's Llama, and Google's Gemini. The benchmark checks model answers against the Seshat Global History Databank, and the results were disappointing: the best-performing model, GPT-4 Turbo, achieved an accuracy of only 46%.
Maria del Rio-Chanona, an associate professor at University College London, explained: "These models perform well on basic historical facts but struggle with in-depth historical research at a doctoral level." The study found that AI frequently makes errors in details, such as incorrectly assessing whether certain periods of ancient Egypt had specific military technologies or standing armies.
The researchers believe this poor performance stems from the models' tendency to extrapolate from mainstream historical narratives, which makes it difficult for them to capture more nuanced historical details. The study also found that the models performed even worse on questions about regions such as sub-Saharan Africa, pointing to potential biases in the training data.
Peter Turchin of the Complexity Science Hub (CSH) said the finding shows that in certain specialized fields, AI cannot yet replace human experts. However, the research team remains optimistic about the prospects of AI in historical research and is working to improve the benchmark to help develop better models.