A recent study indicates that leading artificial intelligence models show cognitive deficits resembling early dementia symptoms when given the Montreal Cognitive Assessment (MoCA) test. The finding highlights the limitations of AI in clinical applications, particularly in tasks requiring visuospatial and executive skills.
A study published in the Christmas issue of The BMJ noted that nearly all major large language models, or "chatbots," showed signs of mild cognitive impairment when assessed using tests commonly used to detect early dementia.
The study also found that older versions of these chatbots performed worse on the tests, much as aging human patients do. The researchers believe these findings "challenge the assumption that AI will soon replace human doctors."
The latest advancements in AI have sparked excitement and concern, prompting discussions about whether chatbots might surpass human doctors in medical tasks.
Although previous studies have shown that large language models (LLMs) perform well in a range of medical diagnostic tasks, their susceptibility to human-like cognitive deficits, such as cognitive decline, had remained largely unexplored until now.
To address this knowledge gap, researchers used the Montreal Cognitive Assessment (MoCA) test to evaluate the cognitive abilities of the leading publicly available LLMs: ChatGPT-4 and ChatGPT-4o from OpenAI, Claude 3.5 Sonnet from Anthropic, and Gemini 1.0 and 1.5 from Alphabet.
The MoCA test is widely used to detect cognitive impairment and early signs of dementia, typically in older adults. Through a series of brief tasks and questions, it assesses a range of abilities, including attention, memory, language, visuospatial skills, and executive function. The maximum score is 30, and a score of 26 or above is generally considered normal.
Researchers gave the LLMs task instructions identical to those given to human patients. Scoring followed official guidelines and was assessed by a practicing neurologist.
On the MoCA test, ChatGPT-4o achieved the highest score (26 out of 30), followed by ChatGPT-4 and Claude 3.5 Sonnet (25 out of 30 each), while Gemini 1.0 scored the lowest (16 out of 30).
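As a minimal illustration (not part of the study), the standard cutoff can be applied to the reported scores in a few lines of Python; the model names and scores below simply restate the figures above.

```python
# Illustrative sketch only: applies the standard MoCA cutoff (26 or above
# is generally considered normal) to the scores reported in the article.

MOCA_MAX = 30
NORMAL_CUTOFF = 26

scores = {
    "ChatGPT-4o": 26,
    "ChatGPT-4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    status = "within normal range" if score >= NORMAL_CUTOFF else "below cutoff"
    print(f"{model}: {score}/{MOCA_MAX} ({status})")
```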
All chatbots performed poorly on visuospatial and executive tasks, such as the trail making task (connecting encircled numbers and letters in alternating ascending order) and the clock drawing test (drawing a clock face showing a specified time). The Gemini models failed the delayed recall task (remembering a sequence of five words).
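For readers unfamiliar with the trail making task, the short sketch below (an illustration assuming the standard five-pair MoCA layout, not the study's materials) prints the alternating sequence a correct trail follows.

```python
# Illustrative only: the expected answer for a MoCA-style trail making
# item is an alternating ascending sequence of numbers and letters
# (1, A, 2, B, ...). Assumes the standard five-pair layout.
from string import ascii_uppercase

def trail_sequence(pairs: int) -> list[str]:
    """Return the alternating number/letter order a correct trail follows."""
    seq: list[str] = []
    for i in range(pairs):
        seq.append(str(i + 1))
        seq.append(ascii_uppercase[i])
    return seq

print(" -> ".join(trail_sequence(5)))  # 1 -> A -> 2 -> B -> ... -> 5 -> E
```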
All chatbots performed well in most other tasks, including naming, attention, language, and abstraction.
In further testing, the chatbots also failed to show empathy and could not accurately interpret complex visual scenes. Only ChatGPT-4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
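To clarify what the incongruent stage involves, here is a small illustrative sketch (an assumption about typical Stroop stimuli, not the study's test materials): congruent trials print a color word in its own color, while incongruent trials pair it with a mismatched font color to create the interference being measured.

```python
import random

# Illustrative sketch of Stroop-style stimuli, not the study's materials.
# Congruent trial: the word names the color it is printed in.
# Incongruent trial: word and font color conflict, producing interference.

COLORS = ["red", "green", "blue", "yellow"]

def make_trial(congruent: bool) -> tuple[str, str]:
    word = random.choice(COLORS)
    ink = word if congruent else random.choice([c for c in COLORS if c != word])
    return word, ink  # (color word shown, font color it is printed in)

for label, flag in (("congruent", True), ("incongruent", False)):
    word, ink = make_trial(flag)
    print(f"{label}: the word '{word.upper()}' printed in {ink}")
```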
These are observational findings, and researchers acknowledge the fundamental differences between the human brain and large language models.
Nonetheless, they pointed out that all large language models consistently failed in tasks requiring visual abstraction and executive functions, highlighting a significant weakness that may hinder their use in clinical settings.
Thus, they concluded: "Neurologists are not only unlikely to be replaced by large language models in the short term, but our findings suggest they may soon find themselves treating new, virtual patients—AI models exhibiting cognitive impairments."