Despite the remarkable progress AI has made in medicine, a new study shows that general-purpose AI systems like ChatGPT still have significant flaws when it comes to complex medical diagnoses.
A research team led by medical educator Amrit Kirpalani of Western University in Ontario, Canada, found that ChatGPT gave incorrect diagnoses in 76 of 150 complex medical cases drawn from Medscape, an error rate of roughly 51%.
The study used Medscape's case bank, which is closer to real clinical practice than the United States Medical Licensing Examination (USMLE) and includes complications and diagnostic challenges. The researchers carefully worded their prompts to work around OpenAI's restriction on using ChatGPT for medical advice.
Kirpalani attributed ChatGPT's poor performance mainly to two factors: first, unlike specialized medical AI systems, ChatGPT lacks deep medical expertise; second, it struggles with clinical "gray areas," unable to interpret slightly abnormal test results as flexibly as human doctors do.
More concerning, ChatGPT offers plausible, persuasive explanations even when its diagnosis is wrong. That trait could mislead laypeople and increase the risk of medical misinformation spreading.
Nevertheless, AI has real value in medicine. Co-author Edward Tran noted that ChatGPT has become an important tool in medical school education, helping students organize notes, clarify diagnostic algorithms, and prepare for exams. Even so, Kirpalani strongly advises the public not to rely on ChatGPT for medical advice and to keep consulting professional healthcare providers.
Kirpalani believes that building a reliable AI doctor will require training on extensive clinical data under strict oversight. In the near term, AI is more likely to augment the work of human doctors than to replace them, and as the technology advances, its role in healthcare will remain a closely watched topic.