A new study suggests that OpenAI's o1-preview AI system may outperform human doctors in diagnosing complex medical cases. Research teams from Harvard Medical School and Stanford University ran o1-preview through comprehensive medical diagnostic tests, finding significant improvements over earlier models.
According to the study, o1-preview achieved a correct diagnosis rate of 78.3% across all tested cases. In a head-to-head comparison on 70 cases, its accuracy rose to 88.6%, significantly surpassing its predecessor GPT-4's 72.9%. Its medical reasoning was also noteworthy: on the R-IDEA scale, a standard measure of medical reasoning quality, the system earned a perfect score on 78 of 80 cases, whereas experienced doctors did so on only 28 cases and medical residents on just 16.
The researchers acknowledged that o1-preview might have included some test cases in its training data. However, when they tested the system on new cases, its performance only slightly declined. Dr. Adam Rodman, one of the authors of the study, emphasized that while this is a benchmark study, the findings have important implications for medical practice.
o1-preview particularly excelled in handling complex management cases specifically designed by 25 experts. "Humans struggle with these challenging problems, but o1's performance is impressive," Rodman explained. In these complex cases, o1-preview scored 86%, while doctors using GPT-4 only scored 41%, and traditional tools scored just 34%.
However, o1-preview is not without flaws. The system showed no significant improvement in probability assessment: when estimating the likelihood of pneumonia, for example, it gave a 70% estimate, well above the evidence-based range of 25%-42%. The researchers found that o1-preview excelled at tasks requiring critical thinking but struggled with more abstract challenges, such as estimating probabilities.
Moreover, o1-preview typically gives detailed answers, which may have inflated its scores. The study also evaluated only o1-preview working on its own, without assessing how effectively it collaborates with doctors. Some critics noted that the diagnostic tests o1-preview suggests are often costly and impractical.
Although OpenAI has since released newer o1 and o3 models that excel at complex reasoning tasks, these more powerful models still do not resolve the practical application and cost issues raised by critics. Rodman called on researchers to develop better evaluation methods for medical AI systems that capture the complexity of real-world medical decision-making. He stressed that the study does not imply AI can replace doctors; real medical care still requires human involvement.
Paper: https://arxiv.org/abs/2412.10849
Key Points:
🌟 o1-preview achieved a diagnostic accuracy of 88.6% on 70 head-to-head cases, well above GPT-4's 72.9%.
🧠 In medical reasoning, o1-preview scored full marks on 78 out of 80 cases, far exceeding doctor performance.
💰 Despite its strong performance, o1-preview's costly and often impractical test suggestions remain a barrier to real-world application.