Recently, Vectara's machine learning team conducted an in-depth hallucination test on two models from the DeepSeek series. The results showed that the hallucination rate of DeepSeek-R1 reached 14.3%, significantly higher than the 3.9% of its predecessor, DeepSeek-V3. In other words, in the course of its enhanced reasoning, DeepSeek-R1 produced noticeably more content that was inaccurate or inconsistent with the source material. The finding has sparked widespread discussion about the hallucination rates of reasoning-enhanced large language models (LLMs).
The research team pointed out that reasoning-enhanced models may be more prone to hallucination than ordinary large language models. The pattern is especially apparent when the DeepSeek series is compared with other reasoning-enhanced models: the gap in hallucination rates between the reasoning-enhanced GPT-o1 and the standard GPT-4o, for example, supports the same hypothesis.
To evaluate the two models, the researchers used Vectara's HHEM model and Google's FACTS method. HHEM, a dedicated hallucination detection model, proved highly sensitive in capturing the increased hallucination rate of DeepSeek-R1, whereas the FACTS approach, which relies on LLMs as judges, was noticeably less sensitive. This suggests that a specialized detector such as HHEM may be a more reliable evaluation standard than LLM-based judging.
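To make the evaluation protocol concrete, here is a minimal sketch of how a hallucination rate can be computed with an HHEM-style consistency scorer: each (source, summary) pair is given a factual-consistency score, and the hallucination rate is the share of pairs falling below a cutoff. The model ID, loading API, and 0.5 threshold are illustrative assumptions to check against Vectara's current documentation, not the exact pipeline the team used.

```python
from sentence_transformers import CrossEncoder

# Illustrative only: an HHEM-style factual-consistency scorer loaded as a
# cross-encoder. The model ID and its output scale are assumptions here and
# should be verified against Vectara's current model card.
scorer = CrossEncoder("vectara/hallucination_evaluation_model")

def hallucination_rate(pairs, threshold=0.5):
    """Return the fraction of (source, summary) pairs whose consistency
    score falls below `threshold`, i.e. summaries judged hallucinated.
    The 0.5 cutoff is an assumed value, not Vectara's published setting."""
    scores = scorer.predict(pairs)  # one consistency score per pair
    flagged = sum(1 for s in scores if s < threshold)
    return flagged / len(pairs)

# A summary that contradicts its source should receive a low score.
pairs = [
    ("The report was published in 2021.", "The report came out in 2021."),
    ("The report was published in 2021.", "The report came out in 1995."),
]
print(f"Hallucination rate: {hallucination_rate(pairs):.1%}")
```

Under this kind of protocol, a model's hallucination rate is simply the share of its generated summaries that the detector flags as unsupported by the source document, which is what allows figures like 14.3% and 3.9% to be compared directly across models.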
It is worth noting that while DeepSeek-R1 performs excellently on reasoning tasks, that strength comes with a higher hallucination rate, which may be tied to the more complex chains of logic a reasoning-enhanced model has to produce: as the reasoning grows more elaborate, the factual accuracy of the generated content can suffer. The research team also emphasized that if DeepSeek were to focus more on reducing hallucinations during training, it might strike a good balance between reasoning ability and accuracy.
Although reasoning-enhanced models typically exhibit higher hallucination rates, this does not mean they lack advantages in other areas. For the DeepSeek series, further research and optimization are needed to address hallucination issues and enhance overall model performance.