Anthropic's latest model, Claude 3.5 Sonnet, has demonstrated remarkable performance in recent technical evaluations, surpassing even expert doctoral-level performance. On the Graduate-Level Google-Proof Q&A (GPQA) benchmark, Claude 3.5 Sonnet achieved a score of 67.2%, reportedly the first time a large language model has broken the 65% threshold on this assessment, marking a new high in the ability to understand and answer advanced scientific questions.
GPQA is a benchmark designed to measure a language model's ability to answer graduate-level scientific questions. It covers complex, deep problems that demand strong reasoning and knowledge integration. On this challenging test, doctoral degree holders from outside the relevant field average about 34%, while domain-expert PhDs average about 65%. By some estimates, a language model scoring 60% on GPQA performs roughly on par with a person with an IQ of 150.
Based on available figures, Claude 3.5 Sonnet's performance surpasses both GPT-4o and GPT-4T on this benchmark: in the 0-shot chain-of-thought (CoT) setting, it scored higher than GPT-4o (53.6%) and GPT-4T (48.0%), further demonstrating its leading position in language understanding and problem-solving.
Anthropic's achievement not only showcases the capabilities of Claude 3.5 Sonnet but also sets a new benchmark for large language models in handling advanced knowledge-based question-answering tasks. As the technology continues to advance, the potential applications of these models across fields will only grow broader.