China's college entrance examination is no longer a test for humans alone. The Shanghai Artificial Intelligence Laboratory recently ran its own "college entrance exam" for AI, using the OpenCompass evaluation system to test seven AI models, including GPT-4o, in Chinese, Mathematics, and English.
The test used the National New Curriculum Standard Paper I. All participating open-source models had been released before the college entrance examination took place, a condition meant to keep the exam questions out of their training data and the comparison fair. The AI "answer sheets" were graded by hand by teachers with experience marking the college entrance examination, so that scoring stayed as close as possible to real marking standards.
The models evaluated came from a range of developers: the Mixtral 8x22B dialogue model from French AI startup Mistral, Yi-1.5-34B from 01.AI, GLM-4-9B from Zhipu AI, InternLM2-20B-WQX from the Shanghai Artificial Intelligence Laboratory, and the Qwen2 series from Alibaba. GPT-4o, a closed-source model, took part only as a reference.
Qwen2-72B took first place with a total score of 303, followed closely by GPT-4o with 296 points and InternLM2-20B-WQX with 295.5 points. The models performed well in Chinese and English, with an average score rate of 67% in Chinese and 81% in English. In Mathematics, however, the average score rate across all models was only 36%, which indicates significant room for improvement in mathematical reasoning.
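As a rough illustration of the arithmetic behind these figures, the sketch below ranks the three reported totals and computes the margin between neighbouring places. The `score_rate` helper reflects the usual reading of "score rate" (points earned divided by points available); the article does not state the full marks for each paper, so that value is left as a parameter the caller must supply, and all function names here are hypothetical.

```python
# Total scores as reported in the article (top three models only).
reported_totals = {
    "GPT-4o": 296.0,
    "InternLM2-20B-WQX": 295.5,
    "Qwen2-72B": 303.0,
}


def rank_by_total(totals):
    """Return (model, score) pairs sorted from highest to lowest total."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def margins(ranking):
    """Return (leader, follower, point gap) for each pair of adjacent places."""
    return [
        (ranking[i][0], ranking[i + 1][0], ranking[i][1] - ranking[i + 1][1])
        for i in range(len(ranking) - 1)
    ]


def score_rate(points_earned, full_marks):
    """Score rate as a fraction: points earned over points available.

    Assumption: full_marks is supplied by the caller, because the article
    does not give the maximum score for each paper.
    """
    if full_marks <= 0:
        raise ValueError("full_marks must be positive")
    return points_earned / full_marks


ranking = rank_by_total(reported_totals)
gaps = margins(ranking)

for place, (model, total) in enumerate(ranking, start=1):
    print(f"{place}. {model}: {total}")
for leader, follower, gap in gaps:
    print(f"{leader} leads {follower} by {gap} points")
```

On the reported totals, the gap between first and second place is 7 points, while second and third are separated by only half a point.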
The marking teachers also analyzed the models' answer sheets in detail. In Chinese, the models generally handled modern text comprehension well but were weaker in classical Chinese and essay writing. In Mathematics, the models recalled formulas reliably but lacked flexibility in applying them to solve problems. In English, the models performed well overall, although some had lower score rates on certain question types.
This "large-scale AI college entrance exam" showcases the potential of AI in academic tasks, but it also reveals the models' limitations in understanding and applying knowledge. As the technology continues to advance, there is reason to expect that future AI will become more capable and serve human society better.