Stanford University recently released the latest results of HELM MMLU, its evaluation of large models on the MMLU benchmark. Percy Liang, Director of the Center for Research on Foundation Models at Stanford University, noted that Alibaba's Qwen2-72B model has overtaken Llama3-70B in the rankings, becoming the top-performing open-source large model.

MMLU (Massive Multitask Language Understanding) is one of the industry's most influential evaluation benchmarks for large models. It covers 57 tasks, including elementary mathematics, computer science, law, and history, and is designed to test the world knowledge and problem-solving abilities of large models. In practice, however, results reported for different models are often inconsistent and hard to compare, mainly because of non-standard prompting techniques and the inconsistent use of open-source evaluation frameworks.


The Center for Research on Foundation Models (CRFM) at Stanford University has proposed HELM (Holistic Evaluation of Language Models), a framework intended to make model evaluation transparent and reproducible. HELM standardizes how different models are evaluated on MMLU and publishes the results openly, addressing the comparability problems in existing MMLU evaluations. For example, it uses the same prompts for all evaluated models and gives each model the same five examples for in-context learning in every test subject.
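To illustrate what "the same prompts and the same five examples" means in practice, here is a minimal Python sketch of a fixed five-shot prompt builder. It is not HELM's actual code or prompt format: the `Question` class, the `build_prompt` function, and the exact wording of the header line are all assumptions made for illustration. The point it shows is that the prompt depends only on the subject, the fixed examples, and the target question, never on which model is being evaluated.

```python
from dataclasses import dataclass

LETTERS = "ABCD"  # MMLU questions are four-way multiple choice


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice question."""
    text: str
    choices: tuple
    answer: str  # correct letter, e.g. "C"


def format_question(q, include_answer):
    """Render one question; leave the answer blank for the target."""
    lines = [f"Question: {q.text}"]
    for letter, choice in zip(LETTERS, q.choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject, examples, target):
    """Build a five-shot prompt that is identical for every model.

    The header wording is an assumption for this sketch, not a quote
    from HELM.
    """
    if len(examples) != 5:
        raise ValueError("expected exactly five in-context examples")
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    blocks = [header]
    blocks += [format_question(e, include_answer=True) for e in examples]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)
```

Because `build_prompt` takes no model argument, any two models scored on the same subject see exactly the same text, which is the property that makes their scores comparable.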

Percy Liang recently shared the latest HELM MMLU rankings on social media. The list shows Alibaba's open-source Qwen2-72B model in fifth place, behind Claude 3 Opus, GPT-4o, Gemini 1.5 Pro, and GPT-4, making it the highest-ranking open-source large model and the best-performing Chinese large model.

The Qwen2 series, open-sourced in early June 2024, includes pre-trained and instruction-tuned models in five sizes. To date, Qwen series models have been downloaded more than 16 million times, a sign of broad adoption across the industry.

The latest HELM MMLU results not only highlight the strong performance of Qwen2-72B in multitask language understanding but also mark the rise of Chinese large models in the global AI competition. As the technology continues to advance, we look forward to seeing more Chinese large models make their mark on the international stage.