With the surge in popularity of ChatGPT, numerous leaderboards for evaluating large models have emerged both in China and abroad. However, models of comparable parameter scale often rank very differently from one leaderboard to another. Industry and academia attribute this primarily to the use of different evaluation sets, and also to the growing proportion of subjective questions, which casts doubt on the fairness of the results. Third-party evaluation platforms such as OpenCompass and FlagEval have consequently begun to attract attention. Still, a widely held view in the industry is that truly comprehensive and effective evaluation must also cover dimensions such as model robustness and security, an area that remains under exploration.
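To make the robustness dimension concrete, below is a minimal sketch of one possible robustness probe: perturb a prompt with small typos and check whether the model's answer stays stable. The `perturb` function, the `dummy_model` stand-in, and the perturbation rate are illustrative assumptions for this sketch, not the protocol of OpenCompass, FlagEval, or any actual benchmark.

```python
import random
import string


def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Inject random character-level typos into a prompt (assumed perturbation scheme)."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


def robustness_score(model, prompt: str, n_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the clean-prompt answer."""
    baseline = model(prompt)
    matches = sum(
        model(perturb(prompt, seed=s)) == baseline for s in range(n_variants)
    )
    return matches / n_variants


if __name__ == "__main__":
    # Hypothetical stand-in "model": answers a fixed multiple-choice question
    # based on a keyword, so typos in the keyword flip its answer.
    def dummy_model(p: str) -> str:
        return "B" if "capital" in p else "A"

    print(robustness_score(dummy_model, "Which city is the capital of France?"))
```

A real harness would of course call an actual model API and use a broader family of perturbations (paraphrases, distractor insertions, adversarial suffixes); the point here is only that robustness can be scored as answer consistency under controlled input noise.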