The article examines the "benchmarking chaos" in current large model evaluation, where leaderboards routinely show "everyone claiming first place." Open-source benchmark datasets invite a teach-to-the-test mentality, while closed proprietary datasets undermine fairness, and some rankings rely on evaluation dimensions that are neither scientific nor comprehensive. The article proposes establishing an authoritative evaluation system that open-sources its tools and processes to ensure fairness, while adopting a model of open historical datasets plus closed formal datasets for scoring. It also argues that commercializing large models matters far more than parameter counts or leaderboard positions.
"Baimao Battle" Family's First, When Will Cheating in Large Model 'Scoring' Stop?
罗超频道
© AIbase 2024. Source: https://www.aibase.com/news/3649