The OpenCompass team at the Shanghai Artificial Intelligence Laboratory, in collaboration with ModelScope, has recently launched an upgraded version of CompassArena, its large model evaluation platform. The upgrade aims to give users a more scientific and comprehensive model evaluation experience. Since its launch, the platform has attracted a large number of community users who have taken part and contributed data, and CompassArena has been continuously refined on the basis of that data. This upgrade adds the new Judge Copilot feature, improves the leaderboard algorithm, and brings in more than 20 new models.

The Judge Copilot feature uses the evaluation model Compass-Judger-1-32B-Instruct to let users run comprehensive comparative analyses of dialogue model performance. It offers multi-dimensional evaluations, real-time comparisons, and intelligent decision-making assistance, making subjective assessments more accurate and efficient. The leaderboard algorithm has also been overhauled: it extends the original Bradley-Terry statistical model with control variables that reduce the influence of confounding factors, which yields a more scientific and precise model ranking. The newly added models include domestic and international commercial models as well as open-source models, enriching the competitive experience.
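The article does not spell out the exact formulation, so the following is only one common way to write a Bradley-Terry model with control variables. The symbols are assumptions introduced here for illustration: \beta_i is the strength of model i, x_{ij} is a vector of control variables describing a battle between models i and j, and \gamma holds the coefficients of those controls.

```latex
\[
P(i \succ j \mid x_{ij})
  = \frac{1}{1 + \exp\!\bigl(-(\beta_i - \beta_j + \gamma^{\top} x_{ij})\bigr)}
\]
```

With \gamma = 0 this reduces to the plain Bradley-Terry model, in which only the strength difference \beta_i - \beta_j determines the win probability.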


CompassArena places great importance on how the Judge model performs in real-world use and actively collects user feedback to further improve the Judge model's overall capabilities and alignment. Users can rate the Judge model's output by clicking the "Like" and "Dislike" buttons. On the leaderboard side, fitting a Bradley-Terry statistical model that includes control variables lets CompassArena estimate how strongly various external factors influence battle outcomes. Each effect can be expressed as an odds ratio, where a value above 1 means the factor raises the odds of winning and a value below 1 means it lowers them.
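As a rough illustration of how such odds ratios could be obtained, here is a minimal sketch that fits a Bradley-Terry model with one control variable as a logistic regression. It is not CompassArena's implementation: the function name `fit_bt_with_controls`, the synthetic battles, and the control variable (whether the first model's reply is the longer one) are all hypothetical and chosen only for the example.

```python
import numpy as np

def fit_bt_with_controls(battles, n_models, n_controls,
                         l2=1e-3, lr=0.5, steps=2000):
    """Fit Bradley-Terry strengths plus control coefficients.

    battles: list of (i, j, controls, won), where won is 1.0 if model i
    beat model j and 0.0 otherwise. Returns (strengths, gamma).
    """
    X = np.zeros((len(battles), n_models + n_controls))
    y = np.zeros(len(battles))
    for row, (i, j, controls, won) in enumerate(battles):
        X[row, i] = 1.0            # +1 for the first model
        X[row, j] = -1.0           # -1 for the second model
        X[row, n_models:] = controls
        y[row] = won
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted win probability
        grad = X.T @ (y - p) / len(y) - l2 * w    # log-likelihood gradient with L2 penalty
        w += lr * grad                            # plain gradient ascent
    return w[:n_models], w[n_models:]

# Synthetic data (assumption): three models and one control variable that is
# +1 when model i's reply is longer than model j's, and -1 otherwise.
rng = np.random.default_rng(0)
true_strength = np.array([-0.6, 0.0, 0.6])
true_gamma = 0.8
battles = []
for _ in range(4000):
    i, j = rng.choice(3, size=2, replace=False)
    longer = rng.choice([-1.0, 1.0])
    logit = true_strength[i] - true_strength[j] + true_gamma * longer
    won = rng.random() < 1.0 / (1.0 + np.exp(-logit))
    battles.append((i, j, [longer], float(won)))

strengths, gamma = fit_bt_with_controls(battles, n_models=3, n_controls=1)
odds_ratios = np.exp(gamma)  # multiplicative change in win odds per unit of the control

print("estimated strengths:", strengths)
print("control coefficient:", gamma)
print("odds ratio:", odds_ratios)
```

Exponentiating a fitted control coefficient gives its odds ratio. Because the control is included in the fit, the estimated strengths reflect model quality with that factor held fixed rather than absorbing its effect.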

This upgrade adds domestic commercial models such as 360gpt2-pro, deep-seek-v2.5-chat, and doubao-pro-32k-240828, international commercial models such as claude-3.5-sonnet-20241022 and gemini-exp-1121, and a series of open-source models. The newly added models come from organizations including 360, DeepSeek, and Doubao, giving users a richer selection of models to compare.

Experience link: https://www.modelscope.cn/studios/opencompass/CompassArena