CompassArena Upgrade: Launch of New Judge Copilot Feature

AIbase基地

Published in AI News · 4 min read · Dec 19, 2024

The OpenCompass team at the Shanghai Artificial Intelligence Laboratory, in collaboration with ModelScope, has launched an upgraded version of CompassArena, its evaluation platform for large models. The upgrade aims to give users a more scientific and comprehensive model evaluation experience. Since the platform's launch, a large number of community users have taken part and contributed data, and CompassArena has used that data to keep improving. This release adds the new Judge Copilot feature, improves the leaderboard algorithm, and brings in more than 20 new models.

The Judge Copilot feature uses the evaluation model Compass-Judger-1-32B-Instruct to let users run comprehensive comparative analyses of dialogue model performance. It offers multi-dimensional evaluation, real-time comparison, and decision-making assistance, with the goal of making subjective assessments more accurate and efficient. The leaderboard algorithm has also been upgraded: it extends the original Bradley-Terry statistical model with control variables that reduce the influence of confounding factors, which is intended to produce a more scientific and precise model ranking. The newly added models include domestic and international commercial models as well as open-source models, broadening the pool of models that compete in the arena.
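The article does not give the exact form of the upgraded algorithm. As a rough sketch of what "Bradley-Terry with control variables" usually means, and assuming the common logistic formulation, the probability that model $i$ beats model $j$ can be written as below. The symbols $\beta_i$ (model strength), $z_{ij}$ (a vector of control variables, for example a difference in response length) and $\gamma$ (their coefficients) are illustrative assumptions, not notation taken from CompassArena.

```latex
% Plain Bradley-Terry: only the strength difference matters.
P(i \succ j) = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}

% With control variables: confounders enter as extra linear terms,
% so their effect is estimated separately from model strength.
P(i \succ j \mid z_{ij}) = \frac{1}{1 + e^{-(\beta_i - \beta_j + \gamma^{\top} z_{ij})}}
```

Under this formulation, ranking models by $\beta_i$ compares them with the controlled factors held fixed.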

CompassArena places great importance on how the Judge model performs in real-world use and actively collects user feedback to improve its overall capabilities and alignment. Users can rate the Judge model's output by clicking the "Like" and "Dislike" buttons. On the leaderboard side, fitting a Bradley-Terry statistical model that includes control variables lets CompassArena estimate how strongly various external factors influence battle outcomes, with each effect expressed as an odds ratio.
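To make the odds-ratio idea concrete, here is a minimal Python sketch that fits a Bradley-Terry model with one control variable by logistic regression on synthetic battles. It is an illustration under stated assumptions, not CompassArena's implementation: the function names, the choice of a standardized length difference as the control, and the plain gradient-descent fit are all hypothetical.

```python
import numpy as np


def simulate_battles(strengths, gamma, n_battles, seed=0):
    """Generate synthetic pairwise battles (illustrative data only)."""
    rng = np.random.default_rng(seed)
    strengths = np.asarray(strengths, dtype=float)
    n_models = len(strengths)
    a = rng.integers(0, n_models, size=n_battles)
    # Pick an opponent that always differs from model A.
    b = (a + rng.integers(1, n_models, size=n_battles)) % n_models
    # Hypothetical control: standardized length difference (A minus B).
    z = rng.normal(size=n_battles)
    logit = strengths[a] - strengths[b] + gamma * z
    a_won = rng.random(n_battles) < 1.0 / (1.0 + np.exp(-logit))
    return a, b, z, a_won.astype(float)


def fit_bt_with_controls(a, b, controls, a_won, n_models,
                         lr=0.5, steps=5000, l2=1e-4):
    """Fit model strengths and control coefficients by gradient descent.

    Each battle becomes one row: +1 in model A's column, -1 in model B's
    column, followed by the control features. The label is 1 if A won.
    """
    n = len(a)
    controls = np.asarray(controls, dtype=float).reshape(n, -1)
    x = np.zeros((n, n_models + controls.shape[1]))
    x[np.arange(n), a] = 1.0
    x[np.arange(n), b] = -1.0
    x[:, n_models:] = controls

    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        # Gradient of the mean log loss plus a small L2 penalty.
        w -= lr * (x.T @ (p - a_won) / n + l2 * w)

    # Strengths are only defined up to a shift, so center them at zero.
    strengths = w[:n_models] - w[:n_models].mean()
    gamma = w[n_models:]
    # exp(gamma_k) is the odds ratio for a one-unit change in control k.
    return strengths, gamma, np.exp(gamma)


a, b, z, y = simulate_battles([1.0, 0.0, -1.0], gamma=0.8, n_battles=4000)
strengths, gamma, odds_ratios = fit_bt_with_controls(a, b, z, y, n_models=3)
print("strengths:", strengths.round(2))
print("odds ratios:", odds_ratios.round(2))
```

An odds ratio above 1 for a control means that, at equal model strength, a one-unit increase in that factor raises the odds of winning; an odds ratio near 1 means the factor has little influence on the outcome.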

This upgrade adds domestic commercial models such as 360gpt2-pro, deep-seek-v2.5-chat, and doubao-pro-32k-240828, international commercial models such as claude-3.5-sonnet-20241022 and gemini-exp-1121, and a series of open-source models. The new models come from organizations including 360, DeepSeek, and Doubao, giving users a wider selection of models to compare.

Experience link: https://www.modelscope.cn/studios/opencompass/CompassArena

This article is from AIbase Daily
