The Beijing Academy of Artificial Intelligence (BAAI) has recently launched FlagEval Debate, the world's first debate platform for Chinese large language models. The platform introduces a new way to measure model capabilities through a competitive debate mechanism, and extends BAAI's FlagEval model-arena service, which is designed to surface differences in capability among large language models.
Current model-versus-model arenas suffer from several problems: frequent draws make it hard to tell models apart; test content depends on user voting, which requires large-scale user participation; and existing battle formats involve no direct interaction between the models. To address these issues, BAAI has adopted model debates as its evaluation format.
Debate is a language-based intellectual activity that showcases participants' logical thinking, language organization, and information analysis skills. Model debates can likewise reveal how well large models understand information, integrate knowledge, reason logically, generate language, and sustain a conversation, while also testing the depth of their information processing and their adaptability in complex contexts.
BAAI found that the interactive format of debate amplifies the gaps between models and allows effective rankings to be computed from a small number of samples. It has therefore launched FlagEval Debate, a crowdsourcing-based debate platform for Chinese large language models.
On the platform, two models debate a topic randomly drawn from a topic bank consisting mainly of trending topics crafted by evaluation experts and top debaters. All users on the platform can judge each debate.
Each debate consists of five rounds of argument, with each side speaking once per round. To eliminate position bias, each model argues both the affirmative and the negative side once. Every model debates multiple opponents, and final rankings are computed from accumulated winning points.
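The pairing and ranking scheme can be illustrated with a minimal sketch. The scoring values (one point per win) and the `judge` callback are assumptions for illustration; BAAI has not published the exact formula.

```python
from itertools import permutations
from collections import defaultdict

def rank_models(models, judge):
    """Pair every model against every other twice, once per side,
    and rank by accumulated winning points.

    `judge(affirmative, negative)` is a hypothetical callback that
    runs one five-round debate and returns the winner's name,
    or None for a draw.
    """
    points = defaultdict(int)
    # permutations yields each ordered pair, so every model argues
    # both the affirmative and the negative side against each opponent,
    # which cancels out any advantage from debating a particular side.
    for affirmative, negative in permutations(models, 2):
        winner = judge(affirmative, negative)
        if winner is not None:
            points[winner] += 1  # 1 point per win (assumed scoring)
    return sorted(models, key=lambda m: points[m], reverse=True)
```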
Debates are judged through a combination of open crowdsourcing and expert review: the expert panel consists of participants and judges from professional debate competitions, while open crowdsourcing lets any viewer watch the debates and cast a vote.
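The platform does not disclose how the two vote pools are combined; the sketch below is one plausible aggregation, with the expert weighting purely an assumption.

```python
def decide_winner(crowd_votes, expert_votes, expert_weight=3.0):
    """crowd_votes / expert_votes: dicts mapping model name -> vote count.
    Experts are weighted more heavily; the weight value is an assumption."""
    scores = {}
    for model in set(crowd_votes) | set(expert_votes):
        scores[model] = (crowd_votes.get(model, 0)
                         + expert_weight * expert_votes.get(model, 0))
    best = max(scores, key=scores.get)
    # Treat a tied top score as a draw.
    return best if list(scores.values()).count(scores[best]) == 1 else None
```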
BAAI stated that it will continue to explore the technical pathways and application value of model debates, uphold the principles of science, authority, fairness, and openness, and keep improving the FlagEval evaluation system to bring fresh perspectives to the large-model evaluation ecosystem.
FlagEval Debate Official Website:
https://flageval.baai.org/#/debate