Alibaba Damo Academy, in collaboration with the ModelScope community, recently announced the open-source release of a new multilingual benchmark dataset, P-MMEval. The dataset is designed to comprehensively assess the multilingual capabilities of large language models (LLMs) and to support comparative analysis of cross-language transfer. The benchmark spans datasets for both foundational and specialized capabilities, ensures consistent multilingual coverage across all selected datasets, and provides parallel samples across languages. It covers ten languages from eight language families: English, Chinese, Arabic, Spanish, Japanese, Korean, Thai, French, Portuguese, and Vietnamese.

The launch of P-MMEval responds to the need for accurate, parallel evaluation results when developing and iterating on large language models, which is crucial for identifying a model's multilingual capabilities and quantifying its performance. Early work focused primarily on single-task evaluations, while more recent studies have proposed large-scale multilingual, multi-task benchmarks that unify several representative independent tasks. However, these large-scale benchmarks do not provide consistent language coverage across their constituent tasks.


P-MMEval selects effective benchmark datasets using a significance-testing method, integrating foundational natural language processing tasks with capability-specific evaluation tasks. It ensures consistent language coverage for each task and provides cross-language parallel samples for like-for-like comparisons. In terms of task diversity, P-MMEval covers two key foundational NLP task types (generation and understanding) as well as five core capabilities of current LLMs. In terms of language diversity, it unifies ten languages spanning eight language families.
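For readers who want to inspect the parallel samples directly, the dataset can be pulled from ModelScope with the modelscope Python SDK. The sketch below is illustrative only: the subset and split names are assumptions, so check the dataset card (linked at the end of this article) for the actual configuration names.

```python
# Minimal sketch: loading one P-MMEval subtask from ModelScope.
# ASSUMPTION: the subset_name and split values below are placeholders;
# consult the dataset card for the real configuration names.
from modelscope.msdatasets import MsDataset

dataset = MsDataset.load(
    'P-MMEval',               # dataset id on ModelScope
    namespace='modelscope',   # owner namespace from the dataset link
    subset_name='flores',     # hypothetical subtask name
    split='test',             # hypothetical split name
)

# Print one record; field names differ from subtask to subtask.
for sample in dataset:
    print(sample)
    break
```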

The P-MMEval dataset has been integrated into both the OpenCompass and EvalScope evaluation frameworks, either of which can be used to run the evaluation tasks. OpenCompass offers an open-source, efficient, and comprehensive platform for evaluating large models, supporting one-stop evaluation of large language models, multimodal models, and more, and it regularly publishes evaluation results and leaderboards; evaluations on P-MMEval can be completed with the open-source tooling OpenCompass provides.
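As a concrete illustration, the snippet below sketches how such an evaluation might be launched through EvalScope's Python entry point. This is a sketch under assumptions: the `run_task` import path and dict-style task config follow EvalScope's general documented usage, but the `p_mmeval` dataset key and the model id are placeholders and should be verified against the EvalScope documentation before running.

```python
# Hedged sketch of running a P-MMEval evaluation with EvalScope.
# ASSUMPTIONS: 'p_mmeval' as the dataset key and the model id below are
# placeholders for illustration, not values confirmed by the article.
from evalscope.run import run_task

task_cfg = {
    'model': 'Qwen/Qwen2.5-7B-Instruct',  # example open-source model from the article
    'datasets': ['p_mmeval'],             # hypothetical dataset key for P-MMEval
    'limit': 10,                          # evaluate only a few samples as a smoke test
}

run_task(task_cfg=task_cfg)
```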

Researchers evaluated several representative instruction-tuned models, including the closed-source models GPT-4o and Claude-3.5 and the open-source models LLaMA3.1, LLaMA3.2, and Qwen2.5, among others. Experimental results indicate that, except for the LLaMA3.2 series, the multilingual capabilities of all models improve as model size increases. Qwen2.5 demonstrates strong multilingual performance on understanding and specialized-capability tasks, while Gemma2 excels on generation tasks. Overall, closed-source models outperform open-source models.

The release of P-MMEval provides new tools and methods for evaluating the multilingual capabilities of large models, contributing to the development and application of multilingual NLP technologies.

Dataset Link:

https://www.modelscope.cn/datasets/modelscope/P-MMEval