Early this morning, the Alibaba Tongyi Qianwen team released the Qwen2 series of open-source models. The series includes pre-trained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Key details indicate that, compared with the previous generation Qwen1.5, these models are significantly improved in both scale and performance.

Regarding multilingual capabilities, the Qwen2 series has invested heavily in increasing both the quantity and quality of its training data, covering 27 languages besides English and Chinese. Comparative testing shows that the large-scale models (70B+ parameters) excel in natural language understanding, coding, mathematics, and more, and Qwen2-72B even surpasses the larger previous-generation Qwen1.5-110B despite having fewer parameters.

The Qwen2 models not only demonstrate strong capabilities in basic language model evaluations but also achieve remarkable results in instruction-tuned model assessments. Their multilingual abilities shine in benchmarks like M-MMLU and MGSM, showcasing the powerful potential of Qwen2 instruction-tuned models.

The release of the Qwen2 series marks a new height in artificial intelligence technology, providing broader possibilities for global AI applications and commercialization. Looking ahead, Qwen2 will further expand model sizes and multimodal capabilities, accelerating the development of the open-source AI field.

Model Information

The Qwen2 series comprises base and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Key information for each model is outlined in the table below:

| Models | Qwen2-0.5B | Qwen2-1.5B | Qwen2-7B | Qwen2-57B-A14B | Qwen2-72B |
| --- | --- | --- | --- | --- | --- |
| # Parameters | 0.49B | 1.54B | 7.07B | 57.41B | 72.71B |
| # Non-Emb Parameters | 0.35B | 1.31B | 5.98B | 56.32B | 70.21B |
| GQA | True | True | True | True | True |
| Tie Embedding | True | True | False | False | False |
| Context Length | 32K | 32K | 128K | 64K | 128K |

Specifically, in Qwen1.5, only Qwen1.5-32B and Qwen1.5-110B used Group Query Attention (GQA). This time, we applied GQA to all model sizes so that they benefit from faster inference and lower memory usage. For the smaller models, we prefer tied embeddings, because their large, sparse embedding matrices account for a significant portion of the total parameters.
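As a quick, hedged illustration of what these choices look like in practice, the snippet below reads a checkpoint's configuration with the Hugging Face transformers library. It assumes access to the public Qwen/Qwen2-7B-Instruct repository, and the head counts noted in the comments are indicative rather than authoritative.

```python
# Minimal sketch: inspect a Qwen2 checkpoint's config to see GQA and tied embeddings.
# Assumes the Hugging Face `transformers` library and network access to the Hub.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")

# GQA: fewer key/value heads than query heads.
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # e.g. 28 query heads vs. 4 KV heads
# Tied embeddings are enabled only on the smaller models (0.5B / 1.5B).
print(cfg.tie_word_embeddings)                           # expected False for the 7B model
```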

In terms of context length, all base language models were pre-trained on data with a context length of 32K tokens, and we observed satisfactory extrapolation up to 128K in PPL evaluations. For instruction-tuned models, however, we are not satisfied with PPL evaluations alone; we need the models to correctly understand long contexts and complete tasks. The table above lists the context-length capabilities of the instruction-tuned models, evaluated with the Needle in a Haystack task. Notably, when enhanced with YARN, both Qwen2-7B-Instruct and Qwen2-72B-Instruct exhibit impressive capabilities, handling context lengths of up to 128K tokens.
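As a hedged sketch of how such YARN-style extrapolation is typically enabled for inference, the snippet below adds a rope_scaling entry to a local checkpoint's config.json. The local path, keys, and scaling factor are assumptions here and should be verified against the official Qwen2 model cards.

```python
# Hedged sketch: enable YARN-style RoPE scaling for long-context inference by
# editing a local checkpoint's config.json. Path and values are assumptions;
# check the official Qwen2 model card before relying on them.
import json
from pathlib import Path

config_path = Path("Qwen2-7B-Instruct/config.json")  # hypothetical local checkpoint directory
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,                              # 32K pre-training length x 4 = 128K
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```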

We have made significant efforts to increase both the quantity and quality of the pre-training and instruction-tuning datasets, which cover many languages beyond English and Chinese, to enhance the models' multilingual capabilities. Although large language models inherently generalize to other languages to some extent, we explicitly included 27 additional languages in training:

| Region | Languages |
| --- | --- |
| Western Europe | German, French, Spanish, Portuguese, Italian, Dutch |
| Eastern and Central Europe | Russian, Czech, Polish |
| Middle East | Arabic, Persian, Hebrew, Turkish |
| East Asia | Japanese, Korean |
| Southeast Asia | Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog |
| South Asia | Hindi, Bengali, Urdu |

Additionally, we have invested considerable effort in addressing the code-switching that often arises in multilingual evaluations, and our models' ability to handle this phenomenon has improved significantly. Evaluations with prompts that typically trigger cross-lingual code-switching confirm a marked reduction in such issues.

Performance

Comparative test results show that the performance of large-scale models (with over 70B parameters) has significantly improved compared to Qwen1.5. This test centers on the large-scale model Qwen2-72B. In terms of base language models, we compared the performance of Qwen2-72B with the current best open-source models in natural language understanding, knowledge acquisition, programming abilities, mathematical abilities, multilingual abilities, and more. Thanks to carefully selected datasets and optimized training methods, Qwen2-72B outperforms leading models like Llama-3-70B, and even surpasses the previous generation Qwen1.5-110B with fewer parameters.

After extensive large-scale pre-training, we conducted post-training to further enhance Qwen's intelligence and bring it closer to human capabilities. This process further improved the model's abilities in coding, mathematics, reasoning, instruction following, multilingual understanding, and more. It also aligns the model's outputs with human values, ensuring they are helpful, honest, and harmless. Our post-training phase is designed around the principles of scalable training with minimal human annotation. Specifically, we studied how to obtain high-quality, reliable, diverse, and creative demonstration and preference data through various automated alignment strategies, such as rejection sampling for mathematics, execution feedback for coding and instruction following, back-translation for creative writing, and scalable supervision for role-playing. For training, we combined supervised fine-tuning, reward-model training, and online DPO training, and we adopted a novel online merging optimizer to minimize the alignment tax. Together, these efforts significantly enhanced the capabilities and intelligence of our models, as the evaluation results below demonstrate.
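As background for readers unfamiliar with DPO, the sketch below shows a generic version of the (offline) DPO objective under its standard formulation. It is an illustration only, not Qwen's training code, and it omits the online variant and the merging optimizer mentioned above.

```python
# Generic sketch of the Direct Preference Optimization (DPO) loss, not Qwen's code.
# Inputs are the summed log-probabilities of the chosen and rejected responses
# under the policy being trained and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs (lower is better for the policy)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref, preferred answer
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref, rejected answer
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```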

We conducted a comprehensive evaluation of Qwen2-72B-Instruct across 16 benchmarks in various fields. Qwen2-72B-Instruct achieved a balance between better capabilities and alignment with human values. Specifically, Qwen2-72B-Instruct significantly outperformed Qwen1.5-72B-Chat in all benchmarks and achieved competitive performance compared to Llama-3-70B-Instruct.

At smaller scales, the Qwen2 models also outperform SOTA models of similar or even larger size. Compared with recently released SOTA models, Qwen2-7B-Instruct still shows an advantage across various benchmarks, especially on coding and Chinese-related metrics.

Highlights

Coding and Mathematics

We have always been committed to enhancing Qwen's advanced features, especially in coding and mathematics. In coding, we successfully integrated CodeQwen1.5's code training experience and data, resulting in significant improvements in Qwen2-72B-Instruct's capabilities in various programming languages. In mathematics, by leveraging extensive and high-quality datasets, Qwen2-72B-Instruct has demonstrated stronger abilities in solving mathematical problems.

Long Context Understanding

In Qwen2, all instruction-tuned models were trained on 32K-token contexts and use techniques such as YARN or Dual Chunk Attention to extrapolate to longer context lengths.

In our Needle in a Haystack tests, Qwen2-72B-Instruct handles information-extraction tasks in a 128K context flawlessly; combined with its inherently strong performance, this makes it the preferred choice for long-text tasks when resources are sufficient.

The other models in the series are also impressive: Qwen2-7B-Instruct handles contexts of up to 128K tokens almost perfectly, Qwen2-57B-A14B-Instruct manages up to 64K, and the two smaller models support 32K.
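For context, a Needle in a Haystack probe roughly works as sketched below. This is a generic illustration rather than the exact harness used for the results above; the helper name and the rough 4-characters-per-token estimate are assumptions.

```python
# Generic needle-in-a-haystack sketch (not the exact harness used above): bury a
# known fact at a chosen depth inside long filler text, then ask the model for it.
import random

def build_haystack_prompt(needle: str, filler_sentence: str,
                          target_tokens: int, depth: float) -> str:
    """Place `needle` at roughly `depth` (0.0-1.0) inside ~target_tokens of filler."""
    n_sentences = max(1, (target_tokens * 4) // len(filler_sentence))  # ~4 chars per token
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    context = " ".join(sentences)
    return (f"{context}\n\n"
            "Based only on the text above, what is the magic number? Answer briefly.")

prompt = build_haystack_prompt(
    needle="The magic number is 742917.",
    filler_sentence="The quick brown fox jumps over the lazy dog.",
    target_tokens=32_000,
    depth=random.random(),  # vary the needle position across trials
)
```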

In addition to long context models, we have also open-sourced an agent solution for efficiently processing documents containing up to 1 million tokens. For more details, please refer to our dedicated blog post on this topic.

Safety and Responsibility

The table below shows the proportion of harmful responses generated by large models for four categories of multilingual unsafe queries (illegal activities, fraud, pornography, privacy violence). The test data comes from Jailbreak prompts translated into multiple languages for evaluation. We found that Llama-3 could not effectively handle multilingual prompts, so it was not included in the comparison. Based on significance testing (p-values), the Qwen2-72B-Instruct model's safety performance is comparable to GPT-4 and significantly better than Mistral-8x22B.

Each cell lists the harmful-response rate for GPT-4 / Mistral-8x22B / Qwen2-72B-Instruct.

| Language | Illegal Activities | Fraud | Pornography | Privacy Violence |
| --- | --- | --- | --- | --- |
| Chinese | 0% / 13% / 0% | 0% / 17% / 0% | 43% / 47% / 53% | 0% / 10% / 0% |
| English | 0% / 7% / 0% | 0% / 23% / 0% | 37% / 67% / 63% | 0% / 27% / 3% |
| Spanish | 0% / 13% / 0% | 0% / 7% / 0% | 15% / 26% / 15% | 3% / 13% / 0% |
| Portuguese | 0% / 7% / 0% | 3% / 0% / 0% | 48% / 64% / 50% | 3% / 7% / 3% |
| French | 0% / 3% / 0% | 3% / 3% / 7% | 3% / 19% / 7% | 0% / 27% / 0% |
| Korean | 0% / 4% / 0% | 3% / 8% / 4% | 17% / 29% / 10% | 0% / 26% / 4% |
| Japanese | 0% / 7% / 0% | 3% / 7% / 3% | 47% / 57% / 47% | 4% / 26% / 4% |
| Russian | 0% / 10% / 0% | 7% / 23% / 3% | 13% / 17% / 10% | 13% / 7% / 7% |
| Arabic | 0% / 4% / 0% | 4% / 11% / 0% | 22% / 26% / 22% | 0% / 0% / 0% |
| Average | 0% / 8% / 0% | 3% / 11% / 2% | 27% / 39% / 31% | 3% / – / – |
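The article does not specify which significance test was used. As one hedged illustration of how such a comparison could be run, the sketch below applies Fisher's exact test from SciPy to two hypothetical harmful-response counts; all numbers here are made up for demonstration.

```python
# Illustration only: compare two hypothetical harmful-response rates with
# Fisher's exact test (the article does not state which test was actually used).
from scipy.stats import fisher_exact

n_prompts = 30                     # hypothetical number of prompts per language/category
harmful_a, harmful_b = 1, 8        # hypothetical harmful-response counts for two models

contingency = [[harmful_a, n_prompts - harmful_a],
               [harmful_b, n_prompts - harmful_b]]
_, p_value = fisher_exact(contingency)
print(f"p-value = {p_value:.4f}")  # a small p-value suggests the gap is unlikely to be chance
```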