Recently, Alibaba has launched a new open-source speech model, Qwen2-Audio, based on its Qwen-Audio foundation. This model not only excels in speech recognition, translation, and audio analysis but also demonstrates significant enhancements in functionality and performance. Qwen2-Audio offers both a basic version and an instruction-tuned version, allowing users to query the audio model through voice and analyze the content.

image.png

For instance, users can have a female speaker say a passage, and Qwen2-Audio can determine her age or analyze her emotions; if a noisy audio clip is input, the model can dissect the various sound components within it. Qwen2-Audio supports multiple languages including Mandarin, Cantonese, French, English, and Japanese, which greatly facilitates the development of sentiment analysis and translation applications.

Product Entry: https://top.aibase.com/tool/qwen2-audio

Compared to the first-generation Qwen-Audio, Qwen2-Audio has undergone comprehensive optimization in architecture and performance. During the pre-training phase, this new model adopted more natural language prompts, replacing the previous complex hierarchical labels. This improvement allows the model to handle and respond to various tasks more adeptly, with significant enhancement in generalization capabilities.

The instruction-following capability of Qwen2-Audio has also greatly improved, enabling it to understand user instructions more accurately. For example, when a user issues the instruction "analyze the emotional tone in this audio," Qwen2-Audio can accurately judge the emotions contained in the audio. Additionally, the model introduces voice chat and audio analysis modes, making voice interactions more natural. In the audio analysis mode, Qwen2-Audio can deeply analyze various types of audio and provide detailed and accurate analysis results.

To ensure the model's outputs align with human expectations, Qwen2-Audio incorporates advanced techniques such as supervised fine-tuning and direct preference optimization. During human interaction, the model appears more natural and precise.

In terms of performance testing, Qwen2-Audio has performed excellently in multiple mainstream benchmark tests, especially in the accuracy of speech recognition and translation, surpassing OpenAI's Whisper-large-v3. The performance of this new model has not only sparked widespread industry attention but also heralds a new future for speech technology.

Key Points:

🌟 Qwen2-Audio is Alibaba's latest open-source speech model, supporting multiple languages and possessing powerful recognition and analysis capabilities.

🚀 Compared to its predecessor, Qwen2-Audio has undergone significant optimization in performance and architecture, enhancing its ability to understand and respond.

🏆 In several performance tests, Qwen2-Audio has outperformed OpenAI's Whisper, demonstrating strong competitiveness.