Alibaba Cloud has recently released a large-scale audio language model named Qwen-Audio, which can accept various audio signal inputs and perform audio analysis or directly respond to voice commands, significantly enhancing the voice interaction experience.

image.png

Product Entry:https://top.aibase.com/tool/qwen2-audio

In this release, Qwen2-Audio offers two unique voice interaction modes: voice chat and audio analysis. Users can interact with Qwen2-Audio via voice without the need for text input, and can also provide audio and text commands for analysis, bringing a more convenient experience.

Qwen2-Audio can intelligently understand the content in audio and respond appropriately to voice commands. For example, in an audio segment that includes sound, multi-speaker dialogue, and voice commands, Qwen2-Audio can directly understand the command and provide an explanation and response to the audio.

Additionally, the DPO has optimized the model's performance in terms of factual accuracy and adherence to expected behaviors. According to the AIR-Bench evaluation results, Qwen2-Audio outperforms previous SOTA models, such as Gemini-1.5-pro, in tests focused on audio-centric instruction tracking functions. Qwen2-Audio is open-source, aiming to promote the advancement of the multimodal language community.

It is reported that the Qwen2-Audio series will introduce two models: Qwen2-Audio and Qwen-Audio-Chat, offering users a richer audio interaction experience.

Researchers will conduct a comprehensive evaluation of the Qwen2-Audio model, assessing its performance across various tasks without any task-specific fine-tuning. In terms of English automatic speech recognition (ASR) results, Qwen2-Audio demonstrates higher performance compared to previous multi-task learning models.

image.png

In terms of chat capabilities, researchers measured the performance of Qwen2-Audio on the chat benchmark of AIR-Bench (Yang et al., 2024), showing state-of-the-art (SOTA) instruction tracking functions across speech, sound music, and mixed audio subsets. Compared to Qwen-Audio, it shows substantial improvements and significantly outperforms other LALMs.

Key Points:

🌟 Alibaba Cloud releases Qwen2-Audio, a groundbreaking large-scale audio language model, enhancing the voice interaction experience;

Qwen2-Audio can accept various audio signal inputs for audio analysis or directly respond to voice commands, greatly expanding voice interaction capabilities;

🌟 Through a three-stage training process, the model structure, training method, and performance of Qwen2-Audio are fully demonstrated, providing users with a more premium audio interaction experience.