The rapid development of language models has sparked widespread interest in Speech Understanding Language Models (SULMs). Recently, the ASLP laboratory at Northwestern Polytechnical University released the Open Speech Understanding Model (OSUM), which explores how to train and use speech understanding models effectively with the limited resources typical of academic settings, with the aim of promoting research and innovation in the academic community.

The OSUM model integrates the Whisper encoder with the Qwen2 language model and supports eight speech tasks: Automatic Speech Recognition (ASR), Speech Recognition with Timestamps (SRWT), Vocal Event Detection (VED), Speech Emotion Recognition (SER), Speaking Style Recognition (SSR), Speaker Gender Classification (SGC), Speaker Age Prediction (SAP), and Speech-to-Text Chat (STTC). With the ASR+X training strategy, the model optimizes speech recognition jointly with a target task (the "X"), which makes training efficient and stable and strengthens its multi-task learning capabilities.
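To make the ASR+X idea more concrete, here is a minimal sketch of how a training target might be assembled so that the model produces the transcript and the task answer together. It assumes the target is simply the transcript followed by the task label; the `Sample` class, the `build_asr_x_target` function, and the angle-bracket tag format are hypothetical illustrations, not OSUM's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    transcript: str  # reference text of the utterance
    task: str        # task abbreviation, e.g. "ASR", "SER", "SGC"
    label: str       # answer for the target task, e.g. "happy"


def build_asr_x_target(sample: Sample) -> str:
    """Build a text target: the transcript, then the task label as a tag.

    Hypothetical format for illustration only. For plain ASR there is
    no extra task, so the target is just the transcript.
    """
    if sample.task == "ASR":
        return sample.transcript
    return f"{sample.transcript} <{sample.label.upper()}>"


example = Sample(
    transcript="the weather is lovely today",
    task="SER",
    label="happy",
)
target = build_asr_x_target(example)
print(target)
```

Under this assumption, a single target sequence supervises both the recognition output and the task label, which is one way to read "optimizing speech recognition while performing target tasks".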

The OSUM release focuses not only on performance but also on transparency. Its training methods and data preparation processes have been made public to give the academic community a practical reference. According to technical report v2.0, the training data for OSUM has grown to 50.5K hours, up from the previous 44.1K hours. The total includes 3,000 hours of speaker gender classification data and 6,800 hours of speaker age prediction data. This expansion has improved the model's performance across its tasks.
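For a sense of scale, the figures quoted above can be worked through directly. The short calculation below uses only the hour counts given in this article; the variable names are illustrative.

```python
# Training data figures quoted in the article, in hours.
total_v2 = 50_500      # technical report v2.0
total_prev = 44_100    # previous version
gender_hours = 3_000   # gender classification data
age_hours = 6_800      # age prediction data

# Absolute and relative growth of the training set.
increase = total_v2 - total_prev
increase_pct = round(100 * increase / total_prev, 1)

# Share of the v2.0 total taken by each of the two subsets.
gender_share = round(100 * gender_hours / total_v2, 1)
age_share = round(100 * age_hours / total_v2, 1)

print(f"Increase: {increase} hours ({increase_pct}%)")
print(f"Gender data share: {gender_share}%")
print(f"Age data share: {age_share}%")
```

The growth is 6,400 hours, or about 14.5% over the previous total, while the gender and age subsets account for roughly 5.9% and 13.5% of the 50.5K hours.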

Evaluation results indicate that OSUM outperforms the Qwen2-Audio model on multiple tasks despite using significantly fewer computational resources and less training data. The evaluation covers both public and internal test sets, demonstrating OSUM's strong performance on speech understanding tasks.

The ASLP laboratory at Northwestern Polytechnical University stated that the goal of OSUM is to promote the development of advanced speech understanding technologies through an open research platform. Researchers and developers can freely use the model's code and weights, even for commercial purposes, thus accelerating the application and dissemination of the technology.

Project page: https://github.com/ASLP-lab/OSUM?tab=readme-ov-file

Key Points:  

🌟 The OSUM model combines the Whisper encoder with the Qwen2 language model, supporting eight speech tasks and facilitating multi-task learning.  

📊 In technical report v2.0, the training data volume increased from 44.1K to 50.5K hours, enhancing the model's performance.  

🆓 The model's code and weights are open for use under the Apache 2.0 license, encouraging widespread application in academia and industry.