SALMONN Framework: Expanding General Auditory Capabilities of Large Language Models
站长之家
96
SALMONN is an audio-text multimodal large language model framework designed to expand the understanding and processing capabilities of large language models in the general auditory domain. The framework integrates components such as non-speech BEATs audio encoders, the OpenAI Whisper framework's speech encoders, and window-level Q-Former, achieving high levels of temporal resolution for audio-text alignment. After the activation adjustment phase, SALMONN has achieved competitive performance in tasks such as audio captioning and speech translation, demonstrating general auditory capabilities.
© Copyright AIbase Base 2024, Click to View Source - https://www.aibase.com/news/3667