SALMONN is an audio-text multimodal large language model framework designed to extend the understanding and processing capabilities of large language models to the general auditory domain. The framework combines a speech encoder from OpenAI's Whisper with the non-speech BEATs audio encoder and fuses their outputs through a window-level Q-Former, producing audio tokens at high temporal resolution for audio-text alignment. After the activation tuning stage, SALMONN achieves competitive performance on tasks such as audio captioning and speech translation, demonstrating general auditory capabilities.
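
The following is a minimal sketch (not the official SALMONN implementation) of the window-level Q-Former idea described above: concatenated Whisper and BEATs frame features are split into fixed-size windows, a small set of learnable queries cross-attends to each window, and the resulting tokens are projected into the LLM embedding space. All dimensions, the window size, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class WindowLevelQFormer(nn.Module):
    def __init__(self, feat_dim=2048, llm_dim=4096, num_queries=1, window_size=17):
        super().__init__()
        self.window_size = window_size
        # Learnable query tokens shared across all windows
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        # A single cross-attention layer standing in for the full Q-Former stack
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Linear projection into the LLM embedding space
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, audio_feats):  # (batch, frames, feat_dim)
        b, t, d = audio_feats.shape
        # Pad so the frame sequence splits evenly into windows
        pad = (-t) % self.window_size
        audio_feats = nn.functional.pad(audio_feats, (0, 0, 0, pad))
        n_win = audio_feats.shape[1] // self.window_size
        # Fold windows into the batch dimension: (b * n_win, window_size, d)
        windows = audio_feats.reshape(b * n_win, self.window_size, d)
        q = self.queries.unsqueeze(0).expand(b * n_win, -1, -1)
        # Queries attend only to frames within their own window,
        # which preserves temporal resolution at the window level
        out, _ = self.cross_attn(q, windows, windows)
        # (batch, n_win * num_queries, llm_dim) audio tokens for the LLM
        return self.proj(out).reshape(b, -1, self.proj.out_features)


# Example: ~30 s of audio at a 50 Hz frame rate -> 1500 frames (assumed rates)
feats = torch.randn(2, 1500, 2048)   # concatenated Whisper + BEATs features
tokens = WindowLevelQFormer()(feats)
print(tokens.shape)                  # torch.Size([2, 89, 4096]) with these settings
```

The key design point sketched here is that pooling happens per window rather than over the whole utterance, so the number of audio tokens grows with the audio length and the LLM retains fine-grained temporal information.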