Recently, Wenyi Yu from Tsinghua University and their co-authors introduced video-SALMONN, a model that not only comprehends sequences of visual frames, audio events, and music in videos but, more importantly, can also understand the speech content within them. This development marks a significant step forward in enabling machines to understand video content.

Video-SALMONN is an end-to-end audio-visual large language model (av-LLM) that connects pre-trained audio-visual encoders with the backbone of a large language model through a novel multi-resolution causal Q-Former (MRC Q-Former) structure. This architecture not only captures the fine-grained temporal information necessary for speech understanding but also ensures efficient processing of other video elements.
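
To make the connector idea concrete, here is a minimal, hypothetical sketch of how a Q-Former-style module can map features from frozen audio-visual encoders into an LLM's token space. The module names, dimensions, and the use of PyTorch's `TransformerDecoder` as a stand-in for the Q-Former are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): learned queries cross-attend to
# fused audio-visual features and emit a fixed number of tokens in the LLM's
# embedding space, which would then be prepended to the text prompt.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    def __init__(self, feat_dim=1024, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_llm = nn.Linear(feat_dim, llm_dim)   # project into the LLM token space

    def forward(self, av_feats):                     # av_feats: (B, T, feat_dim)
        b = av_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(tgt=q, memory=av_feats)   # queries attend to AV features
        return self.to_llm(out)                      # (B, num_queries, llm_dim)

# Frozen pre-trained encoders would produce av_feats; random tensors stand in here.
av_feats = torch.randn(2, 50, 1024)                  # 2 clips, 50 synchronised time steps
prefix_tokens = QFormerConnector()(av_feats)
print(prefix_tokens.shape)                           # torch.Size([2, 32, 4096])
```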


To help the model process different video elements in a balanced way, the research team proposed dedicated training methods, including a diversity loss and an unpaired audio-visual mixed training strategy, to prevent the video frames or any single modality from dominating.
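
As an illustration of the idea behind a diversity loss, the hedged sketch below penalises pairwise cosine similarity among the connector's output tokens so that they do not all collapse onto one frame or modality; the exact formulation and weighting used in the paper may differ.

```python
# Sketch of a diversity-style penalty (an assumption, not the paper's exact loss):
# tokens that become too similar to one another increase the penalty.
import torch
import torch.nn.functional as F

def diversity_penalty(tokens):
    """tokens: (B, N, D) output tokens of the audio-visual connector."""
    t = F.normalize(tokens, dim=-1)                   # unit-normalise each token
    sim = torch.bmm(t, t.transpose(1, 2))             # (B, N, N) pairwise cosine similarity
    eye = torch.eye(tokens.size(1), device=sim.device)
    return (sim - eye).abs().mean()                   # high when tokens collapse together

# In training, this penalty would be added to the usual language-modelling loss
# with a small weight; here we just evaluate it on random tokens.
tokens = torch.randn(2, 32, 4096)
print(diversity_penalty(tokens))
```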

On the newly introduced Speech-Audio-Visual Evaluation Benchmark (SAVE), video-SALMONN achieved over a 25% absolute accuracy improvement on video question-answering (video-QA) tasks and over a 30% absolute accuracy improvement on audio-visual question-answering tasks involving human speech. Additionally, video-SALMONN demonstrated strong video understanding and reasoning capabilities on tasks that other av-LLMs have not previously been able to handle.

At the core of video-SALMONN is the multi-resolution causal (MRC) Q-Former structure, which aligns synchronous audio-visual input features with the text representation space across three different time scales, catering to the varying dependencies of different tasks on video elements. Furthermore, to strengthen the temporal causal relationships between consecutive video frames, the MRC Q-Former includes a causal self-attention structure with special causal masking.
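
The sketch below illustrates, under simplified assumptions, two ingredients of this design: chunking a synchronised audio-visual feature sequence into windows at three temporal resolutions, and building a causal mask so that each window attends only to itself and earlier windows. The window sizes and masking granularity are illustrative, not the paper's exact configuration.

```python
# Illustrative sketch of multi-resolution windowing plus causal masking
# (assumed values; not the MRC Q-Former's actual hyperparameters).
import torch

def multi_resolution_windows(feats, window_sizes=(1, 5, 25)):
    """feats: (T, D). Return per-scale views of shape (num_windows, window, D)."""
    views = []
    for w in window_sizes:
        t = feats.size(0) // w * w                    # drop the ragged tail for simplicity
        views.append(feats[:t].reshape(-1, w, feats.size(1)))
    return views

def causal_window_mask(num_windows):
    """Boolean mask where True blocks attention: window i sees only windows <= i."""
    return torch.triu(torch.ones(num_windows, num_windows, dtype=torch.bool), diagonal=1)

feats = torch.randn(50, 1024)                         # 50 synchronised audio-visual time steps
for view in multi_resolution_windows(feats):
    print(view.shape, causal_window_mask(view.size(0)).shape)
```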

The introduction of video-SALMONN not only brings new research tools to the academic community but also opens up broad possibilities for practical applications. It makes interactions between technology and humans more natural and intuitive, making technology easier for users, especially children and the elderly, to learn and use. It also has the potential to improve the accessibility of technology for individuals with mobility impairments.

The development of video-SALMONN is a significant step towards achieving Artificial General Intelligence (AGI). By integrating speech input along with existing non-speech audio and visual inputs, such models will gain a comprehensive understanding of human interactions and environments, enabling applications across a broader range of fields.

This technology is undoubtedly set to have a profound impact on the analysis of video content, educational applications, and the improvement of people's quality of life. With continuous technological advancements, we have reason to believe that future AI will be even smarter and more aligned with human needs.

Paper link: https://arxiv.org/html/2406.15704v1