The Meta AI team has recently introduced LongVU, a novel spatio-temporal adaptive compression mechanism designed to improve language understanding of long videos. Traditional multimodal large language models (MLLMs) are constrained by context length when processing long videos, and LongVU was created to address this challenge.
LongVU works by filtering out repetitive frames and compressing tokens across frames to make efficient use of the context window, removing redundant information from the video while preserving visual detail.
Specifically, DINOv2 features are used to discard highly similar, redundant frames, and a text-guided cross-modal query then selectively reduces the remaining frame features.
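As a rough illustration of these two reduction steps, the sketch below first drops near-duplicate frames by cosine similarity of per-frame features, then keeps the frames most relevant to a text query. The feature shapes, the 0.9 similarity threshold, and the top-k selection rule are illustrative assumptions rather than LongVU's actual settings, and the features are assumed to be pre-extracted (e.g., DINOv2 global features and a pooled text embedding).

```python
# Minimal sketch of temporal frame reduction + text-guided frame selection.
# All thresholds, shapes, and selection rules here are illustrative assumptions.
import torch
import torch.nn.functional as F

def filter_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9) -> list[int]:
    """Temporal reduction: keep a frame only if its (assumed pre-extracted)
    global feature differs enough from the last kept frame. frame_feats: (T, D)."""
    feats = F.normalize(frame_feats, dim=-1)           # unit-normalize for cosine similarity
    kept = [0]                                         # always keep the first frame
    for t in range(1, feats.shape[0]):
        if torch.dot(feats[t], feats[kept[-1]]) < sim_threshold:
            kept.append(t)                             # keep only frames that add new content
    return kept

def select_frames_by_query(frame_feats: torch.Tensor, text_feat: torch.Tensor, k: int) -> list[int]:
    """Text-guided reduction: score frames by similarity to a text query
    embedding and keep the top-k most relevant ones."""
    scores = F.normalize(frame_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    return torch.topk(scores, k).indices.sort().values.tolist()

# Toy usage: 8 frames of 384-dim features, frames 1-3 nearly identical to frame 0.
feats = torch.randn(8, 384)
feats[1:4] = feats[0] + 0.01 * torch.randn(3, 384)
kept = filter_redundant_frames(feats)
print(kept)                                            # e.g. [0, 4, 5, 6, 7]
print(select_frames_by_query(feats[kept], torch.randn(384), k=3))
```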
In addition, LongVU applies spatial token compression that exploits temporal dependencies between frames, allowing it to process a large number of frames within a limited context length with minimal loss of visual information.
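A minimal sketch of such temporally aware spatial pruning, again under illustrative assumptions: within each frame, tokens that barely changed relative to the same spatial position in the previous frame are dropped, so static background costs few tokens while moving content is retained. The threshold and the per-position comparison rule are assumptions for illustration, not LongVU's exact design.

```python
# Sketch: prune spatial tokens of each frame against the previous frame.
import torch
import torch.nn.functional as F

def compress_spatial_tokens(tokens: torch.Tensor, sim_threshold: float = 0.8):
    """tokens: (T, N, D) visual tokens for T frames with N spatial positions each.
    Returns a list of (frame_index, kept_tokens) pairs."""
    normed = F.normalize(tokens, dim=-1)
    out = [(0, tokens[0])]                               # the first frame keeps all tokens
    for t in range(1, tokens.shape[0]):
        per_pos_sim = (normed[t] * normed[t - 1]).sum(-1)  # cosine similarity per spatial position
        keep = per_pos_sim < sim_threshold                 # keep only tokens that changed enough
        out.append((t, tokens[t][keep]))
    return out

# Toy usage: 4 frames, 196 spatial tokens of dimension 768 each.
compressed = compress_spatial_tokens(torch.randn(4, 196, 768))
print([kept.shape[0] for _, kept in compressed])          # number of tokens kept per frame
```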
On various video understanding benchmarks, LongVU outperforms existing methods, especially on tasks requiring comprehension of hour-long videos such as VideoMME and MLVU. Even when scaled down to a lightweight LLM, LongVU maintains strong performance at a smaller model size.
In practice, users can obtain detailed video descriptions by asking simple questions. For instance, when asked, "What is happening in this video?", LongVU accurately describes the scenes, characters, and interactions within the video. Whether the task is identifying objects or narrating a complex storyline, LongVU provides quick and accurate feedback.
Official Example:
User prompt: Please provide a detailed description of the video.
LongVU's response: The video begins with a scene set in a futuristic cityscape at night, where a humanoid robot stands in the foreground, its back to the camera, against a backdrop of illuminated buildings and a helicopter in flight. The setting is characterized by a blend of modern and cyberpunk elements, with neon signs and a wet street reflecting the city lights. As the video progresses, the robot remains stationary, with the helicopter moving closer, eventually hovering above the street, indicating a possible interaction or event unfolding between the robot and the helicopter.
The scene then shifts to an interior setting, likely a workshop or garage, filled with various electronic equipment and machinery. The robot is now in the foreground, facing the camera, with a person in the background, suggesting a narrative progression. The environment is cluttered, with the lighting casting a blue hue over the scene, emphasizing the technological theme.
As the video continues, the camera angle changes to show the robot from behind, with the person in the background now more visible, indicating movement. The workshop setting remains consistent, with the clutter and electronic equipment unchanged. The lighting and color scheme continue to highlight the technological ambiance.
The perspective then shifts to a side view of the robot, revealing more of its structure and the cluttered workshop environment. The person in the background is now closer to the robot, suggesting interaction or movement towards it. The lighting and color scheme remain consistent, maintaining the technological atmosphere.
Finally, the camera zooms in on the robot, focusing on its upper body and the intricate details of its internal components. The cluttered workshop environment becomes less visible, with the lighting casting a blue hue over the scene, emphasizing the technological theme. The camera's close-up shot highlights the robot's design and the complexity of its internal mechanisms, suggesting a narrative focus on the robot's internal workings or its role within the technological setting.
The rapid growth of long-form video content calls for more efficient processing methods, and LongVU brings new possibilities to the field of multimodal understanding.
Project page: https://vision-cair.github.io/LongVU/
Key Points:
1. 📽️ LongVU is a novel spatio-temporal adaptive compression mechanism aimed at enhancing language understanding capabilities for long videos.
2. 🔍 The technology utilizes DINOv2 features to eliminate redundant frames and achieves selective feature compression through cross-modal queries.
3. 🚀 LongVU outperforms existing methods on various video understanding benchmarks, particularly on tasks involving long video comprehension.