The Meta AI team has recently introduced LongVU, a novel spatio-temporal adaptive compression mechanism designed to improve language understanding of long videos. Traditional multimodal large language models (MLLMs) are constrained by context length when processing long videos, and LongVU was created to address this challenge.
LongVU works by filtering out repetitive frames and compressing tokens across frames to make efficient use of the context window, removing redundant information from the video while preserving visual detail.
Specifically, DINOv2 features are used to discard highly similar, redundant frames, and a text-guided cross-modal query then selectively reduces the remaining frame features.
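As a rough illustration of these two reduction steps, the sketch below first drops near-duplicate frames by cosine similarity of per-frame features, then keeps the frames most relevant to a text query. The feature shapes, the 0.9 similarity threshold, and the top-k selection rule are illustrative assumptions rather than LongVU's actual settings, and the features are assumed to be pre-extracted (e.g., DINOv2 global features and a pooled text embedding).

```python
# Minimal sketch of temporal frame reduction + text-guided frame selection.
# All thresholds, shapes, and selection rules here are illustrative assumptions.
import torch
import torch.nn.functional as F

def filter_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9) -> list[int]:
    """Temporal reduction: keep a frame only if its (assumed pre-extracted)
    global feature differs enough from the last kept frame. frame_feats: (T, D)."""
    feats = F.normalize(frame_feats, dim=-1)           # unit-normalize for cosine similarity
    kept = [0]                                         # always keep the first frame
    for t in range(1, feats.shape[0]):
        if torch.dot(feats[t], feats[kept[-1]]) < sim_threshold:
            kept.append(t)                             # keep only frames that add new content
    return kept

def select_frames_by_query(frame_feats: torch.Tensor, text_feat: torch.Tensor, k: int) -> list[int]:
    """Text-guided reduction: score frames by similarity to a text query
    embedding and keep the top-k most relevant ones."""
    scores = F.normalize(frame_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    return torch.topk(scores, k).indices.sort().values.tolist()

# Toy usage: 8 frames of 384-dim features, frames 1-3 nearly identical to frame 0.
feats = torch.randn(8, 384)
feats[1:4] = feats[0] + 0.01 * torch.randn(3, 384)
kept = filter_redundant_frames(feats)
print(kept)                                            # e.g. [0, 4, 5, 6, 7]
print(select_frames_by_query(feats[kept], torch.randn(384), k=3))
```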
In addition, LongVU applies spatial token compression that exploits temporal dependencies between frames, allowing it to process a large number of frames within a limited context length with minimal loss of visual information.
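A minimal sketch of such temporally aware spatial pruning, again under illustrative assumptions: within each frame, tokens that barely changed relative to the same spatial position in the previous frame are dropped, so static background costs few tokens while moving content is retained. The threshold and the per-position comparison rule are assumptions for illustration, not LongVU's exact design.

```python
# Sketch: prune spatial tokens of each frame against the previous frame.
import torch
import torch.nn.functional as F

def compress_spatial_tokens(tokens: torch.Tensor, sim_threshold: float = 0.8):
    """tokens: (T, N, D) visual tokens for T frames with N spatial positions each.
    Returns a list of (frame_index, kept_tokens) pairs."""
    normed = F.normalize(tokens, dim=-1)
    out = [(0, tokens[0])]                               # the first frame keeps all tokens
    for t in range(1, tokens.shape[0]):
        per_pos_sim = (normed[t] * normed[t - 1]).sum(-1)  # cosine similarity per spatial position
        keep = per_pos_sim < sim_threshold                 # keep only tokens that changed enough
        out.append((t, tokens[t][keep]))
    return out

# Toy usage: 4 frames, 196 spatial tokens of dimension 768 each.
compressed = compress_spatial_tokens(torch.randn(4, 196, 768))
print([kept.shape[0] for _, kept in compressed])          # number of tokens kept per frame
```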
On various video understanding benchmarks, LongVU outperforms existing methods, especially on tasks requiring comprehension of hour-long videos such as VideoMME and MLVU. Even when scaled down to a lightweight LLM, LongVU maintains strong performance at a smaller model size.
In practice, users can obtain detailed video descriptions by asking simple questions. For instance, when asked, "What is happening in this video?", LongVU accurately describes the scenes, characters, and interactions within the video. Whether the task is identifying objects or narrating a complex storyline, LongVU provides quick and accurate feedback.
Official Example:
User prompt: Please provide a detailed description of the video.
LongVU's response: The video begins with a scene set in a futuristic cityscape at night, where a humanoid robot stands in the foreground, its back to the camera, against a backdrop of illuminated buildings and a helicopter in flight. The setting is characterized by a blend of modern and cyberpunk elements, with neon signs and a wet street reflecting the city lights. As the video progresses, the robot remains stationary, with the helicopter moving closer, eventually hovering above the street, indicating a possible interaction or event unfolding between the robot and the helicopter.
The scene then shifts to an interior setting, likely a workshop or garage, filled with various electronic equipment and machinery. The robot is now in the foreground, facing the camera, with a person in the background, suggesting a narrative progression. The environment is cluttered, with the lighting casting a blue hue over the scene, emphasizing the technological theme.
As the video continues, the camera angle changes to show the robot from behind, with the person in the background now more visible, indicating movement. The workshop setting remains consistent, with the clutter and electronic equipment unchanged. The lighting and color scheme continue to highlight the technological ambiance.
The perspective then shifts to a side view of the robot, revealing more of its structure and the cluttered workshop environment. The person in the background is now closer to the robot, suggesting interaction or movement towards it. The lighting and color scheme remain consistent, maintaining the technological atmosphere.
Finally, the camera zooms in on the robot, focusing on its upper body and the intricate details of its internal components. The cluttered workshop environment becomes less visible, with the lighting casting a blue hue over the scene, emphasizing the technological theme. The camera's close-up shot highlights the robot's design and the complexity of its internal mechanisms, suggesting a narrative focus on the robot's internal workings or its role within the technological setting.
The rapid growth of long-form video content calls for more efficient processing methods, and LongVU brings new possibilities to the field of multimodal understanding.
Project page: https://vision-cair.github.io/LongVU/
Key Points:
1. 📽️ LongVU is a novel spatio-temporal adaptive compression mechanism aimed at enhancing language understanding capabilities for long videos.
2. 🔍 The technology utilizes DINOv2 features to eliminate redundant frames and achieves selective feature compression through cross-modal queries.
3. 🚀 LongVU outperforms existing methods on various video understanding benchmarks, particularly on tasks involving long video comprehension.