Meta Open-Sources LongVU, a Long-Video LLM Project That Filters Redundant Frames to Understand Long Videos Efficiently and Accurately

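Recently, the Meta AI team introduced LongVU, a novel spatiotemporal adaptive compression mechanism designed to improve language understanding of long videos. Conventional multimodal large language models (MLLMs) are constrained by context length when processing long videos, and LongVU was built to address exactly this problem.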

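LongVU works mainly by filtering out repeated frames and compressing tokens across frames so that the available context length is used efficiently. This reduces redundant information in the video while preserving its visual detail.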

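Specifically, the team uses DINOv2 features to remove redundant frames that are highly similar to one another. Then, through text-guided cross-modal queries, it selectively reduces the frame features.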
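The two steps above can be illustrated with a minimal sketch. It is not LongVU's implementation: the short lists below stand in for real DINOv2 frame features and a text-query embedding, and the function names, the greedy "compare with the last kept frame" rule, the 0.95 threshold, and the `budget` parameter are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(frame_features, threshold=0.95):
    """Return the indices of frames to keep.

    A frame is dropped when it is at least `threshold`-similar to the most
    recently kept frame. The greedy rule and the threshold value are
    illustrative assumptions, not LongVU's published settings.
    """
    kept = []
    for idx, feat in enumerate(frame_features):
        if not kept or cosine_similarity(frame_features[kept[-1]], feat) < threshold:
            kept.append(idx)
    return kept


def frames_to_keep_full(frame_features, kept, query_feature, budget):
    """Among the kept frames, pick the `budget` frames most similar to the query.

    Assumption: these frames would keep their full features, while the
    remaining frames would be reduced to a coarser representation.
    """
    ranked = sorted(
        kept,
        key=lambda i: cosine_similarity(frame_features[i], query_feature),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy stand-ins for per-frame features (real features would come from DINOv2).
features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # near-duplicate of frame 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.05],  # near-duplicate of frame 2
    [0.0, 0.0, 1.0],
]
query = [0.0, 0.1, 1.0]  # toy stand-in for a text-query embedding

kept = drop_redundant_frames(features)
full = frames_to_keep_full(features, kept, query, budget=2)
print(kept, full)
```

In this toy input, frames 1 and 3 are near-duplicates of their predecessors, so only three frames survive the first step, and the query then decides which of those survivors keep their full features.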

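In addition, LongVU compresses spatial tokens based on the temporal dependencies between frames. This compression strategy lets LongVU process a large number of frames within a limited context length, with almost no loss of visual information.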
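One way to picture spatial token compression driven by temporal dependencies is sketched below. It is a hedged illustration rather than LongVU's actual code: it assumes a window of consecutive frames in which the first frame acts as an anchor and keeps all of its tokens, while a token in a later frame is dropped when it is nearly identical to the anchor's token at the same spatial position. The function name `compress_window` and the 0.9 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_window(window, threshold=0.9):
    """Return the (frame_index, token_index) pairs that survive compression.

    `window` is a list of frames, and each frame is a list of token vectors
    in the same spatial order. The anchor rule and the threshold are
    illustrative assumptions.
    """
    anchor = window[0]
    # The anchor frame keeps every spatial token.
    survivors = [(0, t) for t in range(len(anchor))]
    for f in range(1, len(window)):
        for t, token in enumerate(window[f]):
            # Keep a token only if it differs enough from the anchor's token
            # at the same spatial position.
            if cosine_similarity(anchor[t], token) < threshold:
                survivors.append((f, t))
    return survivors


# Toy window: 3 frames with 2 spatial tokens each.
window = [
    [[1.0, 0.0], [0.0, 1.0]],   # anchor frame: both tokens kept
    [[1.0, 0.05], [1.0, 0.0]],  # token 0 barely changed, token 1 changed
    [[0.9, 0.1], [0.0, 1.0]],   # both tokens close to the anchor
]
survivors = compress_window(window)
print(survivors)
```

Here only three of the six tokens survive, which shows how static regions of a scene can be stored once while changed regions are still retained.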

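Across a range of video understanding benchmarks, LongVU outperforms existing methods, especially on tasks that require understanding hour-long videos, such as VideoMME and MLVU. Even when paired with a lightweight LLM, LongVU delivers strong performance at a smaller model size.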

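In LongVU's demo examples, users can get a detailed description of a video by asking a simple question. For example, when a user asks "What is happening in this video?", LongVU accurately describes the scenes, the characters, and how they interact. Whether the request is a simple question about objects in the video or a complex plot description, LongVU responds quickly and accurately.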

Official example:
User prompt: Please provide a detailed description of the video.
LongVU's answer: The video begins with a scene set in a futuristic cityscape at night, where a humanoid robot stands in the foreground, its back to the camera, against a backdrop of illuminated buildings and a helicopter in flight. The setting is characterized by a blend of modern and cyberpunk elements, with neon signs and a wet street reflecting the city lights. As the video progresses, the robot remains stationary, with the helicopter moving closer, eventually hovering above the street, indicating a possible interaction or event unfolding between the robot and the helicopter.
The scene then shifts to an interior setting, likely a workshop or garage, filled with various electronic equipment and machinery. The robot is now in the foreground, facing the camera, with a person in the background, suggesting a narrative progression. The environment is cluttered, with the lighting casting a blue hue over the scene, emphasizing the technological theme.
As the video continues, the camera angle changes to show the robot from behind, with the person in the background now more visible, indicating movement. The workshop setting remains consistent, with the clutter and electronic equipment unchanged. The lighting and color scheme continue to highlight the technological ambiance.
The perspective then shifts to a side view of the robot, revealing more of its structure and the cluttered workshop environment. The person in the background is now closer to the robot, suggesting interaction or movement towards it. The lighting and color scheme remain consistent, maintaining the technological atmosphere.
Finally, the camera zooms in on the robot, focusing on its upper body and the intricate details of its internal components. The cluttered workshop environment becomes less visible, with the lighting casting a blue hue over the scene, emphasizing the technological theme. The camera's close-up shot highlights the robot's design and the complexity of its internal mechanisms, suggesting a narrative focus on the robot's internal workings or its role within the technological setting.

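The rapid growth of long-video content calls for more efficient processing, and the release of LongVU opens up new possibilities for multimodal understanding.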

Project page: https://vision-cair.github.io/LongVU/