Multimodal Large Language Models (MLLMs) have made significant strides in video understanding, but ultra-long videos remain a challenge. Such videos produce many thousands of visual tokens that exceed the models' maximum context length, and aggregating tokens to fit causes information loss. On top of that, the sheer number of video tokens incurs high computational cost.
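To get a sense of the scale involved, here is a rough back-of-the-envelope sketch. The sampling rate, per-frame token count, and context length below are illustrative assumptions, not figures from the Video-XL paper:

```python
# Rough token-budget estimate for an hour-long video (illustrative numbers only).
FRAMES_PER_SECOND = 1      # assumed sampling rate
TOKENS_PER_FRAME = 144     # assumed visual tokens per frame after the vision encoder
CONTEXT_LENGTH = 32_768    # assumed LLM context window

num_frames = 60 * 60 * FRAMES_PER_SECOND       # one hour of video -> 3,600 frames
visual_tokens = num_frames * TOKENS_PER_FRAME  # ~518,400 tokens

print(f"{num_frames} frames -> {visual_tokens:,} visual tokens")
print(f"Roughly {visual_tokens / CONTEXT_LENGTH:.0f}x a {CONTEXT_LENGTH:,}-token context window")
```

Even under these conservative assumptions, a single hour of video yields hundreds of thousands of visual tokens, which is why naive frame-by-frame encoding breaks down.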

To address these issues, the Beijing Academy of Artificial Intelligence, in collaboration with Shanghai Jiao Tong University, Renmin University of China, Peking University, and Beijing University of Posts and Telecommunications, has proposed Video-XL, an extra-long vision-language model designed for efficient hour-scale video understanding. The core of Video-XL is its visual context latent summarization technique, which leverages the LLM's inherent context modeling capability to compress long visual representations into a far more compact form.
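The paper's actual implementation is not reproduced here, but the general pattern can be sketched: split the long stream of visual tokens into chunks, append a few learnable summary tokens to each chunk, run them through a transformer, and keep only the summary positions as the compressed visual context. The module below is a minimal, hypothetical PyTorch sketch of that idea; its names, sizes, and chunking scheme are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    """Hypothetical sketch: compress chunks of visual tokens into a few summary tokens."""

    def __init__(self, dim=1024, chunk_size=144, num_summary=8, num_layers=2, num_heads=8):
        super().__init__()
        self.chunk_size = chunk_size
        self.num_summary = num_summary
        # Learnable summary tokens appended after each chunk of visual tokens.
        self.summary_tokens = nn.Parameter(torch.randn(num_summary, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, seq_len, dim), e.g. per-frame patch embeddings concatenated in time.
        b, n, _ = visual_tokens.shape
        summaries = []
        for start in range(0, n, self.chunk_size):
            chunk = visual_tokens[:, start:start + self.chunk_size]
            # Append the summary tokens to the chunk and encode them jointly.
            summary = self.summary_tokens.unsqueeze(0).expand(b, -1, -1)
            encoded = self.encoder(torch.cat([chunk, summary], dim=1))
            # Keep only the summary positions as the compressed representation of this chunk.
            summaries.append(encoded[:, -self.num_summary:])
        return torch.cat(summaries, dim=1)  # (batch, num_chunks * num_summary, dim)

# Toy usage: 4 frames' worth of visual tokens (4 x 144 = 576) shrink to 4 x 8 = 32 summary tokens.
model = LatentSummarizer()
frames = torch.randn(1, 144 * 4, 1024)
compressed = model(frames)
print(compressed.shape)  # torch.Size([1, 32, 1024])
```

In this toy setup every 144-token chunk is distilled into 8 summary tokens, an 18x reduction; note that Video-XL performs this summarization with the LLM's own context modeling rather than a separate encoder as sketched here.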


In simple terms, it compresses video content into a more concise form, similar to condensing an entire cow into a bowl of beef essence, making it easier for the model to digest and absorb.

This compression not only improves efficiency but also effectively retains key video information. Long videos often contain a great deal of redundancy, much like the proverbial old lady's foot-binding cloth: long and smelly. Video-XL can strip out these useless details and keep only the essence, so the model does not lose its way when working through long video content.


Video-XL is not only sound on paper but also strong in practice. It has achieved leading results on multiple long-video understanding benchmarks; on VNBench in particular, its accuracy is nearly 10% higher than the previous best methods.

More remarkably, Video-XL strikes a strong balance between efficiency and effectiveness: it can process 2,048 frames of video on a single 80 GB GPU while maintaining nearly 95% accuracy in the "needle in a haystack" evaluation.

The application prospects of Video-XL are also extensive. Beyond understanding general long videos, it can handle specific tasks such as movie summarization, surveillance anomaly detection, and ad placement recognition.

This means that in the future, watching movies will no longer require enduring lengthy plots; simply use Video-XL to generate a concise summary, saving time and effort. Alternatively, use it for monitoring surveillance footage, automatically identifying abnormal events, which is far more efficient than manual surveillance.

Project Link: https://github.com/VectorSpaceLab/Video-XL

Paper: https://arxiv.org/pdf/2409.14485