Traditional video understanding models struggle with long videos, in large part because of the extremely long context such videos introduce. Although many studies have sought to improve video understanding, low training and inference efficiency has remained hard to overcome. To address these challenges, the research team developed HiCo, a technique that compresses the redundant parts of video information, significantly reducing computational demands while preserving key information.


Specifically, HiCo compresses video tokens hierarchically: a long video is segmented into shorter clips, and each clip's tokens are compressed, sharply reducing the number of tokens the model must process. This lowers the model's computational resource requirements while effectively extending the context window it can handle. The team further reduces the token count by pruning video tokens according to their semantic relevance to the user's query; a rough sketch of this two-level idea follows.
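To make the idea concrete, here is a minimal Python sketch of such two-level compression. It is illustrative only, not the paper's implementation: the pooling scheme, token budgets, and cosine-similarity pruning (`hico_style_compress`, `keep_per_clip`, `keep_frac`) are simplified assumptions standing in for HiCo's actual clip-level and query-conditioned compression.

```python
import numpy as np

def hico_style_compress(tokens, clip_len=8, keep_per_clip=64,
                        query=None, keep_frac=0.5):
    """Toy two-level token compression (illustrative, not the paper's method).

    tokens: (T, N, D) array of per-frame visual tokens.
    Level 1 (clip level): group frames into clips of `clip_len` frames and
    average-pool each clip down to `keep_per_clip` tokens.
    Level 2 (query level): if a query embedding is given, keep only the
    `keep_frac` fraction of tokens most similar to it.
    Assumes T is a multiple of clip_len for simplicity.
    """
    T, N, D = tokens.shape
    clips = [tokens[i:i + clip_len].reshape(-1, D) for i in range(0, T, clip_len)]
    pooled = []
    for clip in clips:
        groups = np.array_split(clip, keep_per_clip)          # contiguous token groups
        pooled.append(np.stack([g.mean(axis=0) for g in groups]))
    video_tokens = np.concatenate(pooled)                     # (num_clips * keep_per_clip, D)

    if query is not None:
        # Cosine similarity between every pooled token and the query embedding.
        sim = (video_tokens @ query) / (
            np.linalg.norm(video_tokens, axis=1) * np.linalg.norm(query) + 1e-8)
        k = int(len(video_tokens) * keep_frac)
        keep = np.sort(np.argsort(sim)[-k:])                  # keep temporal order
        video_tokens = video_tokens[keep]
    return video_tokens

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 196, 32))      # 64 frames x 196 tokens x dim 32
query = rng.normal(size=32)
out = hico_style_compress(frames, query=query)
print(frames.shape[0] * frames.shape[1], "->", out.shape[0])  # 12544 -> 256
```

Even in this toy version the savings compound: clip-level pooling cuts the token count by a large constant factor, and query-conditioned pruning halves it again, so the language model attends over hundreds of tokens instead of tens of thousands.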

In the practical implementation of long-video processing, "VideoChat-Flash" adopts a multi-stage learning scheme that progresses from short videos to long ones. Researchers first perform supervised fine-tuning on short videos with their corresponding annotations, then gradually introduce long videos, ultimately training on a mixed corpus of varying lengths. This curriculum improves the model's visual perception while providing rich data for long-video processing: the team built a large dataset comprising 300,000 hours of video and 200 million words of annotations.
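As a rough illustration of such a short-to-long curriculum, the configuration below uses hypothetical stage names, frame budgets, and data mixes; the paper's actual schedule and hyperparameters are not reproduced here.

```python
# Hypothetical short-to-long training curriculum (illustrative values only).
STAGES = [
    {"name": "short_video_sft", "max_frames": 16,  "data": "short_clips"},
    {"name": "long_video_sft",  "max_frames": 128, "data": "long_videos"},
    {"name": "mixed_sft",       "max_frames": 512, "data": "mixed_lengths"},
]

def run_curriculum(train_one_stage):
    """Run the stages in order; each stage sees longer videos than the last."""
    for stage in STAGES:
        print(f"stage={stage['name']} max_frames={stage['max_frames']} data={stage['data']}")
        train_one_stage(stage)

run_curriculum(lambda stage: None)  # replace the lambda with a real training step
```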

Additionally, the study introduces an improved, multi-hop variant of the "needle in a haystack" (NIAH) task for video. In this new benchmark, the model must not only locate a single target image within the video but also follow multiple interconnected image sequences, which probes deeper contextual understanding.
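To make the multi-hop setting concrete, here is a toy construction of one benchmark item. It is purely illustrative (the field names and chaining scheme are assumptions, not the benchmark's actual format): each hidden "needle" frame carries a clue pointing to the next, so the question can only be answered by following the entire chain rather than retrieving a single frame.

```python
import random

def build_multihop_niah(num_frames=1000, hops=3, seed=0):
    """Build a toy multi-hop needle-in-a-haystack item (illustrative only)."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(num_frames), hops))
    needles = []
    for i, pos in enumerate(positions):
        needles.append({
            "frame": pos,
            "payload": f"needle_{i}",
            # Each needle points to the frame of the next one; the last holds the answer.
            "points_to": positions[i + 1] if i + 1 < len(positions) else None,
        })
    question = (f"Start at frame {positions[0]} and follow the chain of clues; "
                f"what is the payload of the final needle?")
    return needles, question, needles[-1]["payload"]

needles, question, answer = build_multihop_niah()
print(question)
print("expected answer:", answer)
```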

Experimental results indicate that the method reduces computational demands by two orders of magnitude and performs strongly on benchmarks for both short and long videos, leading among comparable models on short-video understanding. It also outperforms existing open-source models on long-video understanding, demonstrating robust temporal localization capabilities.

Paper: https://arxiv.org/abs/2501.00574

Key Highlights:

🌟 Researchers proposed HiCo, a hierarchical video token compression technique that significantly reduces the computational cost of long-video processing.  

📹 The "VideoChat-Flash" system uses a multi-stage learning approach that mixes short and long videos during training, strengthening the model's understanding capabilities.  

🔍 Experiments show the method sets new performance standards across multiple benchmarks, making it one of the leading models for long-video processing.