Zhipu AI has announced the open sourcing of the upgraded CogVLM2-Video model, a significant advance in video understanding. CogVLM2-Video addresses a key limitation of existing video understanding models, which lose temporal information, by feeding multi-frame video images together with their timestamps into the encoder. Using an automated method for constructing temporal-grounding data, the team generated 30,000 time-related video question-answer pairs and trained a model that achieves state-of-the-art performance on public video understanding benchmarks. CogVLM2-Video excels at video caption generation and temporal grounding, providing powerful tools for video generation and summarization tasks.

The CogVLM2-Video model achieves temporal grounding and time-related question answering by extracting frames from the input video and annotating each frame with timestamp information, so that the language model knows exactly which moment of the video every frame corresponds to.
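A minimal sketch of this timestamp-annotated frame sampling is shown below. The function names, the 24-frame budget, and the prompt format are illustrative assumptions, not the exact CogVLM2-Video implementation.

```python
# Uniformly sample frames from a video and pair each frame with its timestamp,
# then interleave the timestamps into the text prompt (illustrative sketch).
import cv2


def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Uniformly sample frames and record the second offset of each one."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]

    frames, timestamps = [], []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)          # raw frame for the vision encoder
        timestamps.append(idx / fps)  # time of this frame, in seconds
    cap.release()
    return frames, timestamps


def build_prompt(timestamps, question: str) -> str:
    """Tag each frame slot with its timestamp so the language model can
    refer to absolute times when answering."""
    time_tags = " ".join(f"[{t:.1f}s]<frame_{i}>" for i, t in enumerate(timestamps))
    return f"{time_tags}\nQuestion: {question}"
```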


To support large-scale training, the team developed an automated video question-answer data generation pipeline that combines an image understanding model with a large language model, reducing annotation cost while improving data quality. The resulting Temporal Grounding Question and Answer (TQA) dataset contains 30,000 records, providing rich temporal-grounding data for model training.
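The sketch below illustrates one plausible shape of such a pipeline: per-frame captions from an image understanding model are merged by a large language model into time-grounded question-answer pairs. The callables `caption_image` and `ask_llm`, and the prompt wording, are assumptions standing in for whichever models and prompts were actually used.

```python
# Hedged sketch of an automated temporal-grounding QA construction step.
from typing import Callable, Dict, List


def build_tqa_record(
    frames: List,                          # sampled frames, e.g. numpy arrays
    timestamps: List[float],               # second offset of each frame
    caption_image: Callable[[object], str],
    ask_llm: Callable[[str], str],
) -> Dict[str, str]:
    # 1. Describe each frame with the image understanding model.
    timed_captions = [
        f"[{t:.1f}s] {caption_image(frame)}"
        for frame, t in zip(frames, timestamps)
    ]

    # 2. Ask the language model to write a question whose answer is a time
    #    span, plus the grounded answer, from the timestamped captions alone.
    prompt = (
        "Frame-by-frame descriptions with timestamps:\n"
        + "\n".join(timed_captions)
        + "\nWrite one question about WHEN something happens in this video, "
          "and answer it with the start and end time in seconds."
    )
    qa_text = ask_llm(prompt)
    return {"captions": "\n".join(timed_captions), "qa": qa_text}
```

Because both stages are fully automated, records like this can be generated at the scale of tens of thousands of videos without manual annotation.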

The CogVLM2-Video model has demonstrated excellent performance on multiple public evaluation benchmarks, with strong quantitative results on VideoChatGPT-Bench, zero-shot QA, and MVBench.

Code: https://github.com/THUDM/CogVLM2

Project Website: https://cogvlm2-video.github.io

Online Trial: http://36.103.203.44:7868/