Website Master (ChinaZ.com) June 17 News: Zhipu, Tsinghua University, and Peking University have jointly launched LVBench, a benchmark project for long-form video understanding. While existing multimodal large language models have made significant strides in short-video comprehension, they still struggle with videos spanning several hours. LVBench was created to fill this gap.


The project provides question-and-answer data built on hours-long videos, organized into 6 main categories and 21 subcategories. The videos cover a range of content types, including TV dramas, sports broadcasts, and everyday surveillance footage, all sourced from the public domain. The data has been meticulously annotated, and large language models were used to select particularly challenging questions. According to the announcement, the LVBench dataset covers tasks such as video summarization, event detection, character recognition, and scene understanding.
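To make the setup concrete, the sketch below shows how a multiple-choice video QA benchmark of this kind could be scored per category. It is only an illustration: the JSON layout and the field names (`video_id`, `question`, `options`, `answer`, `category`) are assumptions, not the actual LVBench format, and the `predict` callable stands in for whatever model is being evaluated. The official data schema and evaluation script are in the GitHub repository linked below.

```python
import json
from collections import defaultdict


def evaluate(annotation_path, predict):
    """Score a model on multiple-choice video QA, per category.

    `predict` is any callable taking (video_id, question, options)
    and returning the index of the chosen option.
    Field names below are hypothetical, not the real LVBench schema.
    """
    with open(annotation_path, encoding="utf-8") as f:
        items = json.load(f)  # assumed: a flat list of QA dicts

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        choice = predict(item["video_id"], item["question"], item["options"])
        total[item["category"]] += 1
        if choice == item["answer"]:
            correct[item["category"]] += 1

    # Accuracy per question category (e.g. summarization, event detection)
    return {cat: correct[cat] / total[cat] for cat in total}
```

A random-choice baseline (returning an arbitrary option index from `predict`) is a common sanity check for benchmarks like this, since long-video questions are designed so that shortcuts from a few sampled frames should not score well.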


LVBench aims not only to test models' reasoning and operational capabilities in long-form video scenarios, but also to drive breakthroughs and innovation in related technologies, injecting new momentum into applications such as embodied decision-making over long videos, in-depth film reviews, and professional sports commentary.

Many research institutions are already working with the LVBench dataset, developing large models for long-form video tasks and gradually pushing the boundaries of what artificial intelligence can understand in long-duration information streams, bringing new vitality to ongoing exploration in video understanding and multimodal learning.

GitHub: https://github.com/THUDM/LVBench

Project: https://lvbench.github.io

Paper: https://arxiv.org/abs/2406.08035