Researchers have recently introduced the ShareGPT4Video series, which aims to improve video understanding in large video-language models (LVLMs) and video generation in text-to-video models (T2VMs) through dense, precise captions.

The ShareGPT4Video series includes:

1) ShareGPT4Video, a dataset of dense captions for 40,000 videos of varying lengths and sources, annotated by GPT4V and built with a carefully designed data filtering and annotation strategy.

2) ShareCaptioner-Video, an efficient and capable video captioning model that works on arbitrary videos and has been used to annotate 4,800,000 high-quality, aesthetically appealing videos.

3) ShareGPT4Video-8B, a simple yet effective LVLM that achieves state-of-the-art performance on three advanced video benchmarks.

Human annotation is costly and does not scale. The study also found that having GPT4V caption videos with a simple multi-frame or frame-concatenation input strategy produces captions that lack detail and sometimes confuse the order of events. The research team believes the challenge in designing a high-quality video captioning strategy lies in three aspects:

1) Understanding precise temporal changes between frames.

2) Describing detailed content within frames.

3) Scaling the number of frames to videos of any length.

To address these challenges, the researchers designed a differential video captioning strategy that is stable, scalable, and efficient, and that can generate captions for videos of any resolution, aspect ratio, and length. ShareGPT4Video was built with this strategy. It contains 40,000 high-quality videos covering a wide range of categories, and its captions are rich in world knowledge, object attributes, camera movements, and key event details, with precise temporal descriptions.
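The differential captioning idea described above can be sketched as a loop over keyframes. This is a minimal illustration, not the authors' implementation: it assumes the strategy captions each keyframe relative to the previous keyframe and its caption, then merges the timestamped results into one video-level caption. The names `differential_caption`, `toy_captioner`, and `toy_summarizer` are hypothetical, and the toy functions stand in for calls to a real multimodal model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = str  # stand-in for an image; a real pipeline would pass pixel data

# Hypothetical interfaces (assumptions, not from the article):
# (previous_frame, previous_caption, current_frame) -> differential caption
DiffCaptioner = Callable[[Optional[Frame], Optional[str], Frame], str]
# [(timestamp_seconds, differential_caption), ...] -> full video caption
Summarizer = Callable[[Sequence[Tuple[float, str]]], str]


def differential_caption(
    keyframes: Sequence[Tuple[float, Frame]],
    captioner: DiffCaptioner,
    summarizer: Summarizer,
) -> str:
    """Caption each keyframe relative to the previous one, then summarize."""
    timeline: List[Tuple[float, str]] = []
    prev_frame: Optional[Frame] = None
    prev_caption: Optional[str] = None
    for timestamp, frame in keyframes:
        # Each call sees only a two-frame window, so the per-call input
        # size stays constant no matter how long the video is.
        caption = captioner(prev_frame, prev_caption, frame)
        timeline.append((timestamp, caption))
        prev_frame, prev_caption = frame, caption
    # Timestamps are kept so the summary can preserve event order.
    return summarizer(timeline)


def toy_captioner(
    prev_frame: Optional[Frame], prev_caption: Optional[str], frame: Frame
) -> str:
    """Toy stand-in for a multimodal model call."""
    if prev_frame is None:
        return f"The video opens on {frame}."
    return f"The scene changes from {prev_frame} to {frame}."


def toy_summarizer(timeline: Sequence[Tuple[float, str]]) -> str:
    """Toy stand-in for a language model that merges the timeline."""
    return " ".join(f"[{t:.1f}s] {c}" for t, c in timeline)


if __name__ == "__main__":
    frames = [(0.0, "a beach"), (2.0, "a surfer"), (4.0, "a wave")]
    print(differential_caption(frames, toy_captioner, toy_summarizer))
```

Under these assumptions, the number of model calls grows linearly with the number of keyframes while each call stays the same size, which is one way to address the three challenges listed earlier: temporal change, per-frame detail, and scaling to any length.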

Building on ShareGPT4Video, the team then developed ShareCaptioner-Video, a caption generation model that efficiently produces high-quality captions for any video. It has been used to annotate 4,800,000 aesthetically appealing videos, and the resulting captions were validated on a 10-second text-to-video generation task. ShareCaptioner-Video is a four-in-one video captioning model that supports fast captioning, sliding captioning, clip summarization, and prompt re-captioning.

For video understanding, the research team also validated the effectiveness of ShareGPT4Video on several current LVLM architectures and presented a strong new LVLM, ShareGPT4Video-8B.

Product Entry: https://top.aibase.com/tool/sharegpt4video