In the world of AI, getting machines to understand videos is much harder than getting them to understand images. Videos are dynamic, combining sound, motion, and a myriad of complex scenes. In the past, AI read videos the way one might read ancient scrolls, and was often left baffled.

But the introduction of VideoPrism might change everything. It is a video encoder developed by Google's research team that achieves state-of-the-art results on a wide variety of video understanding tasks with a single model. Whether classifying videos, localizing objects, generating captions, or answering questions about videos, VideoPrism handles it all.


How Is VideoPrism Trained?

Training VideoPrism is like teaching a child to observe the world. First, you show it a wide variety of videos, ranging from everyday life to scientific observations and everything in between. Then, you train it on high-quality video-caption pairs as well as noisy parallel text (such as transcripts from automatic speech recognition).

Pre-training Methods

Data: VideoPrism uses 36 million high-quality video-caption pairs and 582 million video clips with noisy parallel text.

Model architecture: a standard Vision Transformer (ViT) with a factorized design in space and time.

Training algorithms: video-text contrastive learning and masked video modeling.
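The first of these algorithms, video-text contrastive learning, pulls the embeddings of matched video-text pairs together while pushing mismatched pairs apart. Below is a minimal NumPy sketch of a symmetric InfoNCE-style loss, a common way to implement this idea; the embedding shapes and the temperature value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Row-wise cross-entropy against integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (video, text) pairs.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])  # matched pairs sit on the diagonal
    # Average the video->text and text->video directions.
    return 0.5 * (softmax_cross_entropy(logits, labels)
                  + softmax_cross_entropy(logits.T, labels))
```

The loss is low when each video embedding is closest to its own caption's embedding, which is exactly the alignment the encoder needs for retrieval-style tasks.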


Training proceeds in two stages. In the first stage, VideoPrism learns the correspondence between videos and text through contrastive learning. In the second stage, it deepens its understanding of video content through masked video modeling, aided by global-local distillation of the embeddings learned in the first stage.
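The core idea of the second stage, hiding part of the video tokens and scoring the model on what it had to reconstruct, can be illustrated with a toy example. The token shapes, the mask ratio, and the mean-squared-error target below are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def random_token_mask(num_tokens, mask_ratio, rng):
    """Boolean mask over token positions: True = hidden from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask

def masked_modeling_loss(target_tokens, predicted_tokens, mask):
    """Mean-squared error computed only on the masked positions, so the
    model is scored on what it reconstructed, not on tokens it could copy."""
    diff = (predicted_tokens - target_tokens)[mask]
    return float((diff ** 2).mean())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 4))  # 16 spatiotemporal tokens, 4-dim each
mask = random_token_mask(len(tokens), mask_ratio=0.75, rng=rng)
```

A high mask ratio forces the model to infer missing content from temporal and spatial context rather than interpolating from nearby tokens, which is what pushes it toward genuine video understanding.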

Researchers evaluated VideoPrism on a broad suite of video understanding tasks, and the results were impressive: it achieved state of the art on 30 of 33 benchmarks. Whether answering questions about web videos or tackling computer vision tasks in scientific domains, VideoPrism demonstrated strong capabilities.

The arrival of VideoPrism opens new possibilities for AI video understanding. It not only helps AI grasp video content better but may also play a significant role in fields such as education, entertainment, and security.

However, VideoPrism also faces some challenges, such as how to handle long videos and how to avoid introducing biases during training. These are issues that future research needs to address.

Paper Address: https://arxiv.org/pdf/2402.13217