Recently, a research team from Tsinghua University open-sourced its latest research result, Video-T1. At its core is Test-Time Scaling (TTS): investing more computational resources during the inference stage of video generation to significantly improve the quality and text-prompt consistency of generated videos, without costly model retraining. This approach opens up new possibilities for video generation.
What is "Test-Time Scaling"?
In the field of Large Language Models (LLMs), researchers have found that increasing computation during the testing phase can effectively improve model performance. Video-T1 borrows this idea and applies it to the field of video generation. Simply put, traditional video generation models directly generate a video after receiving a text prompt.
Video-T1, by contrast, performs multiple rounds of "search" and "selection" during generation: it produces several candidate videos, has a "test verifier" evaluate them, and keeps the highest-quality one. It is like a meticulous artist who tries different approaches and refines the details before settling on the final work.
Core Technology of Video-T1
Video-T1 does not add any training cost; instead, it focuses on using an existing model's capabilities more effectively. Its core idea can be understood as searching the model's "noise space" for a better video generation trajectory. To this end, the research team proposed two main search strategies:
Random Linear Search: randomly sample several Gaussian noises, let the video generation model gradually denoise each of them into a candidate video, have the test verifier score the candidates, and keep the one with the highest score.
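As a rough illustration, the sketch below shows what this best-of-N procedure could look like in Python. The `model.denoise` and `verifier.score` calls are hypothetical interfaces standing in for the video generation model and the test verifier; they are not the project's actual API.

```python
import torch

def random_linear_search(prompt, model, verifier, num_candidates=8, num_steps=50):
    """Best-of-N search: denoise several random noise samples and keep the
    candidate the verifier scores highest. `model` and `verifier` are
    placeholders for a video diffusion model and a quality/text-alignment
    scorer; their interfaces are assumptions, not the official Video-T1 API."""
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        # Sample an independent Gaussian noise tensor (frames, channels, H, W).
        noise = torch.randn(16, 3, 256, 256)
        # Run the full denoising trajectory conditioned on the text prompt.
        video = model.denoise(noise, prompt, steps=num_steps)
        # Score the finished candidate for quality and prompt alignment.
        score = verifier.score(video, prompt)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```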
Tree-of-Frames (ToF): Because running full-step denoising on all frames of every candidate simultaneously would be prohibitively expensive, ToF adopts a more efficient, autoregressive strategy. It divides video generation into three stages: first, image-level alignment of the initial frames, which influences the generation of subsequent frames; second, applying dynamic prompts in the test verifier that focus on motion stability and physical plausibility, with the feedback guiding the search; and finally, assessing overall video quality and selecting the video best aligned with the text prompt. This autoregressive approach lets ToF explore the space of possible videos more intelligently.
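A minimal sketch of one way to read this tree search is given below, assuming the video is grown clip by clip, each branch is extended with several candidate continuations, and an intermediate verifier prunes weak branches. All interfaces (`model.extend`, `verifier.score_partial`, `verifier.score_final`) are hypothetical, and the staging is a simplification of the method described above.

```python
import torch

def tree_of_frames(prompt, model, verifier, clips=4, branch=3, beam=2):
    """Simplified autoregressive tree search over partial videos. Each
    surviving branch is extended with `branch` candidate continuations,
    intermediate verifier feedback prunes to the top `beam` branches, and
    the finished video best aligned with the prompt is returned."""
    branches = [([], 0.0)]  # (list of generated clips, running score)
    for step in range(clips):
        candidates = []
        for frames, _ in branches:
            for _ in range(branch):
                # Extend this branch with one more clip, conditioned on the
                # prompt and the frames generated so far.
                new_clip = model.extend(frames, prompt)
                partial = frames + [new_clip]
                # Intermediate feedback: e.g. image-level alignment early on,
                # motion stability and physical plausibility later.
                score = verifier.score_partial(partial, prompt, stage=step)
                candidates.append((partial, score))
        # Prune: keep only the most promising branches.
        branches = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    # Final stage: pick the finished video best aligned with the prompt.
    videos = [torch.cat(frames) for frames, _ in branches]
    scores = [verifier.score_final(v, prompt) for v in videos]
    return videos[max(range(len(scores)), key=scores.__getitem__)]
```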
Significant Effects of TTS
Experimental results show that model performance keeps improving as test-time computation increases, i.e., as more candidate videos are generated. This means that even with the same video generation model, investing more inference time can yield higher-quality videos that follow the text prompt more closely. The researchers ran experiments on multiple video generation models, and TTS consistently improved performance across all of them. Because different test verifiers emphasize different aspects of evaluation, the rate and extent of the improvement also vary.
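This trend can be probed informally by sweeping the candidate budget with the earlier (hypothetical) random_linear_search sketch and watching how the best verifier score changes:

```python
# Sweep the candidate budget to see how the best verifier score responds to
# more test-time compute, reusing the hypothetical random_linear_search above.
for n in (1, 2, 4, 8, 16):
    _, score = random_linear_search(prompt, model, verifier, num_candidates=n)
    print(f"candidates={n:2d}  best verifier score={score:.3f}")
```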
Video-T1's TTS method delivers significant gains on common prompt categories (such as scenes and objects) and on dimensions that are easy to evaluate (such as image quality). The official video demonstrations show that TTS-processed videos are noticeably better in clarity, detail, and adherence to the text description. For example, in a video described as "a cat wearing sunglasses acting as a lifeguard by the pool," the cat is rendered more clearly and its lifeguard actions look more natural after TTS processing.
Challenges and Prospects
Although TTS brings significant progress in many respects, the researchers also point out that for attributes that are harder to evaluate, such as motion smoothness and temporal consistency (avoiding flicker), the gains from TTS are relatively limited. This is mainly because these attributes require precise control of cross-frame motion trajectories, an area where current video generation models still struggle.
Tsinghua University's open-sourced Video-T1, through its innovative test-time scaling strategy, provides a new and effective way to improve video generation quality. It avoids costly retraining and instead unlocks greater capabilities from existing models by more intelligently utilizing inference-time computational resources. With further research, we can expect TTS technology to play an increasingly important role in the field of video generation.
Project: https://top.aibase.com/tool/video-t1