ChinaZ.com, June 11 News: The Tencent Hunyuan team, in collaboration with Sun Yat-sen University and the Hong Kong University of Science and Technology, has introduced a new image-to-video model, "Follow-Your-Pose-v2". The model makes the leap from single-person to multi-person video generation: given a group photo, it can make everyone in the picture move simultaneously.
Key Highlights:
Supports multi-person action generation: produces multi-person motion videos while requiring less inference time.
Strong generalization: generates high-quality videos regardless of the subjects' age, clothing, or ethnicity, and remains robust to background clutter and complex actions.
Usable with everyday photos/videos: the model can be trained on, and generate from, ordinary photos (including snapshots) and videos; high-quality source images or footage are not required.
Correctly handles character occlusion: when multiple characters' bodies overlap in a single image, the model generates occlusion with the correct front-to-back relationships.
Technical Implementation:
The model uses "flow guidance" to inject background optical-flow information, enabling it to generate stable background animation even when the input suffers from camera shake or an unstable background.
Through "inference map guidance" and "depth map guidance", the model better understands the spatial layout of characters in the image and the spatial relationships among multiple characters, effectively addressing multi-character animation and body-occlusion problems.
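The idea of combining several per-frame guidance signals can be illustrated with a toy sketch. Note that the names, shapes, and channel counts below are illustrative assumptions, not the paper's actual implementation: each guidance signal (pose, background optical flow, depth) is rendered as a spatial map and concatenated along the channel axis to form a single conditioning tensor for the video generator.

```python
import numpy as np

# Illustrative shapes -- assumptions for this sketch, not the paper's config
H, W = 64, 64   # spatial resolution of the conditioning maps
T = 8           # number of video frames

# Hypothetical per-frame guidance maps:
pose_map  = np.random.rand(T, 1, H, W)   # rendered skeleton / pose heatmap
flow_map  = np.random.rand(T, 2, H, W)   # background optical flow (dx, dy)
depth_map = np.random.rand(T, 1, H, W)   # depth map, encoding front-back order

def stack_guidance(pose: np.ndarray, flow: np.ndarray,
                   depth: np.ndarray) -> np.ndarray:
    """Concatenate guidance signals along the channel axis.

    The combined tensor conditions the generator so that, at every
    spatial location and frame, it can see the target pose, the
    background motion, and the occlusion (depth) ordering at once.
    """
    return np.concatenate([pose, flow, depth], axis=1)

cond = stack_guidance(pose_map, flow_map, depth_map)
print(cond.shape)  # (8, 4, 64, 64): T frames, 1+2+1 guidance channels
```

In a real diffusion-based generator, a tensor like `cond` would typically be encoded and added to (or cross-attended with) the denoising network's features rather than consumed raw; the sketch only shows how heterogeneous guidance maps share one spatial grid.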
Evaluation and Comparison:
The team proposed a new benchmark, Multi-Character, containing approximately 4,000 frames of multi-character video, to evaluate multi-character generation quality.
Experimental results show that "Follow-Your-Pose-v2" outperforms state-of-the-art methods by more than 35% on two public datasets (TikTok and TED Talks) across seven metrics.
Application Prospects:
Image-to-video generation has broad application prospects in film and content production, augmented reality, game development, and advertising, making it one of the most closely watched AI technologies of 2024.
Additional Information:
The Tencent Hunyuan team also released an acceleration library for its open-source text-to-image model HunyuanDiT, significantly improving inference efficiency and cutting image generation time by 75%.
The HunyuanDiT model is now easier to adopt: users can invoke it with three lines of code via its official model repository on Hugging Face.
Paper link: https://arxiv.org/pdf/2406.03035
Project page: https://top.aibase.com/tool/follow-your-pose