VLOGGER
Text and voice-driven human video generation from a single input portrait image.
Categories: Video generation, Human synthesis
VLOGGER is a method for generating text- and audio-driven talking-human videos from a single input portrait image, building on recent generative diffusion models. The method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with temporal and spatial controls. This design supports the generation of high-quality videos of variable length, easily controllable through high-level representations of human faces and bodies. Unlike prior work, VLOGGER requires no per-person training and does not rely on face detection and cropping. It generates complete images (rather than only faces or lips) and accounts for the broad range of scenarios needed to synthesize human communication correctly (e.g., visible torsos or diverse subject identities).
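The two-stage design described above can be illustrated with a minimal sketch. Everything here is a stubbed assumption for illustration: the function names, the motion-parameter dimensionality, and the feature shapes are hypothetical and do not reflect the authors' actual code or API. Stage 1 stands in for the stochastic audio-to-motion diffusion model; stage 2 stands in for the temporally controlled image-diffusion renderer conditioned on the portrait.

```python
# Hypothetical sketch of a VLOGGER-style two-stage pipeline.
# All names, shapes, and stubs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def audio_to_motion(audio_features: np.ndarray) -> np.ndarray:
    """Stage 1 (stub): map per-frame audio features to 3D face/body
    motion parameters. Random noise stands in for the stochastic
    sampling a real motion diffusion model would perform."""
    n_frames = audio_features.shape[0]
    motion_dim = 64  # assumed size of the pose/expression parameter vector
    return rng.standard_normal((n_frames, motion_dim))

def motion_to_video(portrait: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): render one full frame per motion vector,
    conditioned on the single input portrait. A real model would
    denoise video latents with spatial and temporal control; here
    we simply tile the portrait to show the data flow."""
    n_frames = motion.shape[0]
    return np.repeat(portrait[None], n_frames, axis=0)

portrait = rng.random((256, 256, 3))   # single input portrait image (H, W, C)
audio = rng.random((120, 80))          # e.g. 120 frames of assumed mel features
video = motion_to_video(portrait, audio_to_motion(audio))
print(video.shape)  # one full frame per audio frame: (120, 256, 256, 3)
```

The point of the sketch is the interface between the stages: audio drives motion parameters, and motion parameters (not the audio itself) drive the video renderer, which is what makes the output controllable through face/body representations.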
VLOGGER Visits Over Time
Monthly Visits: 2,811
Bounce Rate: 53.86%
Pages per Visit: 1.2
Visit Duration: 00:00:00