VLOGGER

Text- and voice-driven human video generation from a single input portrait image.

Common Product, Video, Video generation, Human synthesis
VLOGGER is a method for generating talking-human videos, driven by text and audio, from a single input portrait image. It builds on recent generative diffusion models and consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both temporal and spatial controls. This approach supports the generation of high-quality videos of variable length that are easily controllable through high-level representations of human faces and bodies. Unlike previous work, the method does not require training for each person and does not rely on face detection and cropping. It generates the complete image (rather than just the face or lips) and covers the broad range of scenarios needed to correctly synthesize humans who communicate (e.g., visible torsos or diverse subject identities).
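
The two-stage structure described above can be illustrated with a toy sketch. This is not VLOGGER's implementation: every name here (`MotionFrame`, `sample_motion`, `render_video`, `generate`) is hypothetical, and simple arithmetic with Gaussian noise stands in for the real diffusion models. It only shows the data flow: audio is mapped stochastically to a motion sequence, and each motion frame then conditions the rendering of one complete frame from the reference image.

```python
import random
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel values


@dataclass
class MotionFrame:
    """Hypothetical per-frame control signal for face and body."""
    expression: List[float]
    pose: List[float]


def sample_motion(audio: List[float], samples_per_frame: int, seed: int) -> List[MotionFrame]:
    """Stage 1 stand-in: map audio to motion, one MotionFrame per audio window.

    A real system would run a diffusion sampler here; this toy version
    adds seeded Gaussian noise to the mean audio energy of each window.
    """
    rng = random.Random(seed)
    frames = []
    for start in range(0, len(audio), samples_per_frame):
        window = audio[start:start + samples_per_frame]
        energy = sum(abs(x) for x in window) / len(window)
        frames.append(MotionFrame(
            expression=[energy + rng.gauss(0.0, 0.1) for _ in range(4)],
            pose=[rng.gauss(0.0, 0.1) for _ in range(3)],
        ))
    return frames


def render_video(reference_image: Image, motion: List[MotionFrame]) -> List[Image]:
    """Stage 2 stand-in: produce one full frame per motion frame.

    The reference image supplies appearance; the motion sequence supplies
    the per-frame control. Here the control just shifts pixel values.
    """
    video = []
    for m in motion:
        shift = sum(m.expression) / len(m.expression)
        video.append([[px + shift for px in row] for row in reference_image])
    return video


def generate(reference_image: Image, audio: List[float],
             samples_per_frame: int = 4, seed: int = 0) -> List[Image]:
    """Chain the two stages; video length follows audio length."""
    motion = sample_motion(audio, samples_per_frame, seed)
    return render_video(reference_image, motion)
```

In this sketch the number of output frames depends only on the audio length, which mirrors the variable-length property, and changing `seed` yields a different motion sequence for the same audio, which mirrors the stochastic first stage.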

VLOGGER Visit Over Time

Monthly Visits: 3699

Bounce Rate: 50.37%

Pages per Visit: 1.5

Visit Duration: 00:00:04

VLOGGER Visit Trend

VLOGGER Visit Geography

VLOGGER Traffic Sources

VLOGGER Alternatives