Google Research has recently introduced a groundbreaking technology called ReCapture, which lets users re-experience their own videos from entirely new perspectives. Given a source video, it generates a version of that video along a user-specified camera trajectory, allowing viewers to watch the content from angles not present in the original footage while preserving the original motion of the characters and scenes.
So how does ReCapture pull off this "magic"? The underlying pipeline is actually quite straightforward. It first uses multi-view diffusion models or point-cloud rendering techniques to produce a rough "anchor" video from the desired new perspective. Like unpolished jade, this preliminary video may have missing regions and temporal inconsistencies, and it can look noticeably shaky.
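For intuition, here is a minimal sketch of the point-cloud rendering route, not Google's actual code (ReCapture's implementation is unpublished). It assumes per-frame depth from some off-the-shelf monocular depth estimator plus known camera intrinsics K; the function names and the simple z-buffer splatting are illustrative choices.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) into camera-space 3D points using intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # pixel -> camera-space ray (z = 1)
    return rays * depth.reshape(-1, 1)         # scale each ray by its depth

def render_from_new_view(points, colors, K, R, t, h, w):
    """Splat colored 3D points into a new camera (R, t) with a nearest-point z-buffer."""
    cam = points @ R.T + t                     # world -> new-camera coordinates
    front = cam[:, 2] > 1e-6                   # keep only points in front of the camera
    cam, colors = cam[front], colors[front]
    proj = cam @ K.T
    pix = (proj[:, :2] / proj[:, 2:3]).astype(int)
    img = np.zeros((h, w, 3), dtype=np.uint8)  # unhit pixels stay black: these are the holes
    zbuf = np.full((h, w), np.inf)
    inb = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    for (x, y), z, c in zip(pix[inb], cam[inb, 2], colors[inb]):
        if z < zbuf[y, x]:                     # nearest surface wins
            zbuf[y, x] = z
            img[y, x] = c
    return img

# Per frame: points = unproject(depth, K)
#            img = render_from_new_view(points, frame.reshape(-1, 3), K, R_new, t_new, *depth.shape)
```

Pixels that no source point lands on stay black, and those holes, along with flicker between independently rendered frames, are exactly what the refinement stage described next has to fix.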
Next, ReCapture brings out its secret weapon, masked video fine-tuning, to meticulously polish this rough video. Like a skilled artisan working with two special tools, it uses a spatial LoRA and a temporal LoRA for repair and optimization. The spatial LoRA acts as a "beautician," learning the appearance of the characters and scene from the original video to improve clarity and visual quality. The temporal LoRA acts as a "rhythm master," learning the scene motion under the new trajectory so that playback becomes smoother and more natural.
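In code terms, a LoRA is just a small trainable low-rank update bolted onto a frozen pretrained layer. The hand-rolled sketch below illustrates that idea and the "masked" part of the training loss; the layer width, rank, and helper names are hypothetical, since the paper does not ship an implementation. In the real system, the two adapter sets would be injected into the spatial and temporal attention layers of a video diffusion model and trained on different data: the spatial LoRA on frames of the original video, the temporal LoRA on the anchor video.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

# Hypothetical layer width; one adapter per attention type, trained separately.
spatial_attn = LoRALinear(nn.Linear(320, 320))    # fit on source-video frames (appearance)
temporal_attn = LoRALinear(nn.Linear(320, 320))   # fit on the anchor video (motion)

def masked_diffusion_loss(pred_noise, true_noise, valid_mask):
    """The "masked" part: only pixels the anchor render actually covered
    contribute to the loss, so the holes don't teach the model garbage."""
    err = (pred_noise - true_noise) ** 2 * valid_mask
    return err.sum() / valid_mask.sum().clamp(min=1)
```

Because each adapter touches only a small number of parameters, the two can be fit on a single short video without disturbing the backbone, which is what makes this per-video customization cheap.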
After the collaborative efforts of these two "masters," the rough video transforms into a clear, coherent, and dynamic new video. As a finishing touch, ReCapture applies SDEdit, like a final layer of makeup, leaving the video looking polished and refined.
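SDEdit itself is simple to state: add noise to the video up to an intermediate point of the diffusion schedule, then run the reverse (denoising) process from there, so the model regenerates fine detail without straying far from the input. Here is a self-contained sketch against a generic DDPM-style sampler; eps_model, the schedule, and the strength value are placeholders, as ReCapture's exact diffusion backbone is not public.

```python
import torch

@torch.no_grad()
def sdedit_refine(x0, eps_model, alphas_cumprod, strength=0.3):
    """SDEdit-style polish: partially noise the input, then denoise it back.

    x0:             video tensor, e.g. (frames, channels, H, W), roughly in [-1, 1]
    eps_model:      callable (x, t) -> predicted noise, same shape as x
    alphas_cumprod: 1-D tensor of cumulative alpha products, one per timestep
    strength:       fraction of the schedule to traverse; small = light touch-up
    """
    T = len(alphas_cumprod)
    t0 = int(strength * (T - 1))
    a_bar = alphas_cumprod
    alphas = a_bar / torch.cat([torch.ones(1), a_bar[:-1]])   # recover per-step alpha_t

    # Forward jump: x_t0 = sqrt(abar_t0) * x0 + sqrt(1 - abar_t0) * noise
    x = a_bar[t0].sqrt() * x0 + (1 - a_bar[t0]).sqrt() * torch.randn_like(x0)

    # Standard DDPM ancestral reverse steps, but starting at t0 instead of T - 1
    for t in range(t0, -1, -1):
        eps = eps_model(x, t)
        mean = (x - (1 - alphas[t]) / (1 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            var = (1 - a_bar[t - 1]) / (1 - a_bar[t]) * (1 - alphas[t])
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```

The strength knob is the whole trick: at 0 nothing changes, at 1 the video is regenerated from pure noise, and small intermediate values let the model clean up artifacts while keeping content and motion intact.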
According to the Google researchers, ReCapture handles a wide range of video types and perspective changes without requiring extensive training data. This means even casual video enthusiasts can easily produce professional-grade "multi-camera" videos using ReCapture.
Project Link: https://generative-video-camera-controls.github.io/