Recently, the research team at Meta Reality Labs, in collaboration with Efficient, released an innovative generative model called "Pippo," which can generate multi-view video at resolutions up to 1K from a single casually captured photo. This technology marks another significant advance in computer vision and image generation.


The core of Pippo is its multi-view diffusion transformer. Unlike traditional generative models, Pippo requires no additional inputs such as fitted parametric body models or the camera parameters of the input image. Users simply provide an ordinary photo, and the system automatically generates a multi-angle video, presenting the subject in a more vivid, three-dimensional way.

Pippo is currently released as code only, without pre-trained weights. The repository includes the models, configuration files, inference code, and sample training code for the Ava-256 dataset. Developers can get started with training and experimentation by cloning and setting up the codebase with a few simple commands.

Future plans for the Pippo project include organizing and cleaning up the code and releasing inference scripts for pre-trained models. These improvements should make the project easier to use and help the technology see wider practical adoption.

Project: https://github.com/facebookresearch/pippo

Key Points:

🌟 The Pippo model can generate high-resolution multi-view videos from a single ordinary photo without any additional input.

💻 Only the code is released, with no pre-trained weights; developers can train and apply the model themselves.

🔍 The team plans to introduce more features and improvements in the future to enhance user experience.