Recently, a research team introduced L4GM, a large-scale 4D Gaussian reconstruction model that generates animated 3D objects from single-view video input, achieving impressive results.

The key to the model lies in its novel dataset and streamlined design, which let a single feed-forward pass complete in under a second while still producing high-quality output animations.

Video-to-4D Synthesis

L4GM can generate 4D objects from a video in just a few seconds. In the video example below, the original input video is shown alongside the corresponding 4D Gaussian reconstruction.

Reconstruction of Long, High-FPS Videos of Flexible Length

Additionally, it can reconstruct 10-second-long videos at 30 fps, as shown in the video example below.

4D Interpolation

The team also trained a 4D interpolation model that triples the output frame rate, as shown in the video example below.

Left: Before interpolation. Right: After interpolation
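To make "tripling the frame rate" concrete, here is a minimal numpy sketch of the idea. Note that the actual interpolation model is learned; linear blending of per-frame Gaussian parameters is only an illustrative stand-in, and the `(T, N, D)` array layout is an assumption for the sketch.

```python
import numpy as np

def triple_fps_linear(frames: np.ndarray) -> np.ndarray:
    """Illustrative stand-in for the learned 4D interpolation model:
    insert two linearly blended states between each pair of consecutive
    per-frame Gaussian parameter sets, tripling the frame rate.
    frames: (T, N, D) -- T timesteps, N Gaussians, D parameters each.
    Returns an array of shape (3*(T-1)+1, N, D)."""
    out = [frames[0]]
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a + (b - a) / 3.0)        # state at t + 1/3
        out.append(a + 2.0 * (b - a) / 3.0)  # state at t + 2/3
        out.append(b)                        # state at t + 1
    return np.stack(out)
```

The real model predicts the intermediate Gaussians with a network rather than blending linearly, so it can capture non-linear motion between frames.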

Constructing a Multi-View Video Dataset

The research team constructed a multi-view video dataset of curated, rendered animated objects from Objaverse. It covers 44,000 diverse objects with 110,000 animations, each rendered from 48 viewpoints, yielding 12 million videos and 300 million frames in total. Building on this dataset, L4GM sits directly on top of LGM, a pre-trained large-scale 3D reconstruction model that outputs 3D Gaussians from multi-view image input.

L4GM reconstructs a 3D Gaussian splatting representation for each frame of a video sampled at low fps, then upsamples the representation to a higher fps for temporal smoothness.
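The low-fps sampling step above can be sketched as a simple index-selection helper. This is an assumption about how the sampling might be organized, not the authors' code; the stride value and the inclusion of the final frame are illustrative choices.

```python
def low_fps_indices(num_frames: int, stride: int) -> list:
    """Pick every `stride`-th frame of the input video (plus the last
    frame, so the clip's endpoint is covered). Each sampled frame then
    gets its own per-frame 3D Gaussian reconstruction; the gaps between
    samples are later filled by the interpolation model."""
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)
    return idx
```

For a 30-frame clip with a stride of 10, this reconstructs frames 0, 10, 20, and 29, leaving the remaining timesteps to interpolation.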

To help the model learn temporal consistency, the research team added temporal self-attention layers to the base LGM and trained it with multi-view rendering losses at each timestep. A separately trained interpolation model then upsamples the representation to a higher frame rate by producing intermediate 3D Gaussian representations.
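A minimal numpy sketch of what a temporal self-attention layer does: each spatial token attends only to itself at other timesteps, so information is shared across frames without mixing spatial locations. This single-head version with identity Q/K/V projections is an illustration of the mechanism, not the paper's implementation, which uses learned projection weights inside the U-Net.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feats: np.ndarray) -> np.ndarray:
    """Single-head self-attention applied across the time axis only.
    feats: (T, N, D) -- T timesteps, N spatial tokens, D channels.
    Identity Q/K/V projections keep the sketch minimal."""
    T, N, D = feats.shape
    # (N, T, D): group each spatial token's trajectory through time
    x = feats.transpose(1, 0, 2)
    # Attention scores between timesteps, per spatial token: (N, T, T)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)
    # Weighted mix over time, then back to (T, N, D)
    return (softmax(scores) @ x).transpose(1, 0, 2)
```

Because attention runs only along the time axis, a feature that is already constant across frames passes through unchanged, which is the consistency bias such layers add.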

The research team demonstrated that, after training only on synthetic data, L4GM generalizes well to in-the-wild videos, producing high-quality animated 3D objects. The model takes a single-view video plus multi-view images at a single timestep as input and outputs a set of 4D Gaussians.

Technical Framework


The model takes a single-view video and single-timestep multi-view images as input and outputs a set of 4D Gaussians. It adopts a U-Net architecture, using cross-view self-attention for view consistency and temporal self-attention for temporal consistency.


L4GM supports autoregressive reconstruction: the multi-view renderings of the last reconstructed Gaussians serve as input to the next reconstruction, with a one-frame overlap between consecutive windows. Additionally, the research team trained a 4D interpolation model, which takes multi-view videos rendered from the reconstruction results as input and outputs interpolated Gaussians.
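The autoregressive scheme above can be sketched as a chunking helper: split a long video into reconstruction windows that overlap by exactly one frame, so each window's last frame seeds the next. The window length is an illustrative parameter, not a value from the paper.

```python
def autoregressive_chunks(num_frames: int, chunk_len: int) -> list:
    """Split `num_frames` video frames into reconstruction windows with
    a one-frame overlap: each window starts on the last frame of the
    previous one, so the rendering of that shared frame can condition
    the next reconstruction. Returns (start, end) pairs, end exclusive."""
    chunks, start = [], 0
    while start < num_frames - 1:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        start = end - 1  # one-frame overlap with the next window
    return chunks
```

For a 10-frame clip with 4-frame windows this yields (0, 4), (3, 7), (6, 10): frames 3 and 6 appear in two windows, which is what keeps consecutive reconstructions consistent.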

L4GM Application Scenarios Include:

Video Content Generation: L4GM can generate 4D models of animated objects from single-view video input, with broad applications in visual-effects production, game development, and beyond, such as generating effects animations and building virtual scenes.

Video Reconstruction and Restoration: L4GM can reconstruct long, high-frame-rate videos, improving quality and clarity for restoration and enhancement tasks. This is particularly useful in film restoration, video compression, and general video processing.

Video Interpolation: Through the trained 4D interpolation model, L4GM can increase the frame rate of videos, making them smoother. This has potential applications in video editing, slow-motion/fast-motion effect production, etc.

3D Asset Generation: L4GM can generate high-quality animated 3D assets, useful for virtual reality (VR), augmented reality (AR) applications, and 3D model generation in game development.

Product Entry: https://top.aibase.com/tool/l4gm