The Beijing Academy of Artificial Intelligence (BAAI) recently announced See3D, a 3D generation model that learns from large-scale unlabeled internet videos. The work marks an important step towards the vision of "See Video, Get 3D." Rather than relying on traditional camera parameters, See3D uses a visual-conditioning technique to generate multi-view images, with controllable camera direction and geometric consistency, from visual cues in the video alone. This removes the need for expensive 3D or camera annotations and allows 3D priors to be learned efficiently from internet videos.
See3D supports text-to-3D, single-view-to-3D, and sparse-view-to-3D generation, as well as 3D editing and Gaussian rendering. The model, code, and demo have been open-sourced for technical reference. Demonstrated capabilities include unlocking interactive 3D worlds, 3D reconstruction from sparse images, open-world 3D generation, and single-view 3D generation, showing See3D's broad applicability across 3D creative applications.
The motivation for this research stems from the limits of 3D data: traditional 3D data collection is time-consuming and costly, whereas videos, with their multi-view correlations and camera-motion cues, are a powerful resource for revealing 3D structure. See3D's solution covers dataset construction, model training, and a 3D generation framework. The team automatically filtered internet video data to build the WebVi3D dataset, which comprises 16 million video clips and 320 million frames. See3D then constructs a purely 2D visual signal by adding time-dependent noise to masked video data; this signal conditions a scalable multi-view diffusion model, enabling 3D generation without any camera conditions.
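The article does not give the exact noise schedule or masking strategy, so the following is only a minimal PyTorch sketch of the idea as described: masked video frames are corrupted with noise whose strength depends on the diffusion timestep, producing a purely 2D conditioning signal in place of camera parameters. The function name and the simple linear schedule are illustrative assumptions, not the official See3D implementation.

```python
import torch

def make_visual_condition(frames, mask, t, T=1000):
    """Sketch of a camera-free visual condition (assumed, not official code).

    frames: (B, N, C, H, W) multi-view frames sampled from a video clip
    mask:   (B, N, 1, H, W) binary mask; 1 = region kept as a clean condition
    t:      (B,) diffusion timestep per sample
    """
    noise = torch.randn_like(frames)
    # Noise strength grows with the timestep t (hypothetical linear schedule).
    alpha = 1.0 - t.float().div(T).view(-1, 1, 1, 1, 1)
    noisy = alpha * frames + (1.0 - alpha) * noise
    # Keep clean pixels where the mask is 1, time-dependent noise elsewhere;
    # the result is a purely 2D signal that can condition a multi-view
    # diffusion model without any camera annotations.
    return mask * frames + (1.0 - mask) * noisy
```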
See3D's advantages lie in data scalability, camera controllability, and geometric consistency. Because its training data is drawn from vast quantities of internet videos, the constructed multi-view dataset is substantially larger in scale than prior ones. The model supports scene generation along arbitrarily complex camera trajectories while maintaining geometric consistency across frames.
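To make "camera trajectory" concrete, the NumPy sketch below builds a simple circular orbit of camera-to-world poses of the kind a user might feed to a trajectory-controllable generator. The function name and conventions (y-up world, cameras looking at the origin) are assumptions for illustration and are not part of See3D's API.

```python
import numpy as np

def orbit_trajectory(n_views=24, radius=2.0, height=0.5):
    """Generate a circular camera orbit around the origin as (n_views, 4, 4)
    camera-to-world matrices; a generic example trajectory, not See3D code."""
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False):
        cam_pos = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -cam_pos / np.linalg.norm(cam_pos)      # look at the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        pose = np.eye(4)
        # Columns: right, up, forward axes and the camera position.
        pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, forward, cam_pos
        poses.append(pose)
    return np.stack(poses)
```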
By scaling up the dataset, See3D offers new insights for the development of 3D generation technology. The team hopes this work will encourage the 3D research community to pay attention to large-scale data without camera annotations, reduce the cost of 3D data collection, and narrow the gap with existing closed-source 3D solutions.
Project Address: https://vision.baai.ac.cn/see3d