Website Master Home (ChinaZ.com) June 14 News: ByteDance has released Depth Anything V2, a new depth model that delivers significant performance gains in monocular depth estimation. Compared with Depth Anything V1, V2 produces more refined details and is more robust, while also being far more efficient: over 10 times faster than models built on Stable Diffusion.


Key Features:

Refined Details: V2 has been optimized for fine detail, producing noticeably more refined depth predictions than V1.

High Efficiency and Accuracy: Compared to models built on Stable Diffusion, V2 is significantly more efficient and more accurate.

Multi-Scale Model Support: Models of different scales are provided, with parameter counts ranging from 25M to 1.3B, to accommodate various application scenarios (see the inference sketch after this list).

Key Practices: Performance improvements were achieved by replacing labeled real images with synthetic ones, expanding the capacity of the teacher model, and using large-scale pseudo-labeled real images to teach the student model.
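For readers who want to try the released models, here is a minimal inference sketch in PyTorch. It follows the interface published in the project's repository; the `DepthAnythingV2` class, the per-encoder config dictionary, and the checkpoint filenames are taken from that repository and should be treated as assumptions if the code has since changed.

```python
# Minimal inference sketch for Depth Anything V2 at different model scales.
# Assumes the official repository's package is installed and the chosen
# checkpoint has been downloaded to ./checkpoints/.
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# Configs for the released encoder scales (ViT-S is the ~25M-parameter variant).
MODEL_CONFIGS = {
    'vits': {'encoder': 'vits', 'features': 64,  'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
}

encoder = 'vits'  # pick a scale to trade accuracy against speed and memory
model = DepthAnythingV2(**MODEL_CONFIGS[encoder])
model.load_state_dict(
    torch.load(f'checkpoints/depth_anything_v2_{encoder}.pth', map_location='cpu'))
model.eval()

img = cv2.imread('example.jpg')   # BGR image at any resolution
depth = model.infer_image(img)    # H x W relative depth map as a numpy array
```

Swapping the `encoder` key is the only change needed to move between scales, which is what makes the multi-scale release convenient in practice.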

Three Key Practices for Enhancing Model Performance:

Use of Synthetic Images: All labeled real images were replaced with synthetic images, whose precise depth labels provide a cleaner training signal for the model.

Expansion of Teacher Model Capacity: Scaling up the capacity of the teacher model enhanced its generalization ability.

Application of Pseudo-Labeled Images: Using large-scale pseudo-labeled real images as a bridge to teach the student model improved its robustness.

Support for a Wide Range of Application Scenarios:

To meet a wide range of application needs, the researchers provide models of different scales and leverage their generalization capability by fine-tuning them with metric depth labels (a simplified fine-tuning sketch follows this section).

A diverse evaluation benchmark with sparse depth annotations was constructed to promote future research.
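As a rough illustration of what fine-tuning with metric depth labels involves, the sketch below runs a plain supervised loop over labeled pairs. The model, dataloader, and L1 loss are simplified stand-ins, not the authors' actual fine-tuning recipe.

```python
# Hypothetical sketch: adapting a pretrained relative-depth model to metric
# depth (e.g., meters) with labeled pairs. The model and dataloader are
# stand-ins; the real recipe may use different losses and schedules.
import torch
import torch.nn.functional as F

def finetune_metric(model, loader, epochs=5, lr=5e-6, device='cuda'):
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gt_depth, valid in loader:  # valid: mask of labeled pixels
            images = images.to(device)
            gt_depth, valid = gt_depth.to(device), valid.to(device)
            pred = model(images)                # B x H x W metric depth prediction
            # L1 loss on labeled pixels only; metric-depth work often uses
            # scale-invariant log losses instead.
            loss = F.l1_loss(pred[valid], gt_depth[valid])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```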

Training Method Based on Synthetic and Real Images:

Researchers first trained the largest teacher model on synthetic images, then generated high-quality pseudo-labels for large-scale unlabeled real images, and trained the student model on these pseudo-labeled real images.

The training process used 595K synthetic images and over 62M pseudo-labeled real images.
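In outline, that three-stage recipe can be sketched as follows. The helper functions, loss, and hyperparameters here are illustrative assumptions, not the released training code.

```python
# Illustrative sketch of the three-stage recipe described above; every helper
# name here is hypothetical, not the released training code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, device='cuda'):
    """Stage 2: run the synthetic-trained teacher over unlabeled real images."""
    teacher.to(device).eval()
    pairs = []
    for images in unlabeled_loader:
        preds = teacher(images.to(device))        # teacher's depth predictions
        pairs.append((images.cpu(), preds.cpu()))
    return pairs                                  # real images + pseudo depth labels

def train(model, pairs, epochs=10, lr=1e-5, device='cuda'):
    """Supervised loop shared by stage 1 (teacher) and stage 3 (student)."""
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in pairs:
            loss = F.l1_loss(model(images.to(device)), targets.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Usage outline (models and dataloaders are placeholders):
#   teacher = train(teacher_model, synthetic_pairs)            # 595K synthetic images
#   real_pairs = pseudo_label(teacher, unlabeled_real_loader)  # 62M+ real images
#   student = train(student_model, real_pairs)                 # student on pseudo-labels
```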

The launch of Depth Anything V2 showcases ByteDance's strength in deep learning, and its combination of efficiency and accuracy points to broad application potential in computer vision.

Project Address: https://depth-anything-v2.github.io/