In video analysis, object permanence — the understanding that an object continues to exist even when it is completely occluded — is an important cue for humans. However, most current object segmentation methods focus only on visible (modal) objects and cannot handle amodal objects (the visible plus the occluded portions).

To address this, researchers have proposed a two-stage method based on diffusion priors, called Diffusion-Vas, aimed at improving amodal segmentation and content completion in videos. The method tracks a specified target through the video and then uses a diffusion model to complete its occluded parts.


The first stage generates amodal masks for objects. The researchers infer which object boundaries are occluded by combining the visible (modal) mask sequence with pseudo-depth maps, which are obtained by running monocular depth estimation on the RGB video. The goal of this stage is to determine which parts of an object may be occluded in the scene and thereby recover its complete outline.
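To make the depth-based occlusion reasoning concrete, here is a toy numpy sketch (not the paper's model, which is a learned diffusion network): it grows the visible mask only into neighboring pixels where the pseudo-depth indicates something *nearer* than the object, i.e. a plausible occluder. The function name and all parameters are illustrative assumptions.

```python
import numpy as np

def estimate_amodal_mask(visible_mask, depth, iterations=3):
    """Toy heuristic: expand the visible mask into regions covered by a
    nearer occluder, approximating amodal completion from pseudo-depth.
    (Illustrative only; Diffusion-Vas learns this with a diffusion model.)"""
    # approximate the tracked object's depth from its visible pixels
    object_depth = depth[visible_mask].mean()
    amodal = visible_mask.copy()
    for _ in range(iterations):
        # 4-neighbour dilation implemented with array shifts
        grown = amodal.copy()
        grown[1:, :] |= amodal[:-1, :]
        grown[:-1, :] |= amodal[1:, :]
        grown[:, 1:] |= amodal[:, :-1]
        grown[:, :-1] |= amodal[:, 1:]
        # accept newly added pixels only where the scene depth is smaller
        # than the object's depth, i.e. an occluder sits in front of it
        occluded = grown & ~amodal & (depth < object_depth)
        amodal |= occluded
    return amodal

# a 4x6 scene: object at depth 5, a nearer occluder (depth 2) on the right
depth = np.full((4, 6), 5.0)
depth[:, 3:] = 2.0
visible = np.zeros((4, 6), dtype=bool)
visible[1:3, 1:3] = True
amodal = estimate_amodal_mask(visible, depth)
```

The mask grows rightward under the occluder but not into open space at the object's own depth, which mirrors the intuition of the first stage: depth tells us where an occlusion boundary is plausible.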

Building on the amodal masks from the first stage, the second stage completes the content of the occluded regions. The team conditions a generative model on the modal RGB content to fill in the occluded regions, ultimately producing complete amodal RGB content. The whole pipeline uses a conditional latent diffusion framework with a 3D UNet backbone, which helps ensure high fidelity of the generated results.
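The conditioning idea — generate only the occluded region while keeping the visible RGB content fixed — can be sketched with a minimal inpainting-style diffusion loop. This is a pixel-space toy with a user-supplied placeholder denoiser, not the paper's latent 3D-UNet model; every name below is an assumption for illustration.

```python
import numpy as np

def inpaint_occluded(rgb, amodal_mask, visible_mask, denoise_fn, steps=50, seed=0):
    """Toy conditional diffusion sampling loop: synthesise the occluded
    region (amodal minus visible) from noise, re-imposing the known
    visible content at every step. (Diffusion-Vas instead uses a
    conditional *latent* diffusion model with a 3D UNet backbone.)"""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(rgb.shape)      # start from pure noise
    hole = amodal_mask & ~visible_mask      # occluded region to fill in
    for t in range(steps, 0, -1):
        noise_level = t / steps
        # the denoiser predicts a cleaner estimate at this noise level
        x = denoise_fn(x, noise_level)
        if t > 1:
            x = x + noise_level * rng.standard_normal(rgb.shape)
        # condition on the known modal RGB content: keep it fixed
        x[~hole] = rgb[~hole]
    return x

# tiny video clip: 2 frames of 4x4 RGB, left half visible, rest occluded
T, H, W = 2, 4, 4
rng = np.random.default_rng(1)
rgb = rng.random((T, H, W, 3))
visible = np.zeros((T, H, W), dtype=bool)
visible[:, :, :2] = True
amodal = np.ones((T, H, W), dtype=bool)
denoise = lambda x, s: (1.0 - s) * x        # placeholder "denoiser"
out = inpaint_occluded(rgb, amodal, visible, denoise, steps=10)
```

The key design point the sketch mirrors is that generation is constrained by the visible evidence at every denoising step, so only the truly occluded pixels are hallucinated.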

To validate its effectiveness, the team benchmarked the method on four datasets. The results show it improves amodal segmentation accuracy in occluded regions by up to 13% over several state-of-the-art methods. It proved especially robust in complex scenes, coping well with strong camera motion and frequent complete occlusions.

This research not only improves the accuracy of video analysis but also offers a new perspective on object permanence in complex scenes. In the future, the technology could be applied in fields such as autonomous driving and surveillance video analysis.

Project: https://diffusion-vas.github.io/

Key Points:  

🌟 The research proposes a new method that uses diffusion priors to achieve amodal segmentation and content completion in videos.  

🖼️ The method has two stages: first generating amodal masks, then completing the content of the occluded regions.  

📊 Across multiple benchmarks, the method significantly improves amodal segmentation accuracy, performing especially well in complex scenes.