In video analysis, object permanence — the understanding that an object continues to exist even when it is completely occluded — is an important cue for humans. However, most current object segmentation methods focus only on visible (modal) objects and cannot handle amodal objects (the visible plus the occluded portions).

To address this, researchers have proposed a two-stage method based on diffusion priors, called Diffusion-Vas, aimed at improving amodal segmentation and content completion in videos. The method tracks a specified target through the video and then uses a diffusion model to complete its occluded parts.


The first stage generates amodal masks for objects. The researchers infer which object boundaries are occluded by combining the visible (modal) mask sequence with pseudo-depth maps, which are obtained by running monocular depth estimation on the RGB video. The goal of this stage is to determine which parts of an object may be occluded in the scene and thereby recover its complete outline.
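To make the depth-based occlusion reasoning concrete, here is a toy numpy sketch (not the paper's model, which is a learned diffusion network): it grows the visible mask only into neighboring pixels where the pseudo-depth indicates something *nearer* than the object, i.e. a plausible occluder. The function name and all parameters are illustrative assumptions.

```python
import numpy as np

def estimate_amodal_mask(visible_mask, depth, iterations=3):
    """Toy heuristic: expand the visible mask into regions covered by a
    nearer occluder, approximating amodal completion from pseudo-depth.
    (Illustrative only; Diffusion-Vas learns this with a diffusion model.)"""
    # approximate the tracked object's depth from its visible pixels
    object_depth = depth[visible_mask].mean()
    amodal = visible_mask.copy()
    for _ in range(iterations):
        # 4-neighbour dilation implemented with array shifts
        grown = amodal.copy()
        grown[1:, :] |= amodal[:-1, :]
        grown[:-1, :] |= amodal[1:, :]
        grown[:, 1:] |= amodal[:, :-1]
        grown[:, :-1] |= amodal[:, 1:]
        # accept newly added pixels only where the scene depth is smaller
        # than the object's depth, i.e. an occluder sits in front of it
        occluded = grown & ~amodal & (depth < object_depth)
        amodal |= occluded
    return amodal

# a 4x6 scene: object at depth 5, a nearer occluder (depth 2) on the right
depth = np.full((4, 6), 5.0)
depth[:, 3:] = 2.0
visible = np.zeros((4, 6), dtype=bool)
visible[1:3, 1:3] = True
amodal = estimate_amodal_mask(visible, depth)
```

The mask grows rightward under the occluder but not into open space at the object's own depth, which mirrors the intuition of the first stage: depth tells us where an occlusion boundary is plausible.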

Building on the amodal masks from the first stage, the second stage completes the content of the occluded regions. The team conditions a generative model on the modal RGB content to fill in the occluded regions, ultimately producing complete amodal RGB content. The whole pipeline uses a conditional latent diffusion framework with a 3D UNet backbone, which helps ensure high fidelity of the generated results.
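The conditioning idea — generate only the occluded region while keeping the visible RGB content fixed — can be sketched with a minimal inpainting-style diffusion loop. This is a pixel-space toy with a user-supplied placeholder denoiser, not the paper's latent 3D-UNet model; every name below is an assumption for illustration.

```python
import numpy as np

def inpaint_occluded(rgb, amodal_mask, visible_mask, denoise_fn, steps=50, seed=0):
    """Toy conditional diffusion sampling loop: synthesise the occluded
    region (amodal minus visible) from noise, re-imposing the known
    visible content at every step. (Diffusion-Vas instead uses a
    conditional *latent* diffusion model with a 3D UNet backbone.)"""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(rgb.shape)      # start from pure noise
    hole = amodal_mask & ~visible_mask      # occluded region to fill in
    for t in range(steps, 0, -1):
        noise_level = t / steps
        # the denoiser predicts a cleaner estimate at this noise level
        x = denoise_fn(x, noise_level)
        if t > 1:
            x = x + noise_level * rng.standard_normal(rgb.shape)
        # condition on the known modal RGB content: keep it fixed
        x[~hole] = rgb[~hole]
    return x

# tiny video clip: 2 frames of 4x4 RGB, left half visible, rest occluded
T, H, W = 2, 4, 4
rng = np.random.default_rng(1)
rgb = rng.random((T, H, W, 3))
visible = np.zeros((T, H, W), dtype=bool)
visible[:, :, :2] = True
amodal = np.ones((T, H, W), dtype=bool)
denoise = lambda x, s: (1.0 - s) * x        # placeholder "denoiser"
out = inpaint_occluded(rgb, amodal, visible, denoise, steps=10)
```

The key design point the sketch mirrors is that generation is constrained by the visible evidence at every denoising step, so only the truly occluded pixels are hallucinated.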

To validate its effectiveness, the team benchmarked the method on four datasets. The results show it improves amodal segmentation accuracy in occluded regions by up to 13% over several state-of-the-art methods. It proved especially robust in complex scenes, coping well with strong camera motion and frequent complete occlusions.

This research not only improves the accuracy of video analysis but also offers a new perspective on object permanence in complex scenes. In the future, the technology could be applied in fields such as autonomous driving and surveillance video analysis.

Project: https://diffusion-vas.github.io/

Key Points:  

🌟 The research proposes a new method that uses diffusion priors to achieve amodal segmentation and content completion in videos.  

🖼️ The method has two stages: first generating amodal masks, then completing the content of the occluded regions.  

📊 Across multiple benchmarks, the method significantly improves amodal segmentation accuracy, performing especially well in complex scenes.