Driven by multimodal large language models (MLLMs), remarkable advances have been made in image- and video-related tasks, including visual question answering, narrative generation, and interactive editing. However, fine-grained understanding of video content remains a significant challenge. It involves tasks such as pixel-level segmentation, tracking objects described in language, and visual question answering grounded in specific video prompts.


Despite the impressive performance of current state-of-the-art video perception models on segmentation and tracking tasks, they still fall short in open-ended language understanding and dialogue. Conversely, video MLLMs perform well on video understanding and question answering but struggle with perception tasks and visual prompts.

There are two main existing lines of work: multimodal large language models (MLLMs) and referring segmentation systems. MLLMs initially focused on improving multimodal fusion methods and feature extractors, gradually evolving into frameworks for instruction tuning on LLMs, such as LLaVA. Recently, researchers have attempted to unify image, video, and multi-image analysis in a single framework, such as LLaVA-OneVision. In parallel, referring segmentation systems have evolved from basic fusion modules to integrated segmentation and tracking. However, these approaches still lack a comprehensive integration of perception and language understanding capabilities.

Researchers from UC Merced, ByteDance's Seed team, Wuhan University, and Peking University have proposed Sa2VA, a unified model designed for dense grounded understanding of images and videos. The model supports a wide range of image and video tasks with minimal one-shot instruction tuning, overcoming the limitations of existing multimodal large language models.

Sa2VA integrates SAM-2 with LLaVA, unifying text, images, and videos in a shared LLM token space. In addition, the researchers introduce Ref-SAV, an automatically annotated dataset containing over 72K object expressions in complex video scenes, along with 2K manually verified video objects to ensure robust benchmarking.

The architecture of Sa2VA mainly consists of two parts: a LLaVA-like model and SAM-2, utilizing a novel decoupled design. The LLaVA-like component includes a visual encoder for processing images and videos, a visual projection layer, and an LLM for text token prediction. The system employs a unique decoupling approach that allows SAM-2 to operate alongside the pre-trained LLaVA model without direct token exchange, thus maintaining computational efficiency and allowing plug-and-play functionality with various pre-trained MLLMs.
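To make the decoupled flow concrete, here is a minimal, toy-scale sketch in PyTorch. It is not the authors' code: module names, dimensions, and the assumption that a single "[SEG]" token sits at the end of the sequence are all illustrative. It only shows the shape of the design: visual tokens and text share one LLM token space, the LLM predicts text tokens, and only the hidden state of a special segmentation token is projected into a prompt for a frozen SAM-2-like decoder.

```python
import torch
import torch.nn as nn

class Sa2VASketch(nn.Module):
    """Toy-scale illustration of the decoupled Sa2VA flow (all sizes are illustrative)."""

    def __init__(self, patch_dim=3 * 16 * 16, vis_dim=256, llm_dim=512,
                 prompt_dim=256, vocab_size=1000, mask_res=64):
        super().__init__()
        # LLaVA-like branch: visual encoder + projection layer + LLM (all stubbed).
        self.visual_encoder = nn.Linear(patch_dim, vis_dim)      # stands in for a ViT
        self.projector = nn.Linear(vis_dim, llm_dim)             # visual projection layer
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)            # predicts text tokens
        # Decoupled link: only the hidden state of a special "[SEG]" token is projected
        # into a prompt embedding for SAM-2; no other tokens are exchanged.
        self.seg_projector = nn.Linear(llm_dim, prompt_dim)
        # SAM-2 side (stub): kept frozen, consumes the prompt and produces mask logits.
        self.sam2_decoder = nn.Linear(prompt_dim, mask_res * mask_res)
        for p in self.sam2_decoder.parameters():
            p.requires_grad = False
        self.mask_res = mask_res

    def forward(self, frame_patches, text_embeds):
        # frame_patches: (B, N_patches, patch_dim); text_embeds: (B, N_text, llm_dim)
        vis_tokens = self.projector(self.visual_encoder(frame_patches))
        tokens = torch.cat([vis_tokens, text_embeds], dim=1)     # shared LLM token space
        hidden = self.llm(tokens)
        text_logits = self.lm_head(hidden)                       # answer / caption tokens
        seg_hidden = hidden[:, -1]                               # assume "[SEG]" is the last token
        prompt = self.seg_projector(seg_hidden)                  # the only signal crossing to SAM-2
        masks = self.sam2_decoder(prompt).view(-1, self.mask_res, self.mask_res)
        return text_logits, masks

model = Sa2VASketch()
frames = torch.randn(1, 196, 3 * 16 * 16)   # one frame split into 14x14 patches
text = torch.randn(1, 12, 512)              # already-embedded instruction tokens
logits, masks = model(frames, text)
print(logits.shape, masks.shape)            # (1, 208, 1000) and (1, 64, 64)
```

Because only the projected "[SEG]" hidden state crosses the boundary, the SAM-2 side can stay frozen and be swapped in behind different pre-trained MLLMs, which is what enables the plug-and-play property described above.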

Experimental results show that Sa2VA achieves state-of-the-art results on referring segmentation tasks, with the Sa2VA-8B model scoring 81.6, 76.2, and 78.9 cIoU on RefCOCO, RefCOCO+, and RefCOCOg respectively, surpassing previous systems such as GLaMM-7B. On dialogue capabilities, Sa2VA scores 2128 on MME, 81.6 on MMBench, and 75.1 on SEED-Bench.

Moreover, Sa2VA's performance on video benchmarks significantly exceeds the previous state of the art, VISA-13B, demonstrating its efficiency and effectiveness across image and video understanding tasks.

Paper: https://arxiv.org/abs/2501.04001

Model: https://huggingface.co/collections/ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093
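For readers who want to try the released checkpoints, below is a minimal loading sketch. The `trust_remote_code` loading path is standard Hugging Face usage; the `predict_forward` call and its arguments follow the model card's example as recalled here and should be verified against the card in the linked collection before use.

```python
# Hedged sketch of loading a released Sa2VA checkpoint with Hugging Face transformers.
# The checkpoint ships custom modeling code, hence trust_remote_code=True.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-8B"  # one of the checkpoints in the collection linked above

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

image = Image.open("demo.jpg").convert("RGB")
prompt = "<image>Please segment the person on the left."

# Custom entry point defined by the checkpoint's remote code (assumed signature;
# the model card documents the exact interface and returned mask format).
result = model.predict_forward(image=image, text=prompt, tokenizer=tokenizer)
print(result.get("prediction"))  # text answer; predicted masks are returned alongside it
```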

Highlights:

🌟 Sa2VA is a novel unified AI framework that achieves deep understanding of images and videos, overcoming the limitations of existing multimodal models.

📊 The model achieves state-of-the-art results on several benchmarks, covering both referring segmentation and dialogue capabilities.

🧠 Sa2VA's design effectively integrates visual and language understanding capabilities through a decoupled approach, supporting a wide range of image and video tasks.