In multimodal tasks, Vision-Language Models (VLMs) play a crucial role in areas such as image retrieval, image captioning, and medical diagnosis. These models aim to align visual and textual data in a shared representation so that information from one modality can be queried with the other. However, current VLMs still face significant challenges in understanding negation.


Negation is vital in many applications, such as distinguishing between "a room without windows" and "a room with windows." Despite significant progress made by VLMs, their performance drops considerably when handling negative statements. This limitation is particularly important in high-stakes fields such as security monitoring and healthcare.

Existing VLMs, like CLIP, use a shared embedding space to align visual and textual representations. While these models excel at tasks like cross-modal retrieval and image captioning, they struggle with negated statements. The root of this issue lies in a bias of the pre-training data, which is composed almost entirely of affirmative examples, leading the model to treat negated captions as near-synonyms of their affirmative counterparts. Compounding the problem, existing benchmarks like CREPE and CC-Neg rely on simple templated examples that fail to reflect the richness and depth of negation in natural language. This poses a significant challenge for VLMs in precise language-understanding tasks, such as querying complex conditions in medical imaging databases.
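This failure mode can be illustrated with a toy cosine-similarity check: if pre-training has pushed the embedding of a negated caption almost on top of its affirmative counterpart, a CLIP-style retriever scores both nearly identically against any image. The vectors below are hypothetical stand-ins for real text- and image-encoder outputs, not actual CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: a biased encoder maps the negated caption
# almost on top of the affirmative one.
emb_affirmative = [0.80, 0.55, 0.23]   # "a room with windows"
emb_negated     = [0.79, 0.56, 0.22]   # "a room without windows"
emb_image       = [0.75, 0.60, 0.20]   # image of a room with windows

sim_aff = cosine(emb_image, emb_affirmative)
sim_neg = cosine(emb_image, emb_negated)

# Both captions score almost the same against the image, so retrieval
# cannot penalize the contradictory negated caption.
print(abs(sim_aff - sim_neg) < 0.01)  # True: near-identical scores
```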

To address these issues, researchers from MIT, Google DeepMind, and the University of Oxford proposed the NegBench framework to evaluate and improve VLMs' understanding of negation. The framework assesses two fundamental tasks: Retrieval-Neg, which tests the model's ability to retrieve images from affirmative and negated descriptions, and MCQ-Neg, a multiple-choice task that probes fine-grained distinctions between affirmative and negated captions. NegBench also provides large synthetic datasets, CC12M-NegCap and CC12M-NegMCQ, containing millions of captions that cover a rich variety of negation scenarios, to support both training and evaluation.
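Scoring an MCQ-Neg-style task reduces to a simple harness: for each image, the model rates several candidate captions (some negated), and accuracy counts how often the top-scoring caption is the correct one. A minimal sketch, where the score values and items are illustrative, not taken from NegBench:

```python
def mcq_accuracy(items):
    """Fraction of items where the top-scoring caption is correct.

    Each item is (scores, correct_idx): `scores` holds the model's
    image-caption similarity for every candidate caption.
    """
    correct = 0
    for scores, correct_idx in items:
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == correct_idx
    return correct / len(items)

# Illustrative scores for two items; index 1 marks the caption that
# actually matches the image (e.g. "a street with no cars").
items = [
    ([0.31, 0.28, 0.12], 1),  # model picks the affirmative distractor
    ([0.10, 0.42, 0.30], 1),  # model picks the correct negated caption
]
print(mcq_accuracy(items))  # 0.5
```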


By combining real and synthetic datasets, NegBench effectively overcomes the limitations of existing models, improving both performance and generalization. Fine-tuned models show marked gains on retrieval and understanding tasks, including a 10% increase in recall on negated queries. On multiple-choice questions, accuracy improves by up to 40%, demonstrating a much stronger ability to distinguish subtle affirmative and negated captions.
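The reported retrieval gain is a Recall@k-style metric: for each (possibly negated) text query, check whether the ground-truth image appears among the top-k retrieved results. A minimal sketch, with hypothetical ranked result lists rather than real NegBench outputs:

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Recall@k: fraction of queries whose ground-truth item
    appears in the top-k ranked results."""
    hits = sum(gt in results[:k]
               for results, gt in zip(ranked_results, ground_truth))
    return hits / len(ground_truth)

# Hypothetical retrieval runs for three negated queries: each inner
# list is image ids ranked by similarity, best first.
ranked = [
    ["img_7", "img_2", "img_9"],
    ["img_4", "img_1", "img_6"],
    ["img_3", "img_8", "img_5"],
]
truth = ["img_2", "img_6", "img_0"]

# Only the first query's ground truth lands in the top 2.
print(recall_at_k(ranked, truth, k=2))
```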

The introduction of NegBench fills a critical gap in VLMs' understanding of negation, paving the way for the development of more robust artificial intelligence systems, particularly in crucial fields such as medical diagnosis and semantic content retrieval.

Paper: https://arxiv.org/abs/2501.09425

Code: https://github.com/m1k2zoo/negbench

Key Points:  

🌟 Researchers reveal the shortcomings of Vision-Language Models in understanding negation, primarily stemming from biases in training data.  

📈 The NegBench framework significantly enhances model performance in retrieval and understanding tasks by introducing rich negation examples.  

🔍 Fine-tuned models show significant gains in accuracy and recall on negated queries, a step toward more reliable artificial intelligence systems.