CLIP (Contrastive Language-Image Pre-training) is one of today's most important multimodal foundation models. It aligns visual and textual signals in a shared feature space by training with a contrastive loss on a large-scale dataset of image-text pairs.

As a retriever, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval. As a feature extractor, it dominates nearly all cross-modal representation tasks, including image understanding, video understanding, and text-to-image and text-to-video generation. CLIP's strength lies in connecting images with natural language and capturing human knowledge, thanks to its training on large-scale web data with detailed textual descriptions.
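
For concreteness, here is a minimal zero-shot classification example using the public OpenAI CLIP checkpoint through the Hugging Face transformers library; the image path and label prompts are placeholders.

```python
# Minimal CLIP zero-shot classification sketch (checkpoint and labels are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```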

However, CLIP struggles with long and complex textual descriptions, a limitation of its original text encoder. To address this, researchers from Microsoft and Tongji University proposed LLM2CLIP, a method that enhances visual representation learning by integrating large language models (LLMs). It replaces the original CLIP text encoder outright, using the rich knowledge of LLMs to improve the performance of CLIP's visual encoder. The researchers found, however, that naively plugging an LLM into CLIP degrades performance, so this challenge had to be solved first.
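
The overall idea can be sketched conceptually as follows, assuming a frozen, caption-fine-tuned LLM supplies text embeddings, a small trainable adapter projects them into the joint space, and the vision encoder is trained with the usual symmetric contrastive loss. The class and argument names are illustrative, not the authors' code.

```python
# Conceptual sketch of the LLM2CLIP idea (not the authors' implementation):
# a frozen LLM replaces CLIP's text encoder, a small trainable adapter maps its
# embeddings into the joint space, and the vision encoder is trained with the
# standard symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPSketch(nn.Module):
    def __init__(self, vision_encoder, llm_text_encoder, llm_dim, embed_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder          # trainable CLIP ViT (placeholder)
        self.llm = llm_text_encoder                   # frozen, caption-fine-tuned LLM (placeholder)
        for p in self.llm.parameters():
            p.requires_grad = False
        self.text_adapter = nn.Sequential(            # small learnable adapter
            nn.Linear(llm_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP

    def forward(self, images, caption_tokens):
        img = F.normalize(self.vision_encoder(images), dim=-1)
        with torch.no_grad():
            txt_feat = self.llm(caption_tokens)        # pooled caption embedding from the LLM
        txt = F.normalize(self.text_adapter(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(len(images), device=images.device)
        # symmetric InfoNCE over image->text and text->image directions
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```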


LLM2CLIP introduces a "caption contrastive fine-tuning" step that markedly improves the LLM's ability to discriminate between image captions in its output embedding space, which in turn yields a notable performance boost.
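
Here is a minimal sketch of what such a caption contrastive objective could look like, assuming each image comes with two caption variants (for example, an original and a rewritten one) and that `encode_captions` stands in for the LLM's pooled sentence embedding; the function name and temperature value are illustrative assumptions.

```python
# Sketch of caption contrastive fine-tuning: embeddings of the same image's two
# captions are pulled together, captions of different images are pushed apart.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(encode_captions, captions_a, captions_b, temperature=0.05):
    za = F.normalize(encode_captions(captions_a), dim=-1)  # [N, D] caption variant A
    zb = F.normalize(encode_captions(captions_b), dim=-1)  # [N, D] caption variant B
    logits = za @ zb.t() / temperature                     # pairwise similarities
    labels = torch.arange(za.size(0), device=za.device)    # matching pairs on the diagonal
    # symmetric InfoNCE: caption A should retrieve its paired caption B and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```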

The researchers ran fine-tuning experiments at several data scales: a small setting (CC-3M), a medium setting (CC-3M and CC-12M), and a large setting (CC-3M, CC-12M, YFCC-15M, and Recaption-1B). The results show that models trained with LLM2CLIP outperform the traditional CLIP and EVA models on image-to-text and text-to-image retrieval tasks.
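
As an illustration of how such retrieval results are typically measured, here is a generic Recall@K sketch over L2-normalized image and text features; it is not the paper's evaluation code.

```python
# Generic Recall@K for image-to-text and text-to-image retrieval, given
# L2-normalized feature matrices where the i-th image pairs with the i-th text.
import torch

def recall_at_k(image_feats, text_feats, k=1):
    sims = image_feats @ text_feats.t()                  # [N_img, N_txt] cosine similarities
    gt = torch.arange(sims.size(0), device=sims.device)  # ground-truth pairing on the diagonal
    topk_i2t = sims.topk(k, dim=1).indices               # image -> text retrieval
    topk_t2i = sims.t().topk(k, dim=1).indices           # text -> image retrieval
    r_i2t = (topk_i2t == gt[:, None]).any(dim=1).float().mean().item()
    r_t2i = (topk_t2i == gt[:, None]).any(dim=1).float().mean().item()
    return r_i2t, r_t2i
```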


When combined with models such as LLaVA 1.5 for multimodal training, LLM2CLIP excels on nearly all benchmarks, particularly on long- and short-text retrieval, improving on the previous state-of-the-art EVA02 model by 16.5%. This approach not only turns a CLIP model trained solely on English data into a state-of-the-art cross-lingual model, but also lays the groundwork for future research on CLIP training.

Model: https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c

Code: https://github.com/microsoft/LLM2CLIP/

Paper: https://arxiv.org/abs/2411.04997

Key Points:

🌟 LLM2CLIP is a method from Microsoft and Tongji University that enhances CLIP's visual encoder by replacing its text encoder with an LLM.

📈 Through "caption contrastive fine-tuning," the method significantly strengthens image-text matching, surpassing existing state-of-the-art models.

🌐 Experiments on multiple datasets show that LLM2CLIP outperforms traditional models on long- and short-text retrieval tasks, advancing the development of cross-lingual models.