CLIP (Contrastive Language-Image Pre-training) is one of today's most important multimodal foundation models. It aligns visual and textual signals in a shared feature space by training with a contrastive loss on a large-scale dataset of image-text pairs.

As a retriever, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval. As a feature extractor, it dominates nearly all cross-modal representation tasks, including image understanding, video understanding, and text-to-image and text-to-video generation. CLIP's strength lies in connecting images with natural language and capturing human knowledge, thanks to its training on large-scale web data with detailed textual descriptions.
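
For concreteness, here is a minimal zero-shot classification example using the public OpenAI CLIP checkpoint through the Hugging Face transformers library; the image path and label prompts are placeholders.

```python
# Minimal CLIP zero-shot classification sketch (checkpoint and labels are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```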

However, CLIP struggles with long and complex textual descriptions, a limitation of its original text encoder. To address this, researchers from Microsoft and Tongji University proposed LLM2CLIP, a method that enhances visual representation learning by integrating large language models (LLMs). It replaces the original CLIP text encoder outright, using the rich knowledge of LLMs to improve the performance of CLIP's visual encoder. The researchers found, however, that naively plugging an LLM into CLIP degrades performance, so this challenge had to be solved first.
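
The overall idea can be sketched conceptually as follows, assuming a frozen, caption-fine-tuned LLM supplies text embeddings, a small trainable adapter projects them into the joint space, and the vision encoder is trained with the usual symmetric contrastive loss. The class and argument names are illustrative, not the authors' code.

```python
# Conceptual sketch of the LLM2CLIP idea (not the authors' implementation):
# a frozen LLM replaces CLIP's text encoder, a small trainable adapter maps its
# embeddings into the joint space, and the vision encoder is trained with the
# standard symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPSketch(nn.Module):
    def __init__(self, vision_encoder, llm_text_encoder, llm_dim, embed_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder          # trainable CLIP ViT (placeholder)
        self.llm = llm_text_encoder                   # frozen, caption-fine-tuned LLM (placeholder)
        for p in self.llm.parameters():
            p.requires_grad = False
        self.text_adapter = nn.Sequential(            # small learnable adapter
            nn.Linear(llm_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP

    def forward(self, images, caption_tokens):
        img = F.normalize(self.vision_encoder(images), dim=-1)
        with torch.no_grad():
            txt_feat = self.llm(caption_tokens)        # pooled caption embedding from the LLM
        txt = F.normalize(self.text_adapter(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(len(images), device=images.device)
        # symmetric InfoNCE over image->text and text->image directions
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```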


LLM2CLIP introduces a "caption contrastive fine-tuning" step that markedly improves the LLM's ability to discriminate between image captions in its output embedding space, which in turn yields a notable performance boost.
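
Here is a minimal sketch of what such a caption contrastive objective could look like, assuming each image comes with two caption variants (for example, an original and a rewritten one) and that `encode_captions` stands in for the LLM's pooled sentence embedding; the function name and temperature value are illustrative assumptions.

```python
# Sketch of caption contrastive fine-tuning: embeddings of the same image's two
# captions are pulled together, captions of different images are pushed apart.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(encode_captions, captions_a, captions_b, temperature=0.05):
    za = F.normalize(encode_captions(captions_a), dim=-1)  # [N, D] caption variant A
    zb = F.normalize(encode_captions(captions_b), dim=-1)  # [N, D] caption variant B
    logits = za @ zb.t() / temperature                     # pairwise similarities
    labels = torch.arange(za.size(0), device=za.device)    # matching pairs on the diagonal
    # symmetric InfoNCE: caption A should retrieve its paired caption B and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```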

The researchers ran fine-tuning experiments at several data scales: a small setting (CC-3M), a medium setting (CC-3M and CC-12M), and a large setting (CC-3M, CC-12M, YFCC-15M, and Recaption-1B). The results show that models trained with LLM2CLIP outperform the traditional CLIP and EVA models on image-to-text and text-to-image retrieval tasks.
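
As an illustration of how such retrieval results are typically measured, here is a generic Recall@K sketch over L2-normalized image and text features; it is not the paper's evaluation code.

```python
# Generic Recall@K for image-to-text and text-to-image retrieval, given
# L2-normalized feature matrices where the i-th image pairs with the i-th text.
import torch

def recall_at_k(image_feats, text_feats, k=1):
    sims = image_feats @ text_feats.t()                  # [N_img, N_txt] cosine similarities
    gt = torch.arange(sims.size(0), device=sims.device)  # ground-truth pairing on the diagonal
    topk_i2t = sims.topk(k, dim=1).indices               # image -> text retrieval
    topk_t2i = sims.t().topk(k, dim=1).indices           # text -> image retrieval
    r_i2t = (topk_i2t == gt[:, None]).any(dim=1).float().mean().item()
    r_t2i = (topk_t2i == gt[:, None]).any(dim=1).float().mean().item()
    return r_i2t, r_t2i
```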


When combined with models such as LLaVA 1.5 for multimodal training, LLM2CLIP excels on nearly all benchmarks, particularly on long- and short-text retrieval, improving on the previous state-of-the-art EVA02 model by 16.5%. This approach not only turns a CLIP model trained solely on English data into a state-of-the-art cross-lingual model, but also lays the groundwork for future research on CLIP training.

Model: https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c

Code: https://github.com/microsoft/LLM2CLIP/

Paper: https://arxiv.org/abs/2411.04997

Key Points:

🌟 LLM2CLIP is a method from Microsoft and Tongji University that enhances CLIP's visual encoder by replacing its text encoder with an LLM.

📈 Through "caption contrastive fine-tuning," the method significantly strengthens image-text matching, surpassing existing state-of-the-art models.

🌐 Experiments on multiple datasets show that LLM2CLIP outperforms traditional models on long- and short-text retrieval tasks, advancing the development of cross-lingual models.