DeepGlint has open-sourced RWKV-CLIP, a vision-language representation learner that combines the strengths of Transformer and RNN architectures. By pre-training on image-text pairs collected from the web, the model achieves significantly improved performance on vision and language tasks.

To address noisy web data and improve data quality, the research team introduced a diverse description generation framework that leverages large language models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection labels.
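
The sketch below illustrates the general idea of this kind of description generation: the three noisy sources are packed into a single instruction for an LLM, which returns one refined caption. The prompt wording, the `build_prompt` helper, and the `llm` callable are illustrative assumptions rather than the authors' actual pipeline.

```python
# Minimal sketch: merge a noisy web caption, a synthetic caption, and
# detection labels into one LLM prompt, then ask the LLM for a single
# refined description. The prompt and the `llm` interface are assumptions.
from typing import Callable, List


def build_prompt(web_text: str, synthetic_caption: str, labels: List[str]) -> str:
    """Combine the three noisy sources into one instruction for the LLM."""
    return (
        "Combine the information below into one accurate, fluent image description. "
        "Drop anything irrelevant to the visual content.\n"
        f"Web text: {web_text}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        f"Detected objects: {', '.join(labels)}"
    )


def generate_description(
    web_text: str,
    synthetic_caption: str,
    labels: List[str],
    llm: Callable[[str], str],  # any text-in / text-out LLM interface
) -> str:
    return llm(build_prompt(web_text, synthetic_caption, labels))
```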

RWKV-CLIP employs a dual-tower architecture that combines the efficient parallel training of Transformers with the efficient inference of RNNs. Each tower consists of stacked spatial mixing and channel mixing modules that process the input images and texts in depth. In the spatial mixing stage, the model performs attention-like global interaction across tokens at linear computational complexity; the channel mixing stage then refines the feature representation along the channel dimension. To improve robustness to noisy input, RWKV-CLIP randomly selects the original text, the synthetic caption, or the generated description as the text input during training (see the sketch below).
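
The following PyTorch sketch shows the block structure schematically: a spatial-mixing step (global token interaction at linear cost) followed by a channel-mixing step (per-token feature refinement), each with a residual connection, plus the random choice among the three text sources. The module internals are simplified stand-ins, not the exact RWKV-CLIP layers.

```python
# Schematic sketch of a spatial-mixing + channel-mixing block and the random
# text-source selection. Internals are simplified stand-ins for illustration.
import random
import torch
import torch.nn as nn


class SpatialMix(nn.Module):
    """Stand-in for token mixing: global interaction across positions."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.receptance = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, C)
        h = self.norm(x)
        # Linear-cost global mixing: a cumulative average over positions gates
        # the value path (a simplification of the real RWKV recurrence).
        ctx = torch.cumsum(h, dim=1) / torch.arange(
            1, h.size(1) + 1, device=h.device
        ).view(1, -1, 1)
        return x + torch.sigmoid(self.receptance(h)) * self.value(ctx)


class ChannelMix(nn.Module):
    """Stand-in for channel mixing: per-token refinement along the channel dim."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))


class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.spatial = SpatialMix(dim)
        self.channel = ChannelMix(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.channel(self.spatial(x))


def pick_text(original: str, synthetic: str, generated: str) -> str:
    """Randomly choose one of the three text variants as the text-tower input."""
    return random.choice([original, synthetic, generated])
```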


Experimental results demonstrate that RWKV-CLIP achieves state-of-the-art performance on multiple downstream tasks, including linear probing, zero-shot classification, and zero-shot image-text retrieval. Compared to baseline models, RWKV-CLIP shows a significant performance improvement.
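
For reference, the zero-shot classification protocol used in this kind of evaluation can be sketched as follows: class names are wrapped in a prompt, both towers produce L2-normalized embeddings, and each image is assigned the class whose text embedding is most similar. The `model.encode_image` / `model.encode_text` calls are assumed interfaces, not the released model's exact API.

```python
# Hedged sketch of CLIP-style zero-shot classification.
import torch


@torch.no_grad()
def zero_shot_classify(model, images: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = model.encode_text(prompts)       # (num_classes, dim), assumed API
    image_emb = model.encode_image(images)      # (batch, dim), assumed API
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    logits = image_emb @ text_emb.t()           # cosine similarity per class
    return logits.argmax(dim=-1)                # predicted class index per image
```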

Cross-modal analysis of RWKV-CLIP shows that the learned representations are more clearly separable within each modality and that matched image-text pairs lie closer together in the shared embedding space, indicating stronger cross-modal alignment.
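
One simple way to probe this kind of alignment is to compare the cosine similarity of matched image-text pairs against the average similarity of mismatched pairs, given pre-computed L2-normalized embeddings. This is an illustrative diagnostic, not the paper's exact analysis procedure.

```python
# Illustrative alignment diagnostic over matched image/text embeddings.
import torch


def alignment_stats(image_emb: torch.Tensor, text_emb: torch.Tensor) -> dict:
    """image_emb, text_emb: (N, dim) L2-normalized embeddings of matched pairs."""
    sim = image_emb @ text_emb.t()              # (N, N) cosine similarities
    matched = sim.diag().mean()                 # matched image-text pairs
    mismatched = (sim.sum() - sim.diag().sum()) / (sim.numel() - sim.size(0))
    return {
        "matched_pair_similarity": matched.item(),
        "mismatched_pair_similarity": mismatched.item(),
        "gap": (matched - mismatched).item(),   # larger gap suggests better alignment
    }
```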

Model URL: https://wisemodel.cn/models/deepglint/RWKV-CLIP