DeepGlint has open-sourced the RWKV-CLIP model, a vision-language representation learner that combines the strengths of the Transformer and RNN architectures. The model boosts performance on vision-language tasks by pre-training on large-scale image-text pairs collected from the web.
To address noisy web data and improve data quality, the research team introduced a diverse description generation framework that leverages large language models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection labels.
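The sketch below illustrates the general idea of fusing the three noisy sources with an LLM. It assumes a generic `llm_generate(prompt) -> str` callable and an illustrative prompt; the team's actual prompts and LLM choice are not reproduced here.

```python
def build_prompt(raw_text: str, synthetic_caption: str, detection_labels: list[str]) -> str:
    """Combine the three noisy sources into a single instruction for the LLM."""
    labels = ", ".join(detection_labels)
    return (
        "Fuse the following information about one image into a single, "
        "accurate, fluent description. Drop details that conflict with the "
        "detected objects.\n"
        f"Web alt-text: {raw_text}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        f"Detected objects: {labels}\n"
        "Description:"
    )

def generate_description(sample: dict, llm_generate) -> str:
    """Produce one refined description for an image-text sample."""
    prompt = build_prompt(
        sample["raw_text"], sample["synthetic_caption"], sample["detection_labels"]
    )
    return llm_generate(prompt).strip()  # llm_generate is a hypothetical LLM wrapper
```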
The RWKV-CLIP model employs a dual-tower architecture that combines the parallelizable training of Transformers with the efficient inference of RNNs. Each tower consists of stacked spatial mixing and channel mixing modules that process the input images and texts in depth. In the spatial mixing stage, the model uses an attention-like mechanism to compute global interactions across tokens with linear complexity; the channel mixing stage then further refines the feature representation along the channel dimension. To improve input robustness, RWKV-CLIP randomly selects the original text, the synthetic caption, or the generated description as the text input for each sample.
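A minimal PyTorch-style sketch of these two ideas follows: a block built from a spatial (token) mixing step and a channel mixing step, plus the random choice among the three text sources. Module and parameter names are illustrative only; the spatial mixing here is a placeholder, not the released RWKV implementation.

```python
import random
import torch
import torch.nn as nn

class RWKVStyleBlock(nn.Module):
    """Illustrative block: spatial mixing followed by channel mixing."""

    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # Placeholder for the linear-complexity global token interaction
        # (the real model uses an RWKV-style recurrence, not shown here).
        self.spatial_mix = nn.Linear(dim, dim)
        # Channel mixing: feed-forward refinement along the channel dimension.
        self.channel_mix = nn.Sequential(
            nn.Linear(dim, dim * hidden_mult),
            nn.GELU(),
            nn.Linear(dim * hidden_mult, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial_mix(self.ln1(x))   # global interaction across tokens
        x = x + self.channel_mix(self.ln2(x))   # per-token channel refinement
        return x

def pick_text(raw_text: str, synthetic_caption: str, generated_description: str) -> str:
    """Text-input augmentation: sample one of the three text sources per example."""
    return random.choice([raw_text, synthetic_caption, generated_description])
```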
Experimental results show that RWKV-CLIP achieves state-of-the-art performance on multiple downstream tasks, including linear probing, zero-shot classification, and zero-shot image-text retrieval, with significant gains over baseline models.
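For context, the sketch below shows the standard zero-shot classification protocol used in such evaluations. It assumes a dual-tower model exposing `encode_image` and `encode_text` methods and a matching `tokenizer`; this is a hypothetical interface, so check the released code for the actual loading and preprocessing API.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    """Return the predicted class index for each image in the batch."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_feat = model.encode_text(tokenizer(prompts))   # (num_classes, dim)
    image_feat = model.encode_image(images)              # (batch, dim)
    # Normalize so the dot product is a cosine similarity.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    logits = image_feat @ text_feat.T
    return logits.argmax(dim=-1)
```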
Cross-modal analysis of RWKV-CLIP shows that the learned representations are more clearly discriminable within each modality and that paired image and text embeddings lie closer together in the shared space, indicating stronger cross-modal alignment.
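One simple way to probe this alignment property is to measure the average cosine similarity between matched image and text embeddings, as in the sketch below (an illustrative metric, not the paper's exact analysis).

```python
import torch

def mean_pair_similarity(image_feat: torch.Tensor, text_feat: torch.Tensor) -> float:
    """Average cosine similarity between each image and its paired text embedding."""
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat * text_feat).sum(dim=-1).mean().item()
```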
Model URL: https://wisemodel.cn/models/deepglint/RWKV-CLIP