On March 6, 2025, the Beijing Academy of Artificial Intelligence (BAAI) announced the open-sourcing of its multi-modal embedding model BGE-VL, a significant step forward for multi-modal retrieval.
BGE-VL achieves state-of-the-art results across a range of multi-modal retrieval tasks, including image-text retrieval and composed image retrieval.
BGE-VL was trained on MegaPairs, a large-scale synthetic dataset that mines multi-modal triplets from a massive image-text corpus by combining multi-modal representation models, multi-modal large models, and large language models. The approach scales well, generating diverse, high-quality data continuously and at low cost: compared to traditional manually annotated data, MegaPairs achieves better training results with only 1/70th of the data volume.
Technically, MegaPairs is constructed in two steps: first, diverse image pairs are mined from an image dataset using multiple similarity models; second, open-domain retrieval instructions for those pairs are synthesized with open-source multi-modal large models and large language models. This lets MegaPairs generate large-scale, high-quality, and diverse multi-modal retrieval instruction data without human annotation. The released version contains 26 million samples, providing rich data support for training multi-modal retrieval models.
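The announcement does not include the construction code, but the two-step recipe can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names (`mine_image_pairs`, `synthesize_instruction`), the use of precomputed CLIP-style embeddings, and the placeholder instruction text all stand in for the BGE team's actual pipeline.

```python
# Minimal sketch of a MegaPairs-style two-step construction pipeline.
# All names (mine_image_pairs, synthesize_instruction) and the placeholder
# MLLM/LLM step are hypothetical illustrations, not the released code.
import numpy as np

def mine_image_pairs(image_embs: np.ndarray, top_k: int = 2, threshold: float = 0.0):
    """Step 1: mine diverse (query, target) image pairs with a similarity model.

    image_embs: (N, D) matrix of L2-normalized image embeddings from any
    multi-modal representation model (e.g., a CLIP-style encoder).
    The threshold should be tuned per encoder; 0.0 is only a demo default.
    """
    sims = image_embs @ image_embs.T          # cosine similarity (rows are normalized)
    np.fill_diagonal(sims, -1.0)              # exclude self-matches
    pairs = []
    for i in range(sims.shape[0]):
        for j in np.argsort(-sims[i])[:top_k]:
            if sims[i, j] >= threshold:
                pairs.append((i, int(j)))
    return pairs

def synthesize_instruction(query_image_id: int, target_image_id: int) -> str:
    """Step 2 (stub): in the real pipeline, an open-source multi-modal large
    model describes the relation between the two images and an LLM rewrites it
    into an open-domain retrieval instruction. Replaced by a placeholder here."""
    return f"find an image like #{query_image_id} but matching the content of #{target_image_id}"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(100, 64))
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)

    triplets = [
        {"query_image": q, "instruction": synthesize_instruction(q, t), "target_image": t}
        for q, t in mine_image_pairs(embs)
    ]
    print(f"{len(triplets)} synthetic (query image, instruction, target image) triplets")
```

Because both steps run on model outputs rather than human labels, the same loop can keep producing new triplets as more image-text data becomes available, which is the scalability property the announcement emphasizes.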
Using MegaPairs, the BAAI BGE team trained multi-modal retrieval models at three sizes: BGE-VL-Base, BGE-VL-Large, and BGE-VL-MLLM. Across multiple tasks they substantially outperform previous methods. On the 36 multi-modal embedding tasks of the Massive Multimodal Embedding Benchmark (MMEB), BGE-VL achieved the best results in both the zero-shot and the supervised fine-tuning settings, demonstrating strong task generalization.
On composed image retrieval, BGE-VL sets a new state of the art on the CIRCO evaluation set, clearly surpassing baselines such as Google's MagicLens series and NVIDIA's MM-Embed. BGE-VL-MLLM improves on the previous best model by 8.1 percentage points, while BGE-VL-Base outperforms multi-modal retrievers built on large-model backbones with fewer than 1/50th of their parameters.
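In composed image retrieval, the query is a reference image plus a textual modification instruction, and the system must return the matching target image. The announcement does not show BGE-VL's interface, so the sketch below only assumes precomputed, L2-normalized embeddings; `encode_composed_query` is a hypothetical placeholder for whatever fusion the model performs, not BGE-VL's actual API.

```python
# Minimal sketch of composed image retrieval scoring over precomputed embeddings.
# encode_composed_query is a hypothetical stand-in for the retriever's fusion of
# a reference image with a text instruction; it is not BGE-VL's actual API.
import numpy as np

def encode_composed_query(ref_image_emb: np.ndarray, instruction_emb: np.ndarray) -> np.ndarray:
    """Placeholder fusion: average the reference-image and instruction embeddings.
    A real retriever such as BGE-VL produces this query embedding with a single
    multi-modal encoder instead."""
    q = ref_image_emb + instruction_emb
    return q / np.linalg.norm(q)

def retrieve(query_emb: np.ndarray, candidate_embs: np.ndarray, top_k: int = 5):
    """Rank candidate images by cosine similarity to the composed query."""
    scores = candidate_embs @ query_emb          # candidates are L2-normalized
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_candidates = 64, 1000

    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    ref_image = normalize(rng.normal(size=dim))     # embedding of the reference image
    instruction = normalize(rng.normal(size=dim))   # embedding of the text instruction
    candidates = normalize(rng.normal(size=(n_candidates, dim)))

    query = encode_composed_query(ref_image, instruction)
    for idx, score in retrieve(query, candidates):
        print(f"candidate {idx}: similarity {score:.3f}")
```

Benchmarks such as CIRCO evaluate exactly this ranking step: the quality of the retriever comes down to how well the composed query embedding places the true target image at the top of the candidate list.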
Furthermore, the study shows that MegaPairs scales well and is data-efficient: BGE-VL's performance grows consistently as the data scale increases. Compared with MagicLens, the prior state-of-the-art model trained on 37M closed-source samples, MegaPairs delivers a clear performance advantage with only 1/70th of the data (0.5M samples).
Project Homepage:
https://github.com/VectorSpaceLab/MegaPairs
Model Address: