ByteDance recently released Valley2, a multimodal large language model designed for e-commerce and short-video scenarios. Through a scalable vision-language architecture, it aims to improve performance across domains and extend the application boundaries of e-commerce and short-video use cases. Valley2 uses Qwen2.5 as its LLM backbone, paired with the SigLIP-384 vision encoder, and combines MLP layers with convolution for efficient feature transformation. Its key innovations are a large vision vocabulary, a convolutional adapter (ConvAdapter), and the Eagle module, which together improve flexibility in handling diverse real-world inputs and raise training and inference efficiency.
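As a rough illustration of the adapter idea, the sketch below shows a SigLIP-style encoder's patch tokens being downsampled by a convolution and projected into the LLM embedding space with an MLP. The dimensions (1152 for SigLIP-SO400M, 3584 for Qwen2.5-7B) and the kernel/stride choice are assumptions for illustration, not Valley2's exact configuration.

```python
import torch
import torch.nn as nn


class ConvAdapter(nn.Module):
    """Illustrative conv-based adapter, not Valley2's published implementation."""

    def __init__(self, vision_dim: int, llm_dim: int, stride: int = 2):
        super().__init__()
        # Convolution over the 2D patch grid reduces the token count by roughly stride^2.
        self.conv = nn.Conv2d(vision_dim, vision_dim, kernel_size=stride, stride=stride)
        # Two-layer MLP maps the pooled vision features into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim), assumed to form a square grid.
        b, n, d = vision_tokens.shape
        side = int(n ** 0.5)
        x = vision_tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.conv(x)                    # (batch, vision_dim, side // stride, side // stride)
        x = x.flatten(2).transpose(1, 2)    # back to (batch, fewer_tokens, vision_dim)
        return self.proj(x)                 # (batch, fewer_tokens, llm_dim)


if __name__ == "__main__":
    adapter = ConvAdapter(vision_dim=1152, llm_dim=3584)
    patch_tokens = torch.randn(1, 729, 1152)   # 27x27 grid from a 384px SigLIP encoder
    print(adapter(patch_tokens).shape)          # e.g. torch.Size([1, 169, 3584])
```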


Valley2's training data consists of OneVision-style data, data tailored to the e-commerce and short-video domains, and chain-of-thought (CoT) data for complex reasoning. Training proceeds in four stages: text-vision alignment, high-quality knowledge learning, instruction fine-tuning, and CoT post-training. In experiments, Valley2 performed strongly on multiple public benchmarks, scoring especially high on MMBench, MMStar, and MathVista, and it also surpassed models of similar scale on the Ecom-VQA benchmark.
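The following is a hypothetical sketch of that staged schedule. The stage names follow the text above; the data mixes and which modules are unfrozen at each stage are assumptions in the spirit of common LLaVA-style recipes, not Valley2's published hyperparameters.

```python
# Hypothetical staged training schedule; stage names from the article,
# data mixes and trainable modules are illustrative assumptions.
TRAINING_STAGES = [
    {"name": "text-vision alignment",
     "data": ["image-caption pairs"],
     "trainable": ["adapter"]},                      # assumption: align the projector first
    {"name": "high-quality knowledge learning",
     "data": ["OneVision-style data"],
     "trainable": ["adapter", "llm"]},
    {"name": "instruction fine-tuning",
     "data": ["OneVision-style data", "e-commerce / short-video data"],
     "trainable": ["adapter", "llm"]},
    {"name": "chain-of-thought post-training",
     "data": ["CoT data for complex reasoning"],
     "trainable": ["adapter", "llm"]},
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: data={stage['data']}, trainable={stage['trainable']}")
```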

Looking ahead, the Valley team plans to release a versatile model covering text, image, video, and audio modalities, and to introduce a Valley-based multimodal embedding training method to support downstream retrieval and detection applications.

The launch of Valley2 marks a significant advancement in the field of multimodal large language models, demonstrating the potential to enhance model performance through structural improvements, dataset construction, and optimization of training strategies.

Model link:

https://www.modelscope.cn/models/bytedance-research/Valley-Eagle-7B
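
For a quick start, the released checkpoint can be fetched locally with the ModelScope SDK. The snippet below only downloads the weights; loading and inference should follow the instructions in the code repository linked below.

```python
# Download the Valley-Eagle-7B weights from ModelScope (pip install modelscope).
from modelscope import snapshot_download

local_dir = snapshot_download("bytedance-research/Valley-Eagle-7B")
print("Model downloaded to:", local_dir)
```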

Code link:

https://github.com/bytedance/Valley

Paper link:

https://arxiv.org/abs/2501.05901