Aquila-VL-2B-llava-qwen

A vision-language model that processes both image and text information.

Common Product · Image · Visual Language Model · Multimodal
Aquila-VL-2B is a vision-language model (VLM) trained with the LLaVA-OneVision framework, using Qwen2.5-1.5B-Instruct as the language model (LLM) and siglip-so400m-patch14-384 as the vision tower. It was trained on the self-constructed Infinity-MM dataset of approximately 40 million image-text pairs, which combines open-source data collected from the internet with synthetic instruction data generated by open-source VLMs. The model is released as open source with the aim of advancing multimodal performance, especially the integrated processing of images and text.
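
Because the model follows the LLaVA-OneVision architecture (a Qwen2.5-1.5B-Instruct language model paired with a SigLIP vision tower), a typical query combines one image with a text instruction. The sketch below illustrates such a call, assuming a transformers-compatible checkpoint; the repository id "BAAI/Aquila-VL-2B-llava-qwen", the choice of the LLaVA-OneVision classes, and the sample image path are assumptions, not details taken from this page.

```python
# Minimal inference sketch. Assumptions not confirmed by this page:
# the Hugging Face repo id, the use of the transformers LLaVA-OneVision
# classes, and the sample image path.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "BAAI/Aquila-VL-2B-llava-qwen"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Pair one image with a text instruction in the chat format the processor expects.
image = Image.open("example.jpg")  # replace with your own image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```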

Aquila-VL-2B-llava-qwen Visit Over Time

Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32
