Aquila-VL-2B-llava-qwen
A visual-language model that intelligently processes both image and text information.
Tags: Common Product, Image, Visual Language Model, Multimodal
Aquila-VL-2B is a vision-language model (VLM) built on the LLaVA-OneVision framework, using Qwen2.5-1.5B-instruct as the language model (LLM) and siglip-so400m-patch14-384 as the vision tower. It was trained on the self-constructed Infinity-MM dataset of roughly 40 million image-text pairs, which combines open-source data collected from the internet with synthetic instruction data generated by open-source VLMs. The model is released as open source to advance multimodal performance, particularly the integrated processing of images and text.
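A minimal usage sketch follows, assuming the checkpoint is published under the Hugging Face repo id BAAI/Aquila-VL-2B-llava-qwen and is compatible with the LLaVA-OneVision integration in the transformers library; if the weights are distributed only in the original LLaVA format, the official LLaVA-OneVision toolkit would be the loader instead. The image path is a placeholder.

```python
# Sketch: single-turn image + text query against Aquila-VL-2B via the
# transformers LLaVA-OneVision classes. Repo id and loader class are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "BAAI/Aquila-VL-2B-llava-qwen"  # assumed Hugging Face repo id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a conversation containing one image slot and one text question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder local image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

# Generate and decode the model's answer.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```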
Aquila-VL-2B-llava-qwen Visit Over Time
Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32