Recently, research teams from several Chinese institutions released Infinity-MM, a massive multimodal dataset, and used it to train Aquila-VL-2B, a compact vision-language model with strong benchmark results. The release adds new momentum to the development of multimodal AI.
The Infinity-MM dataset is remarkably large, spanning four major categories: 10 million image descriptions, 24.4 million general visual instruction samples, 6 million high-quality selected instruction samples, and 3 million samples generated by AI models such as GPT-4. The research team used the open-source model RAM++ for image analysis and information extraction, and relied on a six-category classification system to ensure the quality and diversity of the generated data.
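For orientation, the reported composition can be summarized in a few lines of Python. The category names and helper below are illustrative placeholders, not part of the released dataset's tooling.

```python
# Hypothetical summary of the reported Infinity-MM composition; the keys and
# the helper function are illustrative, not the dataset's actual API.
INFINITY_MM_CATEGORIES = {
    "image_captions":              10_000_000,  # image descriptions
    "general_visual_instructions": 24_400_000,  # general visual instruction data
    "selected_instructions":        6_000_000,  # high-quality selected instruction data
    "synthetic_instructions":       3_000_000,  # generated by models such as GPT-4
}

def total_samples(categories: dict[str, int]) -> int:
    """Sum the reported sample counts across categories."""
    return sum(categories.values())

if __name__ == "__main__":
    print(f"Total samples: {total_samples(INFINITY_MM_CATEGORIES):,}")  # 43,400,000
```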
In terms of architecture, Aquila-VL-2B is built on the LLaVA-OneVision framework, combining the Qwen-2.5 language model with the SigLIP vision encoder. The research team adopted a four-stage progressive training method: the model starts with basic image-text alignment, moves on to general visual tasks and then to specific instruction handling, and finally incorporates synthetic data, with the upper limit on image resolution raised at each stage.
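The sketch below illustrates what such a staged curriculum can look like in code. The stage names, data mixes, and resolution caps are hypothetical placeholders chosen for illustration; the actual values are defined by the research team, not shown here.

```python
# Illustrative sketch of a four-stage progressive training schedule of the kind
# described above. All concrete values (mixes, resolutions) are assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_mix: list[str]        # which data sources are enabled in this stage
    max_image_resolution: int  # upper limit on image resolution, in pixels

CURRICULUM = [
    Stage("image-text alignment",   ["image_captions"],                                        384),
    Stage("general visual tasks",   ["image_captions", "general_visual_instructions"],         512),
    Stage("instruction following",  ["general_visual_instructions", "selected_instructions"],  768),
    Stage("synthetic augmentation", ["selected_instructions", "synthetic_instructions"],      1024),
]

def run_training(curriculum: list[Stage]) -> None:
    for stage in curriculum:
        # A real pipeline would rebuild the dataloader with this stage's data mix
        # and resolution cap, then continue training from the previous checkpoint.
        print(f"{stage.name}: mix={stage.data_mix}, max_res={stage.max_image_resolution}")

if __name__ == "__main__":
    run_training(CURRICULUM)
```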
Despite having only 2 billion parameters, Aquila-VL-2B performed exceptionally well across benchmarks. It achieved a leading score of 54.9% on the multimodal understanding benchmark MMStar and an impressive 59% on the mathematical reasoning benchmark MathVista, significantly outperforming comparable systems. In general image understanding, it scored 43% on HallusionBench and 75.2% on MMBench.
The study found that synthetic data contributed significantly to the model's performance: in ablation experiments, removing it lowered average performance by 2.4%. From the third training stage onward, Aquila-VL-2B clearly outperformed reference models such as InternVL2-2B and Qwen2VL-2B, and in the fourth stage the gains grew further as the data volume increased.
Notably, the research team has released both the dataset and the model to the research community, which should accelerate progress in multimodal AI. The model was trained on Nvidia A100 GPUs and also supports China's domestically developed chips, demonstrating broad hardware compatibility.