Recently, research teams from multiple Chinese institutions released the "Infinity-MM" dataset, one of the largest publicly available multimodal AI datasets to date, and used it to train a strong new small model, Aquila-VL-2B.
The dataset comprises four categories of data: 10 million image captions, 24.4 million general visual instruction samples, 6 million high-quality curated instruction samples, and 3 million entries generated by GPT-4 and other AI models.
To generate the data, the team relied on existing open-source AI models: the RAM++ model first analyzes each image and extracts its key information, which is then used to generate relevant questions and answers. The team also built a dedicated classification system to ensure the quality and diversity of the generated data.
This synthetic data generation method uses a multi-level pipeline that combines the RAM++ and MiniCPM-V models, chaining image recognition, instruction classification, and response generation to produce precise training data for AI systems.
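To make the chain of steps concrete, here is a minimal Python sketch of such a pipeline. Every name in it (the `tagger`, `classifier`, and `vlm` objects, the `quality_filter` callable) is a hypothetical stand-in; the paper does not publish this interface, so treat the sketch as an illustration of the data flow rather than the team's actual code.

```python
from dataclasses import dataclass

@dataclass
class InstructionSample:
    image_path: str
    tags: list[str]      # key information extracted by a RAM++-style tagger
    category: str        # label assigned by the instruction classification system
    question: str
    answer: str

def generate_sample(image_path, tagger, classifier, vlm) -> InstructionSample:
    """One pass of the tag -> question -> classify -> answer chain."""
    tags = tagger.tag(image_path)                  # step 1: extract key image information (RAM++ role)
    question = vlm.ask_question(image_path, tags)  # step 2: generate a question grounded in the tags
    category = classifier.classify(question)       # step 3: label the instruction type
    answer = vlm.answer(image_path, question)      # step 4: generate the response (MiniCPM-V role)
    return InstructionSample(image_path, tags, category, question, answer)

def build_dataset(image_paths, tagger, classifier, vlm, quality_filter):
    """Generate one sample per image, keeping only those that pass the quality filter."""
    samples = (generate_sample(p, tagger, classifier, vlm) for p in image_paths)
    return [s for s in samples if quality_filter(s)]
```

The key design point this illustrates is that generation and filtering are separate stages, so the classification system can enforce quality and diversity independently of the models that produce the raw samples.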
The Aquila-VL-2B model is based on the LLaVA-OneVision architecture, using Qwen-2.5 as the language model and SigLIP for image processing. Training proceeds in four stages of progressively increasing complexity: the model first learns basic image-text associations, then moves on to general visual tasks, then to specific instruction execution, and finally to a mix that incorporates the synthetically generated data. Image resolution is also increased step by step over the course of training.
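The staged curriculum can be written down as a simple configuration, as in the sketch below. Only the four-stage structure, the stage contents, and the rule that resolution grows over time come from the description above; the concrete resolution values, data-mix names, and trainer API are illustrative assumptions.

```python
# Illustrative four-stage curriculum for Aquila-VL-2B-style training.
# Resolutions and data-mix names are assumptions; only the structure
# (four stages, rising complexity and resolution) comes from the article.
STAGES = [
    {"name": "stage 1: image-text alignment",  "data": ["captions"],                           "resolution": 336},
    {"name": "stage 2: general visual tasks",  "data": ["captions", "general_vqa"],            "resolution": 448},
    {"name": "stage 3: instruction execution", "data": ["curated_instructions"],               "resolution": 672},
    {"name": "stage 4: synthetic data mix-in", "data": ["curated_instructions", "synthetic"],  "resolution": 1024},
]

def run_curriculum(model, loaders, trainer):
    """Train through the stages in order, raising input resolution each time."""
    for stage in STAGES:
        trainer.set_image_resolution(stage["resolution"])  # hypothetical trainer API
        for source in stage["data"]:
            trainer.fit(model, loaders[source])            # hypothetical fit call
```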
In evaluations, Aquila-VL-2B, despite having only 2 billion parameters, achieved a best-in-class score of 54.9% on the multimodal MMStar benchmark. It also performed exceptionally well on mathematical tasks, scoring 59% on the MathVista test, far ahead of comparable systems.
On general image-understanding tests, Aquila-VL-2B likewise did well, scoring 43% on HallusionBench and 75.2% on MMBench. The researchers note that the synthetically generated data contributed significantly to these results: without it, the model's average performance drops by 2.4%.
The research team has released both the dataset and the model to the research community; training was carried out primarily on Nvidia A100 GPUs and Chinese-made chips. The launch of Aquila-VL-2B shows open-source models catching up with closed-source systems in AI research, particularly in the use of synthetic training data.
Infinity-MM paper link: https://arxiv.org/abs/2410.18558
Aquila-VL-2B project link: https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen
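For readers who want to try the checkpoint, the sketch below shows one plausible way to run inference with Hugging Face transformers and `trust_remote_code`. This loading path is an assumption rather than the officially documented usage; since the model follows the LLaVA-OneVision architecture, the project's own llava codebase may be required instead, so consult the model card at the link above.

```python
# Minimal inference sketch; the loading path is an assumption, not the
# officially documented usage -- check the model card before relying on it.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BAAI/Aquila-VL-2B-llava-qwen"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # any local test image
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```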
Key points:
🌐 The "Infinity-MM" dataset includes 10 million image captions and 24.4 million visual instruction samples.
💡 The new model Aquila-VL-2B has performed excellently in multiple benchmark tests, setting new records for similar models.
📈 The use of synthetic data significantly enhanced the model's performance, prompting the research team to open the dataset and model to the community.