The Beijing Academy of Artificial Intelligence (BAAI) has introduced Infinity-Instruct, a large-scale instruction tuning dataset designed to enhance the performance of language models in conversational tasks. Infinity-Instruct has recently gone through a new round of iteration, which adds two components: the Infinity-Instruct-7M foundational instruction dataset and the Infinity-Instruct-Gen conversational instruction dataset.

The Infinity-Instruct-7M foundational instruction dataset contains over 7.44 million entries covering domains such as mathematics, coding, and commonsense question answering, and aims to improve the foundational capabilities of pre-trained models. Test results indicate that models fine-tuned on this dataset, such as Llama3.1-70B and Mistral-7B-v0.1, approach the capabilities of the official conversational models released by their respective developers. Notably, the fine-tuned Mistral-7B even surpasses GPT-3.5, and the fine-tuned Llama3.1-70B is on par with GPT-4.


The Infinity-Instruct-Gen conversational instruction dataset includes 1.49 million synthetic complex instructions, intended to make models more robust in real-world conversational scenarios. After further fine-tuning on this dataset, model performance can exceed that of the official conversational models.

BAAI has evaluated Infinity-Instruct on mainstream benchmarks such as MT-Bench, AlpacaEval 2.0, and Arena-Hard. The results show that models fine-tuned on Infinity-Instruct surpass the official models in conversational ability.

Infinity-Instruct provides detailed annotations for each instruction entry, including language, capability type, task type, and data source, making it easy for users to filter subsets according to their needs. Through data selection and instruction synthesis, BAAI has built a high-quality dataset that aims to close the gap between open-source conversational models and GPT-4.
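
To illustrate what annotation-based filtering might look like, here is a minimal Python sketch. The field names ("language", "ability", "task", "source") and the sample records are assumptions based on the annotation types described above, not the dataset's exact schema.

```python
# Hypothetical records mimicking the annotation types described above;
# the actual dataset's field names and values may differ.
records = [
    {"instruction": "Solve 2x + 3 = 7.", "language": "en",
     "ability": "math", "task": "equation solving", "source": "synthetic"},
    {"instruction": "Write a quicksort implementation.", "language": "en",
     "ability": "code", "task": "algorithm implementation", "source": "open"},
]

def select(records, **criteria):
    """Return the records whose annotations match every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]

# Pull out the English math subset only.
math_en = select(records, language="en", ability="math")
print(math_en)
```
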

The project also employs the FlagScale training framework to reduce fine-tuning costs, and uses MinHash deduplication together with BGE retrieval to eliminate duplicate samples. BAAI plans to open-source the full pipeline code for data processing and model training, and to explore extending the Infinity-Instruct data strategy to the alignment and pre-training stages, supporting the full lifecycle data needs of language models.
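
For a sense of how MinHash deduplication works, here is a minimal sketch using the datasketch library. The permutation count, similarity threshold, and whitespace tokenization are illustrative assumptions, not BAAI's actual pipeline settings.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # assumed number of hash permutations

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from whitespace tokens of an instruction."""
    m = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

def deduplicate(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a sample only if no previously kept sample has an estimated
    Jaccard similarity above the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for i, text in enumerate(samples):
        sig = minhash_of(text)
        if not lsh.query(sig):       # no near-duplicate seen so far
            lsh.insert(f"s{i}", sig)
            kept.append(text)
    return kept

data = [
    "Explain the difference between a list and a tuple in Python.",
    "Explain the difference between a list and a tuple in Python!",
    "Prove that the square root of 2 is irrational.",
]
print(deduplicate(data))  # the near-duplicate second entry is dropped
```

The LSH index makes each lookup approximately constant-time, which is what makes this approach practical at the scale of millions of instruction entries; a pairwise comparison would be quadratic in the dataset size.
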

Dataset Link:

https://modelscope.cn/datasets/BAAI/Infinity-Instruct