ByteDance recently announced QuaDMix, a novel data selection framework designed to enhance the efficiency and generalization capabilities of Large Language Model (LLM) pre-training. It is widely known that training effectiveness is heavily influenced by the quality and diversity of the underlying dataset. However, traditional data filtering methods often treat quality and diversity as separate objectives, applying quality filters first and only then rebalancing domains.

This stepwise optimization approach overlooks the complex interplay between quality and diversity. High-quality datasets often exhibit domain bias, while diverse datasets might compromise quality. Therefore, optimizing both dimensions simultaneously to maximize model performance under a fixed training budget presents a significant challenge.

The QuaDMix framework operates in three stages: feature extraction, quality aggregation, and quality-diversity-aware sampling. Initially, each document is annotated with domain labels and multiple quality scores. These scores are normalized and combined to generate a comprehensive quality score. Subsequently, the system samples documents using a sigmoid-based function, prioritizing high-quality samples while ensuring domain balance through parameterized control.
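The aggregation-and-sampling stages above can be sketched as follows. This is a minimal illustration, not ByteDance's implementation: the weighted merging of quality scores, the per-domain sigmoid parameters (threshold, steepness, maximum keep-rate), and the document format are all assumptions made for the sake of the example.

```python
import math
import random

def aggregate_quality(scores, weights):
    """Merge several normalized quality scores (assumed in [0, 1]) into one
    comprehensive score via a weighted average."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def sampling_probability(quality, domain_params):
    """Sigmoid-shaped sampling curve: documents above the domain's quality
    threshold are kept with high probability. `steepness` controls how sharp
    the cutoff is; `max_rate` caps the keep-rate to balance domains."""
    threshold, steepness, max_rate = domain_params
    return max_rate / (1.0 + math.exp(-steepness * (quality - threshold)))

def select(documents, weights, params_by_domain, rng=None):
    """Quality-diversity-aware sampling: each document's keep-probability
    depends on its merged quality score and its domain's parameters."""
    rng = rng or random.Random(0)
    kept = []
    for doc in documents:
        q = aggregate_quality(doc["quality_scores"], weights)
        p = sampling_probability(q, params_by_domain[doc["domain"]])
        if rng.random() < p:
            kept.append(doc)
    return kept
```

Because both the quality weights and the per-domain sigmoid parameters are free parameters, the whole selection policy can be tuned as a single configuration, which is what the optimization stage described next exploits.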

To optimize the model, QuaDMix trains thousands of surrogate models under various parameter settings. A regression model trained on these surrogate experiments predicts performance outcomes, identifying the optimal sampling configuration. This approach enables structured exploration within a high-dimensional parameter space, better aligning data selection with downstream tasks.
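The surrogate-plus-regressor loop can be illustrated with a toy sketch. Everything here is a stand-in: `surrogate_score` replaces the expensive step of actually training and evaluating a small proxy model, and a simple k-nearest-neighbor regressor replaces whatever learned predictor the real pipeline uses. The point is only the structure: run a limited number of expensive experiments, fit a cheap predictor, then search many more candidate configurations through the predictor.

```python
import random

def surrogate_score(params):
    """Hypothetical stand-in for training a small proxy model under this
    sampling configuration and evaluating it. Here it is just a smooth
    made-up function with an optimum near 0.6 in every dimension."""
    return -sum((p - 0.6) ** 2 for p in params)

def knn_predict(observed, query, k=5):
    """Toy regressor: predict a config's score as the mean score of its k
    nearest observed configs (squared Euclidean distance)."""
    nearest = sorted(
        observed,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query)),
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)

def search_best_config(dim=4, n_surrogate=200, n_candidates=5000, seed=0):
    rng = random.Random(seed)
    # 1) Run a limited budget of expensive surrogate experiments
    #    at randomly sampled parameter settings.
    observed = []
    for _ in range(n_surrogate):
        p = [rng.random() for _ in range(dim)]
        observed.append((p, surrogate_score(p)))
    # 2) Score far more candidate configs cheaply through the regressor
    #    and keep the one with the best predicted performance.
    best, best_pred = None, float("-inf")
    for _ in range(n_candidates):
        cand = [rng.random() for _ in range(dim)]
        pred = knn_predict(observed, cand)
        if pred > best_pred:
            best, best_pred = cand, pred
    return best
```

The design choice mirrors the article's description: the expensive evaluations are bounded (here 200), while the regressor lets the search cover a far larger slice of the high-dimensional parameter space (here 5,000 candidates) at negligible cost.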

Experimental results on the RefinedWeb dataset show that QuaDMix achieves an average score of 39.5%, outperforming baseline selection methods including random selection, Fineweb-edu, AskLLM, and DCLM. The results demonstrate that the joint optimization strategy consistently surpasses methods focusing solely on quality or diversity. Furthermore, the optimized data mix enhances performance on specific downstream tasks.

QuaDMix provides a systematic solution for pre-training data selection in LLMs, addressing the long-standing challenge of simultaneously optimizing data quality and diversity. By combining quality aggregation and domain-aware sampling, QuaDMix establishes a scalable methodology that improves the efficiency of LLM pre-training.

Key Highlights:

🌟 QuaDMix is a new framework from ByteDance designed to simultaneously optimize data quality and diversity in Large Language Model (LLM) pre-training.

📈 The framework achieves data selection through a three-stage process: feature extraction, quality aggregation, and quality-diversity-aware sampling.

🔍 Experimental results demonstrate QuaDMix's superior performance across multiple benchmarks, achieving an average score of 39.5% and surpassing various traditional methods.