ByteDance recently announced QuaDMix, a novel data selection framework designed to enhance the efficiency and generalization capabilities of Large Language Model (LLM) pre-training. It is widely known that training effectiveness is heavily influenced by the quality and diversity of the underlying dataset. However, traditional data filtering methods often treat quality and diversity as separate objectives, applying quality filters first and only then rebalancing domains.

This stepwise optimization approach overlooks the complex interplay between quality and diversity. High-quality datasets often exhibit domain bias, while diverse datasets might compromise quality. Therefore, optimizing both dimensions simultaneously to maximize model performance under a fixed training budget presents a significant challenge.

The QuaDMix framework operates in three stages: feature extraction, quality aggregation, and quality-diversity-aware sampling. Initially, each document is annotated with domain labels and multiple quality scores. These scores are normalized and combined to generate a comprehensive quality score. Subsequently, the system samples documents using a sigmoid-based function, prioritizing high-quality samples while ensuring domain balance through parameterized control.
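The aggregation-and-sampling stages above can be sketched as follows. This is a minimal illustration, not ByteDance's implementation: the weighted merging of quality scores, the per-domain sigmoid parameters (threshold, steepness, maximum keep-rate), and the document format are all assumptions made for the sake of the example.

```python
import math
import random

def aggregate_quality(scores, weights):
    """Merge several normalized quality scores (assumed in [0, 1]) into one
    comprehensive score via a weighted average."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def sampling_probability(quality, domain_params):
    """Sigmoid-shaped sampling curve: documents above the domain's quality
    threshold are kept with high probability. `steepness` controls how sharp
    the cutoff is; `max_rate` caps the keep-rate to balance domains."""
    threshold, steepness, max_rate = domain_params
    return max_rate / (1.0 + math.exp(-steepness * (quality - threshold)))

def select(documents, weights, params_by_domain, rng=None):
    """Quality-diversity-aware sampling: each document's keep-probability
    depends on its merged quality score and its domain's parameters."""
    rng = rng or random.Random(0)
    kept = []
    for doc in documents:
        q = aggregate_quality(doc["quality_scores"], weights)
        p = sampling_probability(q, params_by_domain[doc["domain"]])
        if rng.random() < p:
            kept.append(doc)
    return kept
```

Because both the quality weights and the per-domain sigmoid parameters are free parameters, the whole selection policy can be tuned as a single configuration, which is what the optimization stage described next exploits.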

To optimize the model, QuaDMix trains thousands of surrogate models under various parameter settings. A regression model trained on these surrogate experiments predicts performance outcomes, identifying the optimal sampling configuration. This approach enables structured exploration within a high-dimensional parameter space, better aligning data selection with downstream tasks.
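The surrogate-plus-regressor loop can be illustrated with a toy sketch. Everything here is a stand-in: `surrogate_score` replaces the expensive step of actually training and evaluating a small proxy model, and a simple k-nearest-neighbor regressor replaces whatever learned predictor the real pipeline uses. The point is only the structure: run a limited number of expensive experiments, fit a cheap predictor, then search many more candidate configurations through the predictor.

```python
import random

def surrogate_score(params):
    """Hypothetical stand-in for training a small proxy model under this
    sampling configuration and evaluating it. Here it is just a smooth
    made-up function with an optimum near 0.6 in every dimension."""
    return -sum((p - 0.6) ** 2 for p in params)

def knn_predict(observed, query, k=5):
    """Toy regressor: predict a config's score as the mean score of its k
    nearest observed configs (squared Euclidean distance)."""
    nearest = sorted(
        observed,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query)),
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)

def search_best_config(dim=4, n_surrogate=200, n_candidates=5000, seed=0):
    rng = random.Random(seed)
    # 1) Run a limited budget of expensive surrogate experiments
    #    at randomly sampled parameter settings.
    observed = []
    for _ in range(n_surrogate):
        p = [rng.random() for _ in range(dim)]
        observed.append((p, surrogate_score(p)))
    # 2) Score far more candidate configs cheaply through the regressor
    #    and keep the one with the best predicted performance.
    best, best_pred = None, float("-inf")
    for _ in range(n_candidates):
        cand = [rng.random() for _ in range(dim)]
        pred = knn_predict(observed, cand)
        if pred > best_pred:
            best, best_pred = cand, pred
    return best
```

The design choice mirrors the article's description: the expensive evaluations are bounded (here 200), while the regressor lets the search cover a far larger slice of the high-dimensional parameter space (here 5,000 candidates) at negligible cost.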

Experimental results on the RefinedWeb dataset show that QuaDMix achieves an average score of 39.5%, outperforming baseline selection methods including random selection, Fineweb-edu, AskLLM, and DCLM. The results demonstrate that the joint optimization strategy consistently surpasses methods focusing solely on quality or diversity. Furthermore, the optimized data mix enhances performance on specific downstream tasks.

QuaDMix provides a systematic solution for pre-training data selection in LLMs, addressing the long-standing challenge of simultaneously optimizing data quality and diversity. By combining quality aggregation and domain-aware sampling, QuaDMix establishes a scalable methodology that improves the efficiency of LLM pre-training.

Key Highlights:

🌟 QuaDMix is a new framework from ByteDance designed to simultaneously optimize data quality and diversity in Large Language Model (LLM) pre-training.

📈 The framework achieves data selection through a three-stage process: feature extraction, quality aggregation, and quality-diversity-aware sampling.

🔍 Experimental results demonstrate QuaDMix's superior performance across multiple benchmarks, achieving an average score of 39.5% and surpassing various traditional methods.