Recently, the AI community has run into a strange phenomenon, a bit like a food vlogger who suddenly starts eating only their own dishes: the more they eat, the worse the dishes get, yet they cannot stop. Alarming as that sounds, the technical term for it is model collapse.

What is model collapse? In simple terms, it happens when an AI model is trained, generation after generation, on data generated by itself or its predecessors. Each round the output quality degrades a little more, in a vicious cycle that eventually leaves the model useless.

Think of a closed ecosystem in which the AI model is the sole inhabitant and the data it produces is its food. At first it can still find some natural ingredients (real data), but over time it relies more and more on its own "artificial" ingredients (synthetic data). The problem is that these artificial ingredients are nutritionally deficient and carry the model's own inherent flaws. Eat too much of them and the model's "health" deteriorates, its outputs growing increasingly unreliable, as the sketch below shows.
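To see this feedback loop in miniature, here is a tiny self-contained sketch (my own illustration, not an experiment from the paper): a "model" that is just a 1-D Gaussian, refit each generation on samples drawn from the previous generation's fit. Estimation error compounds, so the fitted distribution drifts away from the truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0 trains on "natural ingredients": real samples from N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(30):
    # "Train" the model: fit a Gaussian to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # The next generation eats only the model's own outputs (synthetic data).
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

With only 100 samples per generation, sampling noise accumulates: the mean wanders and, over many generations, the spread tends to decay, the toy analogue of outputs growing blander and less reliable.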


This paper investigates the phenomenon of model collapse and seeks to answer two critical questions:

Is model collapse inevitable? Can the issue be resolved by blending real data with synthetic data?

Does a larger model size make collapse more likely?

To explore these questions, the authors designed a series of experiments, using random projection models as a theoretically tractable stand-in for neural network training. They found that even a small fraction of synthetic data in the training mix (e.g., 1%) can still lead to model collapse, and that in their setting, increasing the model size generally makes the collapse more severe.
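To get a feel for the mixing question, here is a toy sketch (the dimensions and noise levels are my own illustrative choices, not the paper's random-projection setup): each generation fits a linear regression on a training set in which a fraction p_synth of the labels comes from the previous generation's model, and we measure how far the final fit lands from the true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 20, 200, 0.5          # illustrative sizes, not the paper's
w_true = rng.normal(size=d)         # the ground-truth linear model

def fit(X, y):
    # Ordinary least squares via the pseudo-inverse.
    return np.linalg.pinv(X) @ y

def excess_error(w):
    # Squared distance to the true parameters (excess test risk, up to noise).
    return float(np.sum((w - w_true) ** 2))

for p_synth in (0.0, 0.01, 0.1):    # fraction of synthetic labels per generation
    w_prev = None
    for gen in range(5):
        X = rng.normal(size=(n, d))
        y = X @ w_true + noise * rng.normal(size=n)   # real, noisy labels
        if w_prev is not None and p_synth > 0:
            k = int(p_synth * n)
            y[:k] = X[:k] @ w_prev  # replace some labels with model-generated ones
        w_prev = fit(X, y)
    print(f"p_synth={p_synth:.2f}: excess error after 5 generations = "
          f"{excess_error(w_prev):.4f}")
```

The paper's theoretical claim is stronger than anything this toy run can show: even a vanishing fraction of synthetic data can keep the test error from going to zero as the amount of training data grows. The sketch only illustrates the feedback mechanism by which one generation's errors become the next generation's training signal.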


It is like a food vlogger who experiments with ever more bizarre ingredients to attract viewers, ends up with a stomachache, doubles down on even stranger items to win the audience back, makes things worse, and is ultimately forced out of the business.

So, how can we avoid model collapse?

The authors of the paper suggest several strategies:

Prioritize real data: Real data is like natural ingredients, rich in nutrients and essential for the healthy development of AI models.

Use synthetic data cautiously: While synthetic data can supplement some nutrients, over-reliance can backfire.

Control model size: Larger models have bigger appetites and are more prone to "stomachaches." When using synthetic data, manage the model's size to avoid overfeeding.

Model collapse is a new challenge in the development of AI. It reminds us that while chasing model scale and efficiency, we must also pay attention to data quality and model health. Only then can AI models continue to develop healthily and create real value for society.

Paper: https://arxiv.org/pdf/2410.04840