Recently, the AI community has run into a strange phenomenon, a bit like a food vlogger who suddenly starts eating only their own dishes: the more they eat, the worse the dishes get, yet they cannot stop. Alarming as that sounds, the technical term for it is model collapse.

What is model collapse? In simple terms, it happens when an AI model is trained, generation after generation, on data generated by itself or its predecessors. Each round the output quality degrades a little more, in a vicious cycle that eventually leaves the model useless.

Think of a closed ecosystem in which the AI model is the sole inhabitant and the data it produces is its food. At first it can still find some natural ingredients (real data), but over time it relies more and more on its own "artificial" ingredients (synthetic data). The problem is that these artificial ingredients are nutritionally deficient and carry the model's own inherent flaws. Eat too much of them and the model's "health" deteriorates, its outputs growing increasingly unreliable, as the sketch below shows.
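To see this feedback loop in miniature, here is a tiny self-contained sketch (my own illustration, not an experiment from the paper): a "model" that is just a 1-D Gaussian, refit each generation on samples drawn from the previous generation's fit. Estimation error compounds, so the fitted distribution drifts away from the truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0 trains on "natural ingredients": real samples from N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(30):
    # "Train" the model: fit a Gaussian to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # The next generation eats only the model's own outputs (synthetic data).
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

With only 100 samples per generation, sampling noise accumulates: the mean wanders and, over many generations, the spread tends to decay, the toy analogue of outputs growing blander and less reliable.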


This paper investigates the phenomenon of model collapse and seeks to answer two critical questions:

Is model collapse inevitable? Can the issue be resolved by blending real data with synthetic data?

Does a larger model size make collapse more likely?

To explore these questions, the authors designed a series of experiments, using random projection models as a theoretically tractable stand-in for neural network training. They found that even a small fraction of synthetic data in the training mix (e.g., 1%) can still lead to model collapse, and that in their setting, increasing the model size generally makes the collapse more severe.
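To get a feel for the mixing question, here is a toy sketch (the dimensions and noise levels are my own illustrative choices, not the paper's random-projection setup): each generation fits a linear regression on a training set in which a fraction p_synth of the labels comes from the previous generation's model, and we measure how far the final fit lands from the true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 20, 200, 0.5          # illustrative sizes, not the paper's
w_true = rng.normal(size=d)         # the ground-truth linear model

def fit(X, y):
    # Ordinary least squares via the pseudo-inverse.
    return np.linalg.pinv(X) @ y

def excess_error(w):
    # Squared distance to the true parameters (excess test risk, up to noise).
    return float(np.sum((w - w_true) ** 2))

for p_synth in (0.0, 0.01, 0.1):    # fraction of synthetic labels per generation
    w_prev = None
    for gen in range(5):
        X = rng.normal(size=(n, d))
        y = X @ w_true + noise * rng.normal(size=n)   # real, noisy labels
        if w_prev is not None and p_synth > 0:
            k = int(p_synth * n)
            y[:k] = X[:k] @ w_prev  # replace some labels with model-generated ones
        w_prev = fit(X, y)
    print(f"p_synth={p_synth:.2f}: excess error after 5 generations = "
          f"{excess_error(w_prev):.4f}")
```

The paper's theoretical claim is stronger than anything this toy run can show: even a vanishing fraction of synthetic data can keep the test error from going to zero as the amount of training data grows. The sketch only illustrates the feedback mechanism by which one generation's errors become the next generation's training signal.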


It is like a food vlogger who experiments with ever more bizarre ingredients to attract viewers, ends up with a stomachache, doubles down on even stranger items to win the audience back, makes things worse, and is ultimately forced out of the business.

So, how can we avoid model collapse?

The authors of the paper suggest several strategies:

Prioritize real data: Real data is like natural ingredients, rich in nutrients and essential for the healthy development of AI models.

Use synthetic data cautiously: While synthetic data can supplement some nutrients, over-reliance can backfire.

Control model size: Larger models have bigger appetites and are more prone to "stomachaches." When using synthetic data, manage the model's size to avoid overfeeding.

Model collapse is a new challenge in the development of AI. It reminds us that while chasing model scale and efficiency, we must also pay attention to data quality and model health. Only then can AI models continue to develop healthily and create real value for society.

Paper: https://arxiv.org/pdf/2410.04840