The scarcity of training data for large AI models has once again drawn media attention. A recent article in The Economist, titled "AI Companies Are Rapidly Exhausting Most Internet Data," has sparked widespread discussion in the industry. The article argues that as high-quality data on the internet dries up, the AI field is running into a "data wall."
Research firm Epoch AI predicts that all high-quality textual data on the internet will be exhausted by 2028, and that machine learning datasets may use up all "high-quality language data" as early as 2026. This "data wall" has become a significant problem for AI companies and could slow the pace of model training.
The industry has been aware of this issue for some time. In July 2023, Stuart Russell, a professor at the University of California, Berkeley, warned that AI chatbots such as ChatGPT might soon "exhaust the text in the universe." Not everyone agrees, however. In May 2024, Stanford University professor Fei-Fei Li said that a wealth of differentiated data remains to be mined for building more customized models.
Synthetic data has emerged as one potential solution to the shortage. However, a recent paper in the journal Nature points out that training successive generations of machine learning models on AI-generated datasets could lead to "model collapse," in which models come to misinterpret reality. The research team suggests retaining some original data in the training set, using diverse data sources, and developing more robust training algorithms.
How to break through the "data wall" and secure a continuous supply of high-quality training data has become an urgent question for the AI industry. Answering it requires not only technological innovation but also collaboration among governments, businesses, and research institutions. As AI becomes more deeply integrated into various industries, solving the data scarcity problem will have a profound impact on its sustained and healthy development.