In a recent live conversation with Mark Penn, Chairman of Stagwell, Tesla and SpaceX CEO Elon Musk said that the real-world data available for training artificial intelligence models is almost exhausted. Musk said, "We have basically consumed all the accumulated human knowledge... used for AI training data. This phenomenon essentially occurred last year."
Musk's view echoes the "peak data" argument made by former OpenAI Chief Scientist Ilya Sutskever at the NeurIPS conference last December. Sutskever said that the AI industry is running short of training data, and that this shortage will force a change in the way AI models are developed.
Musk sees synthetic data as the way forward. He said the only way to supplement real-world data is with synthetic data, meaning training data generated by AI models themselves. With synthetic data, he said, AI can improve its performance by evaluating its own output and continuously optimizing itself.
Many tech companies, including Microsoft, Meta, OpenAI, and Anthropic, already use synthetic data to train their flagship AI models. Gartner estimated that 60% of the data used for AI and data analytics projects in 2024 would be synthetically generated.
One major advantage of synthetic data is that it can significantly reduce development costs. However, Musk and other experts also point out that synthetic data is not without risks. Research suggests that training on synthetic data can degrade model performance, making outputs less creative and more biased. If the synthetic data itself has limitations, the final model's output will inherit them.
Key Points:
🌍 Musk says the real-world data available for training AI is nearly exhausted.
💡 Synthetic data is considered an important solution for the future, and many tech companies have begun to adopt it.
💰 Using synthetic data can significantly reduce development costs, but it also risks degrading model performance.