Not long ago, Epoch AI, a research institution focused on the AIGC field, released a significant research report. The report estimates that the stock of high-quality, publicly available human-written text for training amounts to roughly 300 trillion tokens. However, given the growing appetite of large models like ChatGPT, this data could be exhausted sometime between 2026 and 2032!
The researchers single out "overtraining", that is, training a model on far more tokens than is compute-optimal for its size, as the main culprit accelerating the consumption of training data. For instance, the 8B version of the recently open-sourced Llama 3 was overtrained by an astonishing factor of 100! If all models followed this approach, our data might be depleted as early as 2025.
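To get a feel for where a figure like "100 times" can come from, here is a back-of-the-envelope sketch. It is not taken from the report: it assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter, and the roughly 15 trillion training tokens that Meta has stated for Llama 3. The function name `overtraining_factor` is made up for illustration.

```python
# Assumption (not from the report): a compute-optimal model sees ~20 tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(params: float, train_tokens: float,
                        tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """Ratio of actual training tokens to the assumed compute-optimal token count."""
    optimal_tokens = params * tokens_per_param
    return train_tokens / optimal_tokens


# Assumption: Llama 3 8B has ~8e9 parameters and was trained on ~15e12 tokens.
factor = overtraining_factor(params=8e9, train_tokens=15e12)
print(f"Llama 3 8B overtraining factor: about {factor:.0f}x")  # about 94x
```

Under these assumptions the ratio comes out at roughly 94, which is in line with the "100 times" figure quoted above. A different tokens-per-parameter heuristic would shift the number, so treat it as an order-of-magnitude check only.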
But don't worry, we still have strategies. Epoch AI proposes four new ways of acquiring training data, so that the AI industry's "data famine" need not become a nightmare.
1) Synthetic data: Like a gourmet meal made from a seasoning packet, synthetic data uses deep learning to imitate real data and generate new data. However, don't get too excited; the quality of synthetic data varies, and models trained on it can easily overfit, since it lacks the subtle linguistic features of real text.
2) Multimodal and cross-domain data learning: This method is not limited to text; it also draws on other data types such as images, video, and audio. It's like a night at KTV, where you can sing, dance, and perform: multimodal learning lets models understand and handle complex tasks more comprehensively.
3) Private data: Private text data worldwide currently totals approximately 3,100 trillion tokens, more than ten times the public data! However, using private data requires caution, as privacy and security are paramount. Moreover, acquiring and integrating non-public data can be quite complex.
4) Real-time interactive learning with the real world: This method allows models to learn and improve through direct interaction with the real world. It requires models to have autonomy and adaptability, accurately understanding user instructions and taking action in the real world.