A recent joint study by Google, Carnegie Mellon University, and MultiOn explores the use of synthetic data for training large language models. According to Epoch AI, a research institution focused on AI development, the stock of high-quality text training data currently available totals roughly 300 trillion tokens. However, with the rapid advancement of large models such as ChatGPT, demand for training data is growing exponentially and is projected to exhaust this supply by 2026. Synthetic data is therefore becoming increasingly important.