Recently, a joint research team from Google, Carnegie Mellon University, and MultiOn published a new study on the application of synthetic data in large model training. According to a report by Epoch AI, a research institute focusing on AI development, there are currently about 300 trillion publicly available high-quality text training tokens. However, with the rapid development of large models like ChatGPT, the demand for training data is growing exponentially, and it's projected that this data will be exhausted before 2026. Therefore, synthetic data is becoming a crucial alternative.
The researchers explored two main types of synthetic data: positive data and negative data. Positive data consists of correct problem solutions generated by capable large models (such as GPT-4 and Gemini 1.5 Pro), giving the model worked examples of how to solve mathematical problems. However, relying solely on positive data for training has limitations. First, this approach may not expose the underlying logic of the problem-solving process; the model might learn by pattern matching rather than genuine understanding. Second, as the amount of training data grows, the model may pick up spurious correlations, hurting its ability to generalize to new problems.
Therefore, the researchers introduced negative data: problem-solving traces whose steps have been verified as incorrect. Exposure to these traces helps the model recognize and avoid errors, strengthening its logical reasoning. Using negative data is not straightforward, since incorrect steps can carry misleading signals, but the researchers were able to make the model learn from its mistakes with Direct Preference Optimization (DPO), which weights the importance of each problem-solving step.
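As a concrete illustration, the sketch below shows the standard pairwise DPO objective applied to pairs of verified-correct and verified-incorrect solutions to the same problem. This is a minimal sketch of the general technique, not the paper's exact implementation; the function name, argument names, and the β value are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_correct_logp, policy_incorrect_logp,
             ref_correct_logp, ref_incorrect_logp, beta=0.1):
    """Pairwise DPO loss over (correct, incorrect) solutions to the same problem.

    Each argument is a tensor of summed log-probabilities of a full solution
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-probability ratios against the reference model.
    correct_reward = beta * (policy_correct_logp - ref_correct_logp)
    incorrect_reward = beta * (policy_incorrect_logp - ref_incorrect_logp)
    # Push the policy to prefer verified-correct solutions over verified-incorrect ones.
    return -F.logsigmoid(correct_reward - incorrect_reward).mean()
```

Treating the incorrect solution as the "rejected" response is what lets the model extract a learning signal from negative data rather than simply imitating correct answers.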
The DPO method assigns an advantage value to each problem-solving step, reflecting its value relative to the ideal solution. The study shows that high-advantage steps are key to correct solutions, while low-advantage steps may indicate problems in the model's reasoning. Using these advantage values, the model can dynamically adjust its strategy within a reinforcement learning framework to learn and improve more efficiently from synthetic data.
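In DPO, the trained policy's log-probability ratio against the frozen reference model acts as an implicit reward, and one natural way to read off a per-step signal is to evaluate that ratio step by step. The formulas below are a hedged sketch of this idea in our own notation; the paper's exact advantage estimator may differ.

```latex
% Sketch: DPO's implicit reward and a per-step advantage proxy (notation ours).
\[
  r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
  \qquad
  A_t \approx \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
\]
```

Here \(y_t\) denotes the \(t\)-th problem-solving step; steps with high advantage are credited with driving correct solutions, while low-advantage steps flag likely reasoning errors.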
To validate the effectiveness of synthetic data, the research team ran comprehensive tests on the GSM8K and MATH datasets using models such as DeepSeek-Math-7B and LLaMA-2-7B. The results show that models trained on both positive and negative synthetic data reached a given level of mathematical reasoning performance with roughly eight times less positive data, an eight-fold improvement in data efficiency. The research demonstrates the considerable potential of synthetic data for improving the logical reasoning capabilities of large models.
Key Highlights:
📊 Synthetic data offers an effective solution to the growing demand for training data.
🧩 Combining positive and negative data enhances the model's mathematical reasoning and logical abilities.
🚀 The study reports an eight-fold gain in data efficiency on mathematical reasoning after training large models on synthetic data.