In a recent study, Google collaborated with Carnegie Mellon University and the MultiOn team to investigate the impact of synthetic data on training large language models (LLMs). They found that synthetic data significantly improved the logical reasoning capabilities of LLMs, particularly in solving mathematical problems, resulting in an astounding eightfold increase in performance. This discovery holds significant implications given the current scarcity of training data.
Currently, approximately 300 trillion tokens of high-quality text are available worldwide. With models like ChatGPT growing in popularity, however, the demand for training data is escalating rapidly and is projected to outstrip supply by 2026. Against this backdrop, synthetic data emerges as a crucial alternative.
The research team primarily explored two types of synthetic data: positive and negative. Positive data consists of correct, step-by-step problem solutions generated by strong models such as GPT-4 and Gemini 1.5 Pro, which serve as examples for the model being trained to imitate. Relying solely on positive data has limitations, however: a model may learn to reproduce surface patterns without truly understanding the problem-solving process, which hurts its ability to generalize.
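To make the setup concrete, here is a minimal sketch of how positive synthetic data of this kind is often collected: sample several candidate solutions from a strong model and keep only those whose final answer matches the reference. The function names and the crude answer-checking convention are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: collecting positive synthetic data by filtering model-generated solutions.
# sample_solutions is a hypothetical callable that queries a strong model
# (e.g. GPT-4 or Gemini 1.5 Pro) for k step-by-step solutions to a question.
from typing import Callable, Iterable

def collect_positive_data(
    problems: Iterable[dict],                      # each item: {"question": str, "answer": str}
    sample_solutions: Callable[[str, int], list],  # returns k candidate solutions as strings
    k: int = 4,
) -> list:
    """Keep only generated solutions whose final answer matches the reference answer."""
    positives = []
    for item in problems:
        for solution in sample_solutions(item["question"], k):
            # Crude final-answer check: compare the last whitespace-separated chunk
            # of the solution with the reference (real pipelines parse more carefully).
            parts = solution.strip().split()
            if parts and parts[-1] == item["answer"].strip():
                positives.append({"question": item["question"], "solution": solution})
    return positives
```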
To overcome these limitations, the team introduced negative data: problem-solving traces that contain incorrect steps. This data helps models recognize common errors and thereby strengthens their logical reasoning. Using negative data directly is challenging, since incorrect steps can mislead a model, so the researchers employed Direct Preference Optimization (DPO) to let models learn effectively from mistakes by weighing how much each step contributes to reaching a correct or incorrect answer.
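For readers unfamiliar with DPO, the objective contrasts a preferred (correct) solution against a dispreferred (incorrect) one for the same problem, relative to a frozen reference model. Below is a minimal PyTorch sketch of the standard DPO loss, assuming the summed log-probabilities of each solution have already been computed; it illustrates the idea rather than reproducing the paper's training code.

```python
# Sketch of the standard DPO loss over (correct, incorrect) solution pairs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log pi_theta(correct solution | problem)
    policy_rejected_logp: torch.Tensor,  # log pi_theta(incorrect solution | problem)
    ref_chosen_logp: torch.Tensor,       # log pi_ref(correct solution | problem)
    ref_rejected_logp: torch.Tensor,     # log pi_ref(incorrect solution | problem)
    beta: float = 0.1,
) -> torch.Tensor:
    """Push the policy to prefer correct solutions over incorrect ones,
    measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```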
The study used models such as DeepSeek-Math-7B and LLaMA2-7B, evaluating them extensively on the GSM8K and MATH datasets. Results showed that LLMs trained on both positive and negative synthetic data achieved an eightfold improvement on mathematical reasoning tasks. The work not only demonstrates the potential of synthetic data for boosting the logical reasoning capabilities of LLMs but also points to new directions for future model training.
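For context, accuracy on GSM8K-style benchmarks is usually scored by extracting the final numeric answer from the model's output and comparing it with the reference answer, which in GSM8K follows a "####" marker. The sketch below shows one common way to do this; the extraction regex is an assumption, not the evaluation code used in the study.

```python
# Sketch: scoring final-answer accuracy on GSM8K-style data.
import re
from typing import Optional

def extract_last_number(text: str) -> Optional[str]:
    """Return the last number appearing in the text, with commas stripped."""
    matches = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_accuracy(predictions: list, references: list) -> float:
    """Exact-match accuracy of the final numeric answer against the '####' reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        gold = ref.split("####")[-1].strip().replace(",", "")
        guess = extract_last_number(pred)
        correct += int(guess is not None and guess == gold)
    return correct / max(len(references), 1)
```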