Recently, Meta AI researcher Thomas Scialom shared insights into the company's latest model, Llama3, in an interview. He candidly pointed out that much of the text on the internet is of poor quality, and that fine-tuning on such data is a waste of compute. As a result, Llama3's post-training does not rely on any human-written answers; it is built entirely on synthetic data generated by Llama2.


Discussing the training details of Llama3, Scialom elaborated on how synthetic data was applied across several domains. For code generation, they used three methods to produce synthetic data: feedback from code execution, translation between programming languages, and backtranslation from documentation. For mathematical reasoning, they drew on the "Let's Verify Step by Step" line of research for data generation. For multilingual capability, Llama3 was continually pre-trained on a data mix of 90% multilingual tokens, and high-quality human annotations were collected, which is particularly important for multilingual processing.
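To make the code-generation approach concrete, here is a minimal sketch of execution-feedback filtering: sample candidate solutions from a teacher model, run them against unit tests, and keep only the pairs whose code actually executes and passes. The `generate` callable, the unit-test format, and the helper names are illustrative assumptions, not Meta's actual pipeline.

```python
# Illustrative sketch of execution-feedback filtering for synthetic code data.
# The generate() callable and unit-test format are assumptions, not Meta's pipeline.
import os
import subprocess
import tempfile


def passes_unit_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a generated solution together with its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0  # non-zero exit code means a test failed
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)


def build_sft_examples(prompts, tests, generate, num_samples=8):
    """Keep only (prompt, solution) pairs whose code runs and passes its tests."""
    kept = []
    for prompt, test_code in zip(prompts, tests):
        for _ in range(num_samples):
            candidate = generate(prompt)  # e.g. sampled from a Llama 2 teacher
            if passes_unit_tests(candidate, test_code):
                kept.append({"prompt": prompt, "response": candidate})
                break  # one verified solution per prompt is enough here
    return kept
```

Filtering on execution results is what allows low-quality samples to be discarded automatically, without any human-written reference answers.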

Long-context processing is also a focus for Llama3, which relies on synthetic data for long-context question answering, long-document summarization, and reasoning over codebases. For tool use, Llama3 was trained to call Brave Search, Wolfram Alpha, and a Python interpreter, supporting single, nested, parallel, and multi-turn function calls.
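The function-calling setup can be illustrated with a small driver loop: the model may emit several tool calls in one turn (parallel), their results are fed back as tool messages so it can issue follow-up calls (nested, multi-turn), and the loop ends when it answers in plain text. The JSON schema, the `chat` callable, and the tool wrappers below are assumptions for illustration, not Llama3's actual interface.

```python
# Minimal sketch of a tool-call loop. The message schema, tool wrappers, and
# chat() callable are illustrative assumptions, not Llama3's real format.
import json


def brave_search(query: str) -> str:        # assumed wrapper around Brave Search
    return f"<search results for {query!r}>"

def wolfram_alpha(expression: str) -> str:  # assumed wrapper around Wolfram Alpha
    return f"<computed {expression!r}>"

def python_interpreter(code: str) -> str:   # assumed sandboxed Python execution
    return f"<executed {code!r}>"

TOOLS = {"brave_search": brave_search,
         "wolfram_alpha": wolfram_alpha,
         "python": python_interpreter}


def run_tool_loop(chat, messages, max_turns=5):
    """Multi-turn loop: execute every tool call the model emits in a turn,
    append the results, and stop when the model replies with plain text."""
    for _ in range(max_turns):
        reply = chat(messages)               # model returns text or tool calls
        calls = reply.get("tool_calls", [])
        if not calls:
            return reply["content"]          # final answer, no more tools needed
        for call in calls:                   # parallel calls executed in sequence
            fn = TOOLS[call["name"]]
            result = fn(**json.loads(call["arguments"]))
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})
    return None
```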

Scialom also stressed the importance of Reinforcement Learning from Human Feedback (RLHF) in Llama3's training. The team made extensive use of human preference data and emphasized that humans are far better at choosing between two outputs (such as preferring one poem over another) than at creating one from scratch.
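A common way to turn such pairwise preferences into a training signal is a Bradley-Terry style reward-model loss, sketched below in PyTorch. The `reward_model` interface and tensor shapes are assumptions, and this is a generic RLHF recipe rather than Meta's exact implementation.

```python
# Generic pairwise preference (reward-model) loss, as commonly used in RLHF.
# The reward_model interface and shapes are assumptions for illustration.
import torch
import torch.nn.functional as F


def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style loss: push the reward of the human-preferred response
    (e.g. the poem the annotator liked) above that of the rejected one."""
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)  # (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```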

Meta began training Llama4 in June, and Scialom revealed that a major focus of Llama4 will be agents. He also mentioned a multimodal version of Llama, which will have more parameters and is planned for release in the near future.

Scialom's interview sheds light on the latest advancements and future directions of Meta AI in the field of artificial intelligence, particularly in how to leverage synthetic data and human feedback to enhance model performance.