In the world of AI, data is like gold ore: the richer the vein, the more it shines. LLM360 has recently released an impressive dataset called TxT360, built specifically for large language model training. This colossal dataset not only draws high-quality text from a wide range of domains but has also been globally deduplicated, ultimately yielding 5.7 trillion high-quality tokens and truly earning the title of "treasure chest of the data world"!
The allure of TxT360 lies in its enormous scale and exceptional quality, surpassing existing datasets like FineWeb and RedPajama. It distills the essence of the web from 99 Common Crawl snapshots and adds 14 carefully selected high-quality sources, such as legal documents and encyclopedias, making its content not only rich and diverse but also highly reliable.
Even cooler, TxT360 comes with a "data weighting recipe" that lets you flexibly adjust the weights of different data sources to suit your needs. It's like cooking: you mix the ingredients to taste, and every bite comes out right.
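To make the idea concrete, here is a minimal sketch of per-source weighted sampling. The source names and weights below are purely illustrative assumptions, not TxT360's published recipe:

```python
# Hypothetical per-source weighting sketch; the source names and weights are
# illustrative assumptions, not the actual TxT360 recipe.
import random

source_weights = {
    "common_crawl": 1.0,   # web snapshots (baseline weight)
    "wikipedia": 3.0,      # curated encyclopedia text, upweighted
    "freelaw": 2.0,        # legal documents, upweighted
}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick a data source in proportion to its assigned weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in source_weights}
for _ in range(10_000):
    counts[sample_source(source_weights, rng)] += 1
print(counts)  # curated sources show up roughly 2-3x as often as web text
```

Turning the weights up or down is all it takes to shift the mix toward more web text or more curated material.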
Deduplication is another highlight of TxT360. Through a careful global deduplication pipeline, the dataset tackles redundancy and repeated content, so duplicate documents are removed rather than counted over and over during training. The project team also uses regular expressions to strip personally identifiable information, such as email and IP addresses, from documents, helping protect data privacy and security.
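The PII-scrubbing step can be pictured with a small regex sketch. The exact patterns and placeholder tokens TxT360 uses are not spelled out here, so treat the ones below as assumptions:

```python
# Minimal regex-based PII scrubbing sketch (emails and IPv4 addresses).
# The patterns and replacement tokens are assumptions, not TxT360's exact rules.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return text

print(scrub_pii("Contact alice@example.com from 192.168.0.1"))
# -> Contact <EMAIL> from <IP_ADDRESS>
```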
TxT360's design emphasizes quality as much as scale. By combining the strengths of web data and curated sources, it gives researchers precise control over how the data is used and distributed, like a remote control for dialing in the data mix.
On the training side, TxT360 is no less competitive. A simple upsampling strategy significantly expands the amount of data, ultimately producing a corpus exceeding 15 trillion tokens. Across a series of key evaluation benchmarks, TxT360 outperforms FineWeb, especially on MMLU and NQ, showing strong downstream performance. When code data (such as Stack V2) is mixed in, the learning curve becomes more stable and model performance improves noticeably.
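The arithmetic behind upsampling is simple: repeat some sources a few times and the effective token budget grows. The per-source token counts and repeat factors below are rough illustrative assumptions, not the numbers LLM360 actually used to reach 15T+ tokens:

```python
# Back-of-the-envelope upsampling sketch. The splits of the 5.7T base and the
# repeat factors are illustrative assumptions, not TxT360's actual recipe.
token_counts_trillions = {
    "web": 4.5,      # assumed web share of the ~5.7T base
    "curated": 1.2,  # assumed curated share
}
upsample_factors = {
    "web": 2.5,      # hypothetical repeat factor for web text
    "curated": 3.0,  # hypothetical repeat factor for curated text
}

total = sum(token_counts_trillions[s] * upsample_factors[s]
            for s in token_counts_trillions)
print(f"Effective training tokens: ~{total:.1f}T")  # ~14.9T with these factors
```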
Detailed introduction: https://huggingface.co/spaces/LLM360/TxT360