IBM open-sourced its Granite 13B LLM for enterprise applications in May. Now Armand Ruiz, vice president of IBM's AI platform products, has publicly disclosed the full contents of the comprehensive 6.48TB dataset used to train Granite 13B.

After strict preprocessing, the dataset was reduced to 2.07TB, a 68% reduction. Ruiz emphasizes that this step is crucial to ensuring a high-quality, unbiased, ethical, and legally compliant dataset that meets the needs of enterprise applications.
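
The 68% figure follows directly from the two sizes. As a quick sanity check (plain Python, no IBM tooling involved):

```python
# Shrinkage from the raw 6.48TB corpus to the cleaned 2.07TB corpus.
raw_tb, clean_tb = 6.48, 2.07
reduction = (raw_tb - clean_tb) / raw_tb
print(f"{reduction:.1%}")  # -> 68.1%
```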

The dataset is carefully curated from multiple public sources (a short sketch after the list shows one way to sample them), including:

- arXiv: Over 2.4 million preprint scientific papers.

- Common Crawl: An open repository of web crawl data.

- DeepMind Mathematics: Math question and answer pairs.

- Free Law: Public domain legal opinions from U.S. courts.

- GitHub Clean: Code data from CodeParrot.

- Hacker News: Computer science and entrepreneurship news from 2007 to 2018.

- OpenWebText: An open-source recreation of OpenAI's WebText corpus.

- Project Gutenberg (PG-19): Free e-books, with a focus on older works.

- PubMed Central: Biomedical and life science papers.

- SEC Filings: 10-K/Q submissions from the U.S. Securities and Exchange Commission (SEC) from 1934 to 2022.

- Stack Exchange: User contributions on the Stack Exchange network.

- USPTO: U.S. patents granted from May 1975 to May 2023.

- Webhose: Unstructured web content converted into machine-readable data.

- Wikimedia: Eight English Wikimedia projects.
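
Most of these sources have public mirrors. As a minimal sketch of sampling one (not IBM's tooling), the snippet below streams a few PG-19 records via the Hugging Face `datasets` library; the Hub ID `deepmind/pg19` and the field names are assumptions about that mirror, not anything IBM specifies.

```python
# Minimal sketch: stream a few records from one of the public sources above.
# Assumes the Hugging Face `datasets` library and the `deepmind/pg19` Hub
# mirror of PG-19; the field names are assumptions about that mirror.
from datasets import load_dataset

# Streaming avoids downloading the multi-gigabyte corpus up front.
pg19 = load_dataset("deepmind/pg19", split="train", streaming=True)

for i, record in enumerate(pg19):
    print(record["short_book_title"], record["publication_date"])
    if i == 4:  # peek at five books, then stop
        break
```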

The preprocessing pipeline comprises text extraction, deduplication, language identification, sentence splitting, hate/abuse/profanity (HAP) annotation, document quality annotation, URL block-list annotation, filtering, and tokenization.

These steps annotate each document and then filter against set thresholds, ensuring that the final dataset is of the highest quality for model training.
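
IBM has not published pipeline code alongside this disclosure, but the annotate-then-filter pattern it describes is straightforward. The sketch below illustrates it with standard-library Python: exact-hash deduplication, a toy quality score, and a threshold filter. Every heuristic and threshold here is an illustrative placeholder, not IBM's.

```python
# Minimal sketch of the annotate-then-threshold-filter pattern described
# above. All heuristics and thresholds are illustrative placeholders.
import hashlib

def annotate(doc: str) -> dict:
    """Attach toy annotations to a document."""
    words = doc.split()
    return {
        "text": doc,
        "sha256": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
        # Toy "quality" proxy: average word length (placeholder metric).
        "quality": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

def preprocess(docs: list[str], min_quality: float = 3.0) -> list[str]:
    """Exact-duplicate removal plus threshold-based quality filtering."""
    seen: set[str] = set()
    kept: list[str] = []
    for record in map(annotate, docs):
        if record["sha256"] in seen:
            continue  # deduplication: drop exact repeats
        seen.add(record["sha256"])
        if record["quality"] >= min_quality:  # filter against a set threshold
            kept.append(record["text"])
    return kept

if __name__ == "__main__":
    corpus = ["A short example document.", "A short example document.", "ok ok"]
    print(preprocess(corpus))  # keeps one copy of the first doc only
```

In the real pipeline, each annotation stage (language, HAP, document quality) would emit scores analogous to the toy `quality` field, and the filtering stage would compare them against curated thresholds.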

IBM has released four versions of the Granite code models, with parameter counts ranging from 3 billion to 34 billion. On a range of benchmarks, these models outperformed comparable models such as Code Llama and Llama 3.

Key points:

⭐ IBM publicly disclosed the full contents of the 6.48TB dataset used to train the Granite 13B LLM.

⭐ Strict preprocessing reduced the dataset to 2.07TB, a 68% reduction.

⭐ IBM released four versions of the Granite code models, with parameter counts ranging from 3 billion to 34 billion.