Together AI has released RedPajama-V2, an open dataset of 30 trillion tokens designed for training large language models. High-quality data is crucial to the success of open models such as Llama, Mistral, Falcon, MPT, and the original RedPajama. RedPajama-V2 emphasizes broad coverage of CommonCrawl: it ships the raw text together with quality annotations and deduplication clusters, so researchers can filter and weight the data to suit their own training recipes. The release is significant for AI research and applications, providing a foundation for building more capable language models and a resource that is expected to further advance the field.
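To make the idea of deduplication clusters concrete, here is a toy sketch of MinHash-based near-duplicate clustering, a common technique for fuzzy deduplication of web text. This is an illustrative simplification using only the Python standard library, not the actual RedPajama-V2 pipeline; the function names, the shingle size, and the similarity threshold are all assumptions chosen for the example.

```python
import hashlib
from itertools import combinations

def shingles(text, k=5):
    # Character k-grams of a lowercased document.
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    # One minimum per salted hash function; matching minima between two
    # documents estimate their Jaccard similarity.
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup_clusters(docs, threshold=0.8):
    # Union-find over pairs whose estimated similarity exceeds the threshold;
    # each resulting cluster groups near-duplicate documents together.
    sigs = [minhash_signature(shingles(d)) for d in docs]
    parent = list(range(len(docs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(docs)), 2):
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank!",
    "Large language models need high-quality training data at scale.",
]
print(dedup_clusters(docs))  # the first two documents land in one cluster
```

At web scale, pipelines avoid the quadratic pairwise comparison by bucketing signatures with locality-sensitive hashing, but the clustering idea is the same: documents in one cluster are near-duplicates, and a training set typically keeps one representative per cluster.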