Together AI has released RedPajama-V2, an open dataset of 30 trillion tokens designed for training large language models. High-quality data is crucial to the success of open models such as Llama, Mistral, Falcon, MPT, and the original RedPajama. RedPajama-V2 emphasizes broad coverage of CommonCrawl: it ships the raw text together with quality annotations and deduplication clusters, so researchers can filter and weight the data to suit their own training recipes. The release is significant for AI research and applications, providing a foundation for building more capable language models and a resource that is expected to further advance the field.
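To make the idea of deduplication clusters concrete, here is a toy sketch of MinHash-based near-duplicate clustering, a common technique for fuzzy deduplication of web text. This is an illustrative simplification using only the Python standard library, not the actual RedPajama-V2 pipeline; the function names, the shingle size, and the similarity threshold are all assumptions chosen for the example.

```python
import hashlib
from itertools import combinations

def shingles(text, k=5):
    # Character k-grams of a lowercased document.
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    # One minimum per salted hash function; matching minima between two
    # documents estimate their Jaccard similarity.
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup_clusters(docs, threshold=0.8):
    # Union-find over pairs whose estimated similarity exceeds the threshold;
    # each resulting cluster groups near-duplicate documents together.
    sigs = [minhash_signature(shingles(d)) for d in docs]
    parent = list(range(len(docs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(docs)), 2):
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank!",
    "Large language models need high-quality training data at scale.",
]
print(dedup_clusters(docs))  # the first two documents land in one cluster
```

At web scale, pipelines avoid the quadratic pairwise comparison by bucketing signatures with locality-sensitive hashing, but the clustering idea is the same: documents in one cluster are near-duplicates, and a training set typically keeps one representative per cluster.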