Together AI has released RedPajama v2, a dataset of 30 trillion tokens intended for training large language models. The goal is to support LLM development by making a large, high-quality data resource openly available. The dataset is drawn from CommonCrawl and other public web data, and it ships with more than 40 quality annotations along with deduplication information. RedPajama v2 is only minimally processed: the original data is preserved so that model builders can apply their own filtering and processing downstream. The release gives developers and researchers more material to work with and is expected to further advance language model research.
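
To make the idea of downstream processing concrete, here is a minimal Python sketch of how a model builder might use per-document annotations to select training data. The record layout, the field names `word_count` and `is_duplicate`, and the `min_words` threshold are all hypothetical, chosen only for illustration; they are not the dataset's actual schema or any recommended setting.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical records: these field names are illustrative only and do not
# reflect the real RedPajama v2 schema.
SAMPLE_DOCS = [
    {"text": "A long, well-formed article about tokenizers.", "word_count": 120, "is_duplicate": False},
    {"text": "buy now", "word_count": 2, "is_duplicate": False},
    {"text": "A long, well-formed article about tokenizers.", "word_count": 120, "is_duplicate": True},
]


def filter_docs(docs: Iterable[Dict], min_words: int = 50) -> Iterator[Dict]:
    """Yield documents that are not flagged as duplicates and meet a length threshold."""
    for doc in docs:
        if doc["is_duplicate"]:
            continue  # skip documents marked as duplicates
        if doc["word_count"] < min_words:
            continue  # skip documents shorter than the (assumed) threshold
        yield doc


kept = list(filter_docs(SAMPLE_DOCS))
print(len(kept))
```

Because the raw data is preserved, different teams can apply different thresholds or combine different annotations without re-crawling or re-annotating the source.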