Nemotron-CC is a 6.3-trillion-token English pretraining dataset derived from Common Crawl. It combines model-based quality classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters to transform English Common Crawl into a long-horizon pretraining dataset: 4.4 trillion globally deduplicated original tokens plus 1.9 trillion synthetically generated tokens. The result is a better trade-off between benchmark accuracy and data quantity than prior Common Crawl derivatives, making the dataset well suited for training large language models.
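The classifier-ensembling idea can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: the max-score aggregation, the five-bucket split, and the function name `ensemble_quality_bucket` are all illustrative assumptions.

```python
# Hypothetical sketch of quality-classifier ensembling: each document gets
# scores in [0, 1] from several model-based classifiers, and the ensemble
# keeps the most favorable judgment by taking the maximum score.
def ensemble_quality_bucket(scores, num_buckets=5):
    """Map the best classifier score to a quality bucket 0..num_buckets-1."""
    best = max(scores)
    # Clamp to [0, 1) so a perfect score still lands in the top bucket.
    best = min(max(best, 0.0), 1.0 - 1e-9)
    return int(best * num_buckets)

docs = [
    {"text": "well-written technical article ...", "scores": [0.92, 0.75]},
    {"text": "navigation boilerplate ...",         "scores": [0.10, 0.05]},
]
buckets = [ensemble_quality_bucket(d["scores"]) for d in docs]
# buckets -> [4, 0]: the first document falls in the top quality bucket,
# the second in the bottom one.
```

Taking the maximum across classifiers means a document rejected by one classifier can still be rescued by another, which reduces the false negatives a single heuristic filter would produce.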