Nemotron-CC
Transforms Common Crawl into a refined long-horizon pre-training dataset.
Nemotron-CC is a 6.3-trillion-token pre-training dataset built from English Common Crawl. It combines an ensemble of classifiers and synthetic data rephrasing with reduced reliance on heuristic filters to turn the crawl into a dataset suited to long-horizon pre-training. Of the 6.3 trillion tokens, 4.4 trillion are globally deduplicated original tokens and 1.9 trillion are synthetically generated. The result is a better trade-off between accuracy and data quantity, which matters when training large language models over long token horizons.
Nemotron-CC Visits Over Time
Monthly Visits: 12,788
Bounce Rate: 34.46%
Pages per Visit: 2.2
Visit Duration: 00:01:49