Nemotron-CC

Transforms Common Crawl into a refined long-horizon pre-training dataset.

Nemotron-CC is a 6.3-trillion-token dataset built from English Common Crawl. It combines classifier ensembling and synthetic data rephrasing with reduced reliance on heuristic filters to produce a dataset suited to long-horizon pre-training. Of the 6.3 trillion tokens, 4.4 trillion are globally deduplicated original tokens and 1.9 trillion are synthetically generated. The dataset offers a better trade-off between accuracy and data volume, which matters when training large language models.
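As a quick check on the figures quoted in the description, the short Python sketch below works out how the total splits between original and synthetic tokens. Only the two token counts come from the description; the function and variable names are hypothetical and are not part of any Nemotron-CC tooling.

```python
# Token counts in trillions, as quoted in the description.
ORIGINAL_TOKENS_T = 4.4   # globally deduplicated original tokens
SYNTHETIC_TOKENS_T = 1.9  # synthetically generated tokens


def composition(original: float, synthetic: float) -> dict:
    """Return the total token count and each component's share of it."""
    total = original + synthetic
    return {
        "total": total,
        "original_share": original / total,
        "synthetic_share": synthetic / total,
    }


summary = composition(ORIGINAL_TOKENS_T, SYNTHETIC_TOKENS_T)
print(f"total: {summary['total']:.1f}T tokens")
print(f"original share: {summary['original_share']:.1%}")
print(f"synthetic share: {summary['synthetic_share']:.1%}")
```

The two components add up to the stated 6.3 trillion tokens, with roughly 70% original and 30% synthetic.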

Nemotron-CC Visit Over Time

Monthly Visits: 12788
Bounce Rate: 34.46%
Pages per Visit: 2.2
Visit Duration: 00:01:49
