DCLM-baseline
High-performance language model benchmark dataset
Common Product, Programming, Natural language processing, Language model
DCLM-baseline is a pretraining dataset for language model benchmarking, containing 4T tokens and 3B documents. It is curated from Common Crawl through a carefully planned sequence of data cleaning, filtering, and deduplication steps, and aims to demonstrate the importance of data curation in training efficient language models. The dataset is intended for research purposes only: it should not be used in production environments or for training domain-specific models, such as those for code and mathematics.
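To make the three curation stages named above concrete, here is a minimal toy sketch of cleaning, filtering, and deduplication in Python. It is not the DCLM pipeline, which is far more elaborate. The function names (`clean`, `keep`, `curate`), the word-count threshold, and the use of exact SHA-256 hashing are all assumptions chosen for illustration.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim both ends (toy cleaning step)."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Toy quality filter: keep documents with at least `min_words` words.

    The threshold is a hypothetical value, not one taken from DCLM.
    """
    return len(text.split()) >= min_words


def curate(docs):
    """Clean, filter, then drop exact duplicates of the cleaned text."""
    seen = set()
    curated = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        # Exact-match deduplication via a hash of the cleaned text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        curated.append(text)
    return curated
```

Given a small list of raw strings, `curate` returns the cleaned documents that pass the filter, with later exact duplicates dropped. Large-scale pipelines typically replace each of these steps with something more robust, so treat this only as a picture of the order of operations.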
DCLM-baseline Visit Over Time
Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32