DCLM-baseline

High-performance language model benchmark dataset

DCLM-baseline is a pretraining dataset for language model benchmarking, containing 4T tokens and 3B documents. It is curated from Common Crawl through a carefully planned sequence of data cleaning, filtering, and deduplication steps, and aims to demonstrate the importance of data curation for training efficient language models. The dataset is intended for research purposes only; it should not be used in production environments or for training domain-specific models, such as those for code and mathematics.
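
To make the cleaning, filtering, and deduplication stages concrete, here is a minimal toy sketch in Python. It is not the DCLM pipeline: the function names (`clean`, `keep`, `curate`), the five-word minimum, and the exact-match hashing are all assumptions chosen for illustration, and a real curation pipeline would likely use much richer heuristics.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Toy quality filter (assumed threshold): require at least min_words words."""
    return len(text.split()) >= min_words


def curate(docs):
    """Clean, filter, and exact-deduplicate an iterable of raw documents."""
    seen = set()  # hashes of documents already kept
    out = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue  # dropped by the filter
        # Case-insensitive exact-match deduplication via a content hash.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier document
        seen.add(digest)
        out.append(text)
    return out
```

Calling `curate` on a list of raw strings returns the cleaned documents that pass the filter, with later duplicates removed and the original order preserved.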

DCLM-baseline Visit Over Time

Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32
