DCLM

Comprehensive framework for building and training large language models

Premium, New Product, Programming, Large language models, Dataset construction
DataComp-LM (DCLM) is a framework for building and training large language models (LLMs). It provides standardized corpora, efficient pre-training recipes built on the open_lm framework, and a suite of more than 50 evaluations. DCLM lets researchers experiment with different dataset construction strategies across compute scales, with models ranging from 411M to 7B parameters. By optimizing dataset design, DCLM significantly improves model performance, and it has already enabled the creation of several high-quality datasets that outperform all open datasets across these scales.

DCLM Visit Over Time

Monthly Visits: 499,904,316
Bounce Rate: 37.31%
Pages per Visit: 5.8
Visit Duration: 00:06:52


DCLM Alternatives