DCLM
Comprehensive framework for building and training large language models
Premium | New Product | Programming | Large language models | Dataset construction
DataComp-LM (DCLM) is a comprehensive framework for building and training large language models (LLMs). It provides a standardized corpus, efficient pre-training recipes built on the open_lm framework, and a suite of over 50 evaluations. DCLM lets researchers experiment with different dataset construction strategies across compute scales, from 411M to 7B parameter models. Its focus is improving model performance through better dataset design, and it has already enabled the creation of several high-quality datasets that perform well across scales and outperform all open datasets.
DCLM Visits Over Time
Monthly Visits: 515,580,771
Bounce Rate: 37.20%
Pages per Visit: 5.8
Visit Duration: 00:06:42