DataComp-LM (DCLM) is a comprehensive framework for building and training large language models (LLMs). It provides standardized corpora, efficient pre-training recipes built on the open_lm framework, and a suite of over 50 evaluation tasks. DCLM lets researchers experiment with dataset construction strategies across compute scales, from 411M- to 7B-parameter models. Through optimized dataset design, DCLM significantly improves model performance and has already produced multiple high-quality datasets that outperform all open datasets at their respective scales.
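The dataset construction strategies mentioned above typically start with heuristic quality filtering of raw documents. The sketch below is purely illustrative: the function name, thresholds, and rules are hypothetical examples of such heuristics, not DCLM's actual pipeline or values.

```python
def heuristic_filter(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Keep a document only if it passes simple quality heuristics.

    These thresholds are illustrative placeholders, not DCLM's real settings.
    """
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Fraction of non-alphanumeric, non-whitespace characters
    # (a rough proxy for markup residue or extraction debris).
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False
    return True

corpus = [
    "word " * 60,                     # long, clean text: kept
    "short text",                     # too few words: dropped
    "{}[]<>" * 40 + "word " * 60,     # symbol-heavy: dropped
]
kept = [d for d in corpus if heuristic_filter(d)]
```

A real pipeline would chain many such filters (plus deduplication and model-based scoring) and compare the resulting datasets by training models at a fixed compute scale.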