The Allen Institute for Artificial Intelligence (AI2) in the United States recently released Dolma, an open-source dataset of 3 trillion tokens. Dolma will serve as the foundation for the Open Language Model (OLMo) that AI2 is developing, which is scheduled for release in early 2024. The data is drawn from a wide range of sources, including web content, academic publications, code, and books. Dolma is currently the largest publicly available dataset of its kind.