AI2 (the Allen Institute for AI) recently released Dolma, an open-source dataset of 3 trillion tokens. Dolma will serve as the foundation for OLMo, the Open Language Model that AI2 is developing and expects to launch in early 2024. The dataset draws on a wide range of sources, including web content, academic publications, code, and books, and it is the largest publicly available dataset of its kind.