AI2 Releases Dolma, an Open-Source Dataset of 3 Trillion Tokens for Large Language Models
站长之家
The Allen Institute for Artificial Intelligence (AI2) in the United States recently released Dolma, an open-source dataset containing 3 trillion tokens. Dolma will serve as the foundation for the Open Language Model (OLMo) that AI2 is developing, which is planned for release in early 2024. Its data is drawn from a wide range of sources, including web content, academic publications, code, and books. Dolma is currently the largest publicly available dataset of its kind.
© Copyright AIbase Base 2024, Click to View Source - https://www.aibase.com/news/772