At the 2024 Beijing Cultural Forum, the Beijing Academy of Artificial Intelligence (BAAI) announced the official release of CCI3.0 (Chinese Corpora Internet), the new generation of its Chinese internet corpus, to further promote data co-construction and sharing. CCI3.0 comprises a 1000GB dataset and a 498GB high-quality subset, CCI3.0-HQ. It is the next major update after the initial open-source release of CCI1.0 in November 2023 and the release of CCI2.0 in April 2024.

Since their first open-source release, the CCI series datasets have been downloaded more than 40,000 times and have served more than 500 enterprises and institutions in large-model R&D, supporting the development of China's artificial intelligence industry ecosystem.


Features of CCI3.0 include:

  1. Expanded scale, broad sources: CCI3.0 includes over 268 million web pages, covering news, social media, blogs, and other fields. Compared to CCI2.0, the data scale of CCI3.0 has nearly doubled, with data sources increasing to over 20, significantly enhancing data coverage and representativeness.

  2. Fine-grained annotation, empowering applications: CCI3.0 classifies and labels the raw data along more than 10 dimensions, including grammar, syntax, and educational level, in order to select high-value data. Additionally, CCI3.0-HQ is a high-quality subset derived from samples automatically labeled by a 70B model and further refined with a small-scale quality model, to better meet the needs of different industries and application scenarios.

  3. Significant results, better understanding of Chinese: In comparative experiments where a 500M model was trained from scratch on 100B data, CCI3.0 outperformed other datasets in both standalone Chinese corpus training and mixed Chinese-English corpus training, with even more significant results for CCI3.0-HQ.
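The score-then-filter idea behind item 2 can be illustrated with a minimal sketch. The release does not describe the actual quality model, its features, or its threshold, so everything here is an assumption for illustration: `toy_quality_score` is a hypothetical stand-in for a trained quality model (it is not BAAI's classifier), and `filter_high_quality` and the `0.5` threshold are likewise invented names and values.

```python
from typing import Callable, Iterable, List


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a trained quality model (not BAAI's classifier).

    Scores text by its share of distinct characters, so highly repetitive
    text scores near 0 and varied text scores near 1.
    """
    if not text:
        return 0.0
    return len(set(text)) / len(text)


def filter_high_quality(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only the documents whose quality score is at least `threshold`."""
    return [doc for doc in docs if score(doc) >= threshold]


corpus = [
    "买买买买买买买买",  # repetitive, spam-like text
    "北京智源人工智能研究院发布了新的中文语料库",  # varied sentence
    "",  # empty page
]

# The threshold is an arbitrary illustrative value.
high_quality = filter_high_quality(corpus, toy_quality_score, threshold=0.5)
print(len(high_quality))
```

In a real pipeline the scoring function would be a learned model rather than a character heuristic; passing it in as a callable keeps the filtering step independent of how the score is produced.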

BAAI stated that it will continue to work with the industry ecosystem to promote the co-construction and sharing of corpora, build large-scale, high-quality, high-knowledge-density Chinese datasets, and contribute further to the development of China's artificial intelligence industry.

CCI3.0 Download Links

Flopsera: https://open.flopsera.com/flopsera-open/data-details/BAAI-CCI3

Hugging Face: https://huggingface.co/datasets/BAAI/CCI3-Data

Datahub: https://data.baai.ac.cn/details/BAAI-CCI3