Chinese Internet Corpus Resource Platform
Providing high-quality Chinese language corpus resources to assist in the pre-training of large AI models.
The Chinese Internet Corpus Resource Platform is a professional website hosted by the China Cybersecurity Association. It aims to provide high-quality, compliant Chinese corpus resources for pre-training large AI models. The platform draws on the combined strengths of enterprises, universities, and research institutions and, through a 'co-build and share' mechanism, has assembled several high-quality corpora, including Chinese Internet Basic Corpus 2.0, People's Daily Mainstream Value Dataset, and National Library Qing and Ming Literature Corpus. Each corpus undergoes strict data source validation, format cleansing, language filtering, deduplication, content filtering, and privacy filtering to ensure the legality, authenticity, accuracy, and objectivity of the data. The platform's resources are intended to support national AI technology innovation and industrial development, help large models better understand and generate Chinese content, and strengthen their knowledge and value alignment.
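The six processing stages named above can be illustrated with a minimal Python sketch. This is not the platform's actual pipeline, which is not described in detail here. The allowed-source list, regular expressions, blocklist, threshold, and all function names are hypothetical placeholders chosen for illustration, and exact-hash deduplication stands in for whatever deduplication method the platform really uses.

```python
import hashlib
import re

# All names and rules below are illustrative assumptions, not the platform's own.
ALLOWED_SOURCES = {"example.org"}            # hypothetical vetted sources
BLOCKLIST = {"spamword"}                     # hypothetical content-filter terms
TAG_RE = re.compile(r"<[^>]+>")              # leftover HTML tags
WS_RE = re.compile(r"\s+")
CJK_RE = re.compile(r"[\u4e00-\u9fff]")      # common CJK ideographs
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # shape of an 11-digit mobile number
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def clean_format(text):
    """Format cleansing: strip tags and collapse whitespace."""
    return WS_RE.sub(" ", TAG_RE.sub(" ", text)).strip()


def chinese_ratio(text):
    """Share of non-space characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_RE.match(c)) / len(chars)


def mask_private(text):
    """Privacy filtering: mask phone numbers and e-mail addresses."""
    return EMAIL_RE.sub("[EMAIL]", PHONE_RE.sub("[PHONE]", text))


def build_corpus(records, min_zh_ratio=0.5):
    """Run each record through the six stages and return the kept texts."""
    seen = set()
    kept = []
    for record in records:
        if record["source"] not in ALLOWED_SOURCES:      # data source validation
            continue
        text = clean_format(record["text"])              # format cleansing
        if chinese_ratio(text) < min_zh_ratio:           # language filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                               # exact deduplication
            continue
        seen.add(digest)
        if any(word in text for word in BLOCKLIST):      # content filtering
            continue
        kept.append(mask_private(text))                  # privacy filtering
    return kept
```

Calling `build_corpus` on a list of `{"source": ..., "text": ...}` dictionaries returns only the texts that pass every stage, with phone numbers and e-mail addresses masked. A production pipeline would likely need near-duplicate detection and model-based language and content classifiers rather than these simple rules.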