Shanghai AI Lab Releases Open-Source 'Shusheng・Wanjuan' 1.0 Multi-Modal Pre-training Corpus
站长之家
The Shanghai AI Lab, in collaboration with the Corpus Data Alliance, has released the "Shusheng・Wanjuan" 1.0 multi-modal pre-training corpus, which comprises text, image-text, and video datasets. The open-source corpus totals more than 2 TB and has undergone fine-grained cleaning and deduplication. It draws on diverse sources, is carefully processed, and is designed to be easy and efficient to use. The release is expected to promote the application of and innovation in large models, and to lower the barrier to entry for large-model technology.
© Copyright AIbase Base 2024, Click to View Source - https://www.aibase.com/news/497