Shanghai AI Lab Open-Sources 'Shusheng・Wanjuan' 1.0 Multi-Modal Pre-training Corpus

站长之家

Published in AI News · 1 min read · Aug 15, 2023

The Shanghai AI Lab, in collaboration with the Corpus Data Alliance, has released the "Shusheng・Wanjuan" 1.0 multi-modal pre-training corpus, which comprises text, image-text, and video datasets. The open-source corpus exceeds 2TB in total and has undergone fine-grained cleaning and deduplication. It combines diverse data types, is meticulously processed, and is designed to be easy and efficient to use. The release is expected to promote the application and innovation of large models and to lower the barrier to entry for large model technology.


This article is from AIbase Daily


