MINT-1T
A multimodal dataset comprising one trillion tokens and 3.4 billion images.
Premium, New Product, Open Source, Multimodal, Dataset
MINT-1T is an open-source multimodal dataset released by Salesforce AI. It contains one trillion text tokens and 3.4 billion images, roughly ten times the scale of existing open-source multimodal datasets. In addition to HTML documents, it draws on PDF documents and ArXiv papers, which broadens the range of content it covers. Building MINT-1T involved multiple data collection, processing, and filtering steps intended to keep the data both high in quality and diverse.
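The description above mentions collection, processing, and filtering steps without detailing them. As a rough illustration only, the sketch below shows what a filter-then-deduplicate pass over documents from several sources might look like. The `Document` type, the `keep` thresholds, and the hash-based deduplication are all hypothetical choices made for this example; they are not taken from the MINT-1T pipeline itself.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Document:
    """Hypothetical interleaved document: text segments plus image URLs."""
    source: str  # e.g. "html", "pdf", "arxiv"
    texts: list
    image_urls: list = field(default_factory=list)


def keep(doc, min_chars=20, max_images=30):
    """Illustrative quality filter (assumed thresholds, not MINT-1T's).

    Keeps documents with enough text and between 1 and max_images images.
    """
    n_chars = sum(len(t) for t in doc.texts)
    return n_chars >= min_chars and 1 <= len(doc.image_urls) <= max_images


def fingerprint(doc):
    """SHA-256 of lowercased, whitespace-normalized text."""
    normalized = " ".join(" ".join(doc.texts).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build(docs):
    """Apply the filter, then drop exact text duplicates, keeping input order."""
    seen = set()
    kept = []
    for doc in docs:
        if not keep(doc):
            continue
        fp = fingerprint(doc)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

Passing a mixed list of `Document` objects to `build` returns only those that pass the filter and have not already appeared with the same normalized text. A production pipeline at this scale would likely need near-duplicate detection and distributed processing, which this sketch does not attempt.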
MINT-1T Visits Over Time
Monthly Visits: 33,892
Bounce Rate: 54.66%
Pages per Visit: 1.6
Visit Duration: 00:02:04