MINT-1T

A multimodal dataset comprising one trillion text tokens and 3.4 billion images.

MINT-1T is a multimodal dataset open-sourced by Salesforce AI. It contains one trillion text tokens and 3.4 billion images, about ten times the scale of existing open-source datasets. In addition to HTML documents, it draws on PDF documents and ArXiv papers, which broadens the range of sources. Building MINT-1T involves several stages of data collection, processing, and filtering aimed at keeping the data both high-quality and diverse.

MINT-1T Visit Over Time

Monthly Visits: 13,062
Bounce Rate: 80.04%
Pages per Visit: 1.3
Visit Duration: 00:00:48
