MINT-1T

A multimodal dataset comprising one trillion tokens and 3.4 billion images.

MINT-1T is an open-source multimodal dataset released by Salesforce AI. It contains one trillion text tokens and 3.4 billion images, roughly ten times the scale of previously available open-source multimodal interleaved datasets. In addition to HTML documents, it draws on PDF documents and ArXiv papers, which broadens the range of content it covers. Building MINT-1T involved multiple data collection, processing, and filtering steps intended to keep the data high-quality and diverse.

MINT-1T Visit Over Time

Monthly Visits: 8724
Bounce Rate: 53.42%
Pages per Visit: 1.4
Visit Duration: 00:02:06
