FineWeb

High-quality English webpage dataset

Tags: Common Product, Programming, Natural Language Processing, Dataset
The FineWeb dataset contains more than 15 trillion tokens of cleaned and deduplicated English web text sourced from CommonCrawl. It is designed for pre-training large language models and aims to advance the development of open-source models. The data has been processed and filtered for quality, which also makes it suitable for a variety of natural language processing tasks.
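
For readers who want to inspect the data, the following is a minimal sketch of streaming a few documents rather than downloading the full corpus. It assumes the Hugging Face `datasets` library is installed and that the dataset is published on the Hugging Face Hub as `HuggingFaceFW/fineweb` with a `sample-10BT` configuration and a `text` field per record; none of these details appear in the listing above, so treat them as assumptions. The helper names `first_texts` and `stream_fineweb_sample` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def first_texts(records: Iterable[dict], n: int) -> Iterator[str]:
    """Yield the "text" field of the first n records (hypothetical helper)."""
    for record in islice(records, n):
        yield record["text"]


def stream_fineweb_sample():
    """Open a streaming view of a FineWeb sample (requires network access)."""
    # Imported lazily so first_texts works even without the library installed.
    from datasets import load_dataset

    # Assumption: repository id and config name as listed on the Hugging Face Hub.
    return load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # iterate lazily instead of downloading everything
    )
```

With network access, `for text in first_texts(stream_fineweb_sample(), 3): print(text[:200])` would print the start of three documents. Streaming is used here because the full dataset is far too large to download casually.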
FineWeb Visit Over Time

Monthly Visits: 20,899,836
Bounce Rate: 46.04%
Pages per Visit: 5.2
Visit Duration: 00:04:57

FineWeb Alternatives