Crawl4LLM

An efficient web crawler for LLM pre-training, focused on collecting high-quality web data.

Tags: Common Product, Programming, LLM, Web Crawler
Crawl4LLM is an open-source web crawler designed to collect data efficiently for pre-training Large Language Models (LLMs). It helps researchers and developers build high-quality training corpora by selecting which web pages to crawl. The tool supports several document scoring methods, and its crawling strategy can be adjusted through configuration to suit different pre-training needs. The project is written in Python and is intended to be scalable and easy to use, for both academic research and industrial applications.
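To illustrate how document scoring could steer page selection, here is a minimal Python sketch of a score-ordered crawl frontier. It is not Crawl4LLM's actual code or API: the names `length_score`, `crawl_by_score`, `toy_fetch`, and `TOY_WEB` are hypothetical, the scorer is a toy word-count heuristic, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "a": ("short page", ["b", "c"]),
    "b": ("word " * 80, ["d"]),
    "c": ("word " * 10, []),
    "d": ("word " * 50, []),
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    """Stand-in for a real fetcher: returns (text, links) for a URL."""
    return TOY_WEB[url]


def length_score(text: str) -> float:
    """Toy scorer (an assumption): more words score higher, capped at 1.0."""
    return min(len(text.split()) / 100.0, 1.0)


def crawl_by_score(
    seeds: Iterable[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    score: Callable[[str], float],
    max_docs: int,
) -> List[str]:
    """Visit up to max_docs pages, always taking the highest-scoring one next."""
    frontier: List[Tuple[float, str]] = []  # min-heap of (-score, url)
    seen = set()

    def enqueue(url: str) -> None:
        if url not in seen:
            seen.add(url)
            text, _ = fetch(url)
            heapq.heappush(frontier, (-score(text), url))

    for url in seeds:
        enqueue(url)

    visited: List[str] = []
    while frontier and len(visited) < max_docs:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        _, links = fetch(url)
        for link in links:
            enqueue(link)
    return visited


print(crawl_by_score(["a"], toy_fetch, length_score, max_docs=3))
```

Swapping `length_score` for another scoring function changes the crawl order without touching the loop, which is the kind of configurable strategy the description refers to.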

Crawl4LLM Visit Over Time

Monthly Visits: 502571820
Bounce Rate: 37.10%
Pages per Visit: 5.9
Visit Duration: 00:06:29

Crawl4LLM Alternatives