The open-source web crawler project Crawl4AI has released version 0.4.1, which brings several major updates. The most notable is a new Text-Only Mode, which skips loading resources that text extraction does not need and speeds up crawling by a factor of 3 to 4.
"The core of this update is to make the crawler faster and smarter," said the project maintainer. "Especially when dealing with modern web pages, the new version shows significant advantages."
Text-Only Mode works by disabling image loading, JavaScript execution, and GPU processing, which is where the speed gain comes from. Users enable it by setting the text_only=True parameter. It is best suited to scenarios where only the text content of a page is needed.
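As a rough illustration, enabling the mode might look like the sketch below. The text_only name comes from the release description above. Passing it to the AsyncWebCrawler constructor is an assumption, and the place where the flag is accepted may differ between versions. The helper names text_only_options and crawl_text are hypothetical.

```python
def text_only_options(enabled: bool = True) -> dict:
    """Build keyword arguments for a text-only crawl.

    `text_only` is the parameter name given in the release description.
    """
    return {"text_only": enabled}


async def crawl_text(url: str) -> str:
    # Imported lazily so text_only_options() works without Crawl4AI installed.
    from crawl4ai import AsyncWebCrawler

    # Assumption: the crawler constructor accepts text_only as a keyword argument.
    async with AsyncWebCrawler(**text_only_options()) as crawler:
        result = await crawler.arun(url=url)
        # The extracted text is exposed as Markdown on the result object.
        return str(result.markdown)
```

To run it, call asyncio.run(crawl_text("https://example.com")) after importing asyncio. The URL here is only a placeholder.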
Version 0.4.1 also improves content loading for modern web pages. It handles lazy-loaded content better and introduces the wait_for_images parameter, which makes the crawler wait until images are fully loaded. In addition, a new dynamic viewport adjustment option (adjust_viewport_to_content) adjusts the viewport to the page content so that dynamically rendered content is captured.
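A minimal sketch of how these two options might be combined is shown below. The parameter names wait_for_images and adjust_viewport_to_content are taken from the description above. Passing them to arun() as per-crawl keyword arguments is an assumption, and the helper names lazy_load_options and crawl_lazy_page are hypothetical.

```python
def lazy_load_options(wait_for_images: bool = True,
                      adjust_viewport: bool = True) -> dict:
    """Collect the lazy-loading options named in the release description."""
    return {
        "wait_for_images": wait_for_images,
        "adjust_viewport_to_content": adjust_viewport,
    }


async def crawl_lazy_page(url: str) -> str:
    # Imported lazily so lazy_load_options() works without Crawl4AI installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        # Assumption: arun() accepts these options as keyword arguments.
        result = await crawler.arun(url=url, **lazy_load_options())
        return result.html
```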
To better handle infinite scrolling and other dynamically loaded pages, Crawl4AI has introduced a full-page scanning feature. Users can enable this feature by setting scan_full_page=True, along with the scroll_delay parameter to precisely control the scanning pace, simulating real user browsing behavior.
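The idea behind full-page scanning can be sketched without a browser. The code below is not Crawl4AI's implementation. It is a simplified model, with hypothetical names (FakePage, scroll_through_page), of scrolling one viewport at a time, pausing for scroll_delay seconds, and re-measuring the page height because new content may have loaded in the meantime.

```python
import time
from typing import Callable


class FakePage:
    """Stand-in for an infinite-scroll page: it grows when the bottom is reached."""

    def __init__(self, initial_height: int = 1000, max_height: int = 3000,
                 chunk: int = 1000) -> None:
        self.height = initial_height
        self.max_height = max_height
        self.chunk = chunk

    def scroll_to(self, y: int) -> None:
        # Reaching the bottom triggers loading of another chunk of content.
        if y >= self.height and self.height < self.max_height:
            self.height = min(self.height + self.chunk, self.max_height)


def scroll_through_page(get_height: Callable[[], int],
                        scroll_to: Callable[[int], None],
                        viewport_height: int,
                        scroll_delay: float = 0.2,
                        max_steps: int = 1000) -> int:
    """Scroll down one viewport at a time until the page stops growing.

    Returns the number of scroll steps performed.
    """
    position = 0
    steps = 0
    # The height is re-read on every pass because the page may have grown.
    while position < get_height() and steps < max_steps:
        position = min(position + viewport_height, get_height())
        scroll_to(position)
        time.sleep(scroll_delay)  # pause so lazy content has time to load
        steps += 1
    return steps


page = FakePage()
steps = scroll_through_page(lambda: page.height, page.scroll_to,
                            viewport_height=500, scroll_delay=0)
```

A larger scroll_delay gives slow pages more time to load new content at the cost of a longer crawl. The max_steps guard keeps the loop from running forever on a page that never stops growing.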
In terms of performance optimization, the new version has also improved session management. By implementing a session reuse mechanism, it avoids the overhead of repeatedly creating browser tabs, significantly reducing memory usage and enhancing overall efficiency.
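Session reuse can be pictured as a cache of open tabs keyed by a session identifier. The sketch below is a conceptual model only, not the library's code. SessionPool, page_factory, and the session names are hypothetical.

```python
from typing import Any, Callable, Dict


class SessionPool:
    """Hand out one tab per session id and reuse it on later requests."""

    def __init__(self, page_factory: Callable[[], Any]) -> None:
        self._factory = page_factory
        self._pages: Dict[str, Any] = {}
        self.created = 0  # number of tabs actually opened

    def get(self, session_id: str) -> Any:
        # Open a new tab only the first time a session id is seen.
        if session_id not in self._pages:
            self._pages[session_id] = self._factory()
            self.created += 1
        return self._pages[session_id]

    def close(self, session_id: str) -> None:
        # Drop the tab so its memory can be reclaimed.
        self._pages.pop(session_id, None)


pool = SessionPool(page_factory=object)  # object() stands in for a browser tab
first = pool.get("news-site")
second = pool.get("news-site")   # same session: the existing tab is reused
other = pool.get("blog-site")    # new session: a second tab is opened
```

In this model, three requests open only two tabs. That is the saving the release describes, shown at a very small scale.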
This update marks an important step for Crawl4AI in the field of web data collection, providing developers with a more efficient and reliable crawling tool.