Jina AI has introduced two compact language models designed specifically to transform raw HTML into clean, tidy Markdown, sparing users much of the drudgery of web data processing.
The models, collectively called Reader-LM, stand out for how quickly and efficiently they convert web content into Markdown files.
Using them means no longer relying on complex rule sets or laborious regular expressions: the models intelligently strip clutter such as ads, scripts, and navigation bars from web pages and produce well-organized Markdown.
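To see why rule-based cleanup is brittle, here is a toy, purely illustrative converter of the kind Reader-LM is meant to replace (the function name and the specific rules are hypothetical, not from Jina AI):

```python
import re

def naive_html_to_md(html: str) -> str:
    # Hypothetical hand-rolled pipeline: each rule handles exactly one tag
    # pattern, and anything it did not anticipate (attributes, nesting,
    # malformed HTML) slips through -- the brittleness described above.
    html = re.sub(r"<script.*?</script>", "", html, flags=re.S)  # drop scripts
    html = re.sub(r"<nav.*?</nav>", "", html, flags=re.S)        # drop navigation bars
    html = re.sub(r"<h1>(.*?)</h1>", r"# \1\n", html)            # h1 -> Markdown heading
    html = re.sub(r"<p>(.*?)</p>", r"\1\n", html)                # unwrap paragraphs
    html = re.sub(r"<[^>]+>", "", html)                          # strip any leftover tags
    return html.strip()
```

On `<h1>Title</h1><p>Hi</p>` this produces the expected heading, but an `<h1 class="x">` with an attribute silently falls through the heading rule and loses its structure, which is exactly the failure mode an end-to-end model avoids.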
Reader-LM comes in two sizes, Reader-LM-0.5B and Reader-LM-1.5B (roughly 0.5 billion and 1.5 billion parameters). Though small, both have been optimized for the single task of HTML-to-Markdown conversion, with results that are surprisingly good and outperform many large language models.
Thanks to their compact, efficient design, the models run well in resource-constrained environments. Notably, Reader-LM is multilingual and supports a context length of up to 256K tokens, so even large, complex HTML files can be handled with ease.
Unlike traditional methods that rely on regular expressions or manual setups, Reader-LM provides an end-to-end solution that automatically cleans HTML data and extracts key information.
In comparative tests against large models such as GPT-4 and Gemini, Reader-LM performed strongly, particularly in structure retention and Markdown syntax usage. Reader-LM-1.5B stands out across metrics, with a ROUGE-L score of 0.72, indicating high fidelity in the generated content, and a markedly lower error rate than comparable approaches.
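For context on that 0.72 figure, ROUGE-L scores a candidate text against a reference by the length of their longest common subsequence of tokens. A minimal sketch of the standard F-measure form (whitespace tokenization is a simplifying assumption; the evaluation in the announcement may tokenize differently):

```python
def lcs_len(a: list, b: list) -> int:
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # ROUGE-L F-measure: harmonic mean of LCS-based precision and recall.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the generated Markdown reproduces the reference token-for-token in order, so 0.72 indicates that most of the reference content and its ordering are preserved.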
Thanks to its compact design, Reader-LM has a light hardware footprint; the 0.5B model in particular runs smoothly in low-spec environments such as Google Colab. Despite its small size, Reader-LM retains strong long-context processing, handling large and complex web content without compromising performance.
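Since the models are published as ordinary causal LMs, a minimal usage sketch with Hugging Face `transformers` might look like the following. The checkpoint name matches the released 0.5B model; the generation settings (greedy decoding, a mild repetition penalty) and the helper names are illustrative assumptions, not the official recipe:

```python
def as_chat(html: str) -> list:
    # Reader-LM takes the raw HTML as a single user message.
    return [{"role": "user", "content": html}]

def html_to_markdown(html: str, checkpoint: str = "jinaai/reader-lm-0.5b") -> str:
    # Imported lazily so the pure helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    prompt = tokenizer.apply_chat_template(
        as_chat(html), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False gives deterministic (greedy) output; the repetition
    # penalty value is an assumption, tuned here only for illustration.
    outputs = model.generate(
        **inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08
    )
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

The whole cleaned page comes back as Markdown text, with no site-specific rules to maintain.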
For training, Reader-LM uses a multi-stage process focused on extracting Markdown content from raw, noisy HTML.
Training pairs a large corpus of real web pages with synthetic data to ensure efficiency and accuracy, and a carefully designed two-stage regimen progressively strengthens the models' handling of complex HTML files while avoiding the problem of repetitive generation.
Official introduction: https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/