At the Scientific Frontier Main Forum of the 2024 World Artificial Intelligence Conference (WAIC), the OpenDataLab team of the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) unveiled a new intelligent data extraction tool called MinerU. The tool aims to streamline AI data processing workflows by helping researchers extract high-quality data from large document collections.
MinerU is a versatile, open-source tool for document and web data extraction. It converts multimodal PDF documents, including those containing images, tables, and formulas, into clean, analyzable Markdown. It also quickly parses web pages cluttered with ads and other distractions and extracts their main content, and it supports batch conversion of formats such as epub, mobi, and docx into Markdown.
MinerU consists of two main components: Magic-PDF and Magic-Doc. Magic-PDF focuses on PDF document extraction: it converts PDFs into Markdown, quickly identifies layout elements, and automatically removes non-body content (such as headers, footers, and page numbers) while preserving the original document's structure and format. Magic-Doc handles web and ebook extraction. It supports common web content types such as articles, forums, music, and videos, as well as ebook format conversion.
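As a rough illustration of this division of labor, the sketch below routes input files to one of the two components by file extension. The names pick_component and group_by_component, the extension table, and the choice to send docx and html files to Magic-Doc are assumptions made for illustration. They are not taken from the MinerU codebase.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional

# Hypothetical mapping for illustration only; the real tools may accept
# a different set of formats or split them differently.
COMPONENT_BY_SUFFIX = {
    ".pdf": "Magic-PDF",
    ".epub": "Magic-Doc",
    ".mobi": "Magic-Doc",
    ".docx": "Magic-Doc",  # assumption: the article does not say which component handles docx
    ".html": "Magic-Doc",  # assumption: saved web pages
}


def pick_component(path: str) -> Optional[str]:
    """Return the component name for a file, or None if the suffix is unknown."""
    return COMPONENT_BY_SUFFIX.get(Path(path).suffix.lower())


def group_by_component(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group a batch of files by the component that would process them."""
    groups: Dict[str, List[str]] = {}
    for p in paths:
        component = pick_component(p) or "unsupported"
        groups.setdefault(component, []).append(p)
    return groups
```

Grouping a batch this way would let each component process its own files in one pass, with unknown formats set aside instead of failing the whole run.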
Technically, MinerU's PDF extraction process has four stages: PDF classification and preprocessing, model parsing, pipeline processing, and quality inspection of the extraction results. It relies on a series of models, including LayoutLMv3 for layout detection, YOLOv8 for formula detection, UniMERNet for formula recognition, and PaddleOCR for text recognition, to achieve high-quality document data extraction.
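The four stages can be pictured with a toy sketch like the one below. It is not MinerU's implementation: the Block type, the function names, the characters-per-page threshold used to tell text PDFs from scanned ones, and the rule of dropping headers and footers are all hypothetical stand-ins, and the model parsing stage is a placeholder that simply passes blocks through.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Block:
    """A hypothetical layout element found on a page."""
    kind: str  # e.g. "title", "text", "formula", "header", "footer"
    text: str
    y: float   # vertical position; smaller means higher on the page


def classify_pdf(extractable_chars: int, page_count: int,
                 min_chars_per_page: int = 50) -> str:
    """Stage 1 (assumed rule): label a PDF as text-based or scanned."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    dense = extractable_chars / page_count >= min_chars_per_page
    return "text" if dense else "scanned"


def parse_with_models(raw_blocks: Iterable[Block]) -> List[Block]:
    """Stage 2 placeholder: a real system would run layout, formula,
    and OCR models here to produce the blocks."""
    return list(raw_blocks)


def run_pipeline(blocks: List[Block]) -> str:
    """Stage 3: drop page furniture, order blocks top to bottom,
    and render them as Markdown."""
    kept = sorted((b for b in blocks if b.kind not in {"header", "footer"}),
                  key=lambda b: b.y)
    parts = []
    for b in kept:
        if b.kind == "title":
            parts.append(f"# {b.text}")
        elif b.kind == "formula":
            parts.append(f"$$\n{b.text}\n$$")
        else:
            parts.append(b.text)
    return "\n\n".join(parts)


def quality_check(markdown: str, min_chars: int = 1) -> bool:
    """Stage 4 (assumed rule): reject output that is effectively empty."""
    return len(markdown.strip()) >= min_chars
```

For example, passing a footer, a body paragraph, and a title through parse_with_models and run_pipeline would yield Markdown with the title first, the paragraph second, and the footer removed. quality_check then acts as a final gate before the result is accepted.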
The release of MinerU gives AI researchers a powerful data processing tool and is a further step in upgrading the toolchain for large-model research and applications.
ModelScope Community Online Demo Link:
https://modelscope.cn/studios/OpenDataLab/MinerU
Open Source Code Link:
https://github.com/opendatalab/MinerU/
MinerU Open Source Model (PDF-Extract-Kit):
https://modelscope.cn/models/OpenDataLab/PDF-Extract-Kit