Shanghai Artificial Intelligence Laboratory Releases Open Source Intelligent Data Extraction Tool - MinerU

AIbase基地

Published in AI News · 3 min read · Sep 3, 2024


At the Scientific Frontier Main Forum of WAIC 2024, the OpenDataLab team of the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) unveiled MinerU, a new intelligent data extraction tool. The tool aims to streamline AI data processing workflows and help AI researchers extract high-quality data from large volumes of documents.

MinerU is a versatile, open-source tool for document and web data extraction. It converts PDF documents that mix text with images, tables, and formulas into clean, analyzable Markdown. It also quickly parses and extracts the main content from web pages cluttered with ads and other distractions, and supports batch conversion of formats such as epub, mobi, and docx into Markdown.


MinerU consists of two main components: Magic-PDF and Magic-Doc. Magic-PDF focuses on PDF extraction: it converts PDFs into Markdown, quickly identifies layout elements, and automatically removes non-text content while preserving the original document's structure and formatting. Magic-Doc handles web and ebook extraction: it supports common web content types such as articles, forums, music, and videos, as well as ebook format conversion.
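The division of labor between the two components can be pictured as a small routing function. This is only an illustrative sketch: `pick_component` and the extension sets are hypothetical names, not part of MinerU's actual interface. The article does not say which component handles docx, so the sketch deliberately leaves that format unrouted.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical routing rules derived from the article's description only.
PDF_EXTENSIONS = {".pdf"}
EBOOK_EXTENSIONS = {".epub", ".mobi"}


def pick_component(source: str) -> str:
    """Return the name of the component the article associates with a source."""
    # Web pages are described as Magic-Doc's responsibility.
    if urlparse(source).scheme in ("http", "https"):
        return "Magic-Doc"

    suffix = Path(source).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "Magic-PDF"
    if suffix in EBOOK_EXTENSIONS:
        return "Magic-Doc"

    # Anything else (including docx) is not assigned by the article.
    raise ValueError(f"no component assigned for: {source}")


print(pick_component("paper.pdf"))
print(pick_component("https://example.com/post"))
```

Keeping the rules in plain sets makes it easy to adjust the mapping once the real behavior of each component is confirmed against the project's documentation.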

Technically, MinerU's PDF extraction process has four stages: PDF document classification as a preprocessing step, model parsing, pipeline processing, and quality inspection of the extraction results. It uses a series of models, including LayoutLMv3, YOLOv8, UniMERNet, and PaddleOCR, to achieve high-quality document data extraction.
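The four stages listed above can be sketched as a chain of functions that pass a shared job object along. Everything here is a stand-in under stated assumptions: `PdfJob`, the stage function names, and the placeholder blocks are hypothetical, and no real model (LayoutLMv3, YOLOv8, UniMERNet, PaddleOCR) is invoked. The sketch shows only the order of the stages, not how MinerU implements them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PdfJob:
    """State carried through the four stages named in the article."""
    path: str
    doc_type: str = "unknown"
    blocks: List[Dict[str, str]] = field(default_factory=list)
    markdown: str = ""
    passed_inspection: bool = False
    log: List[str] = field(default_factory=list)


def classify(job: PdfJob) -> PdfJob:
    # Stand-in for classification preprocessing; the label is a placeholder.
    job.doc_type = "pdf"
    job.log.append("classify")
    return job


def parse_with_models(job: PdfJob) -> PdfJob:
    # Stand-in for model parsing; real models would produce these blocks.
    job.blocks = [
        {"type": "title", "text": "Example title"},
        {"type": "text", "text": "Example paragraph."},
    ]
    job.log.append("parse")
    return job


def process_blocks(job: PdfJob) -> PdfJob:
    # Stand-in for pipeline processing: turn parsed blocks into Markdown.
    lines = []
    for block in job.blocks:
        prefix = "# " if block["type"] == "title" else ""
        lines.append(prefix + block["text"])
    job.markdown = "\n\n".join(lines)
    job.log.append("pipeline")
    return job


def inspect_quality(job: PdfJob) -> PdfJob:
    # Stand-in for quality inspection: only checks the output is non-empty.
    job.passed_inspection = bool(job.markdown.strip())
    job.log.append("inspect")
    return job


# Stage order follows the article's description.
STAGES: List[Callable[[PdfJob], PdfJob]] = [
    classify,
    parse_with_models,
    process_blocks,
    inspect_quality,
]


def extract(path: str) -> PdfJob:
    """Run every stage in order on a fresh job for the given path."""
    job = PdfJob(path=path)
    for stage in STAGES:
        job = stage(job)
    return job


result = extract("example.pdf")
print(result.markdown)
```

Holding the stages in a list keeps the order explicit and lets a stage be replaced or instrumented without touching the others.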

The release of MinerU not only provides AI researchers with a powerful data processing tool but also further promotes the upgrade of the entire toolchain for large-model research and application.

ModelScope Community Experience Link:

https://modelscope.cn/studios/OpenDataLab/MinerU

Open Source Code Link:

https://github.com/opendatalab/MinerU/

MinerU Open Source Model (PDF-Extract-Kit):

https://modelscope.cn/models/OpenDataLab/PDF-Extract-Kit


This article is from AIbase Daily

