Recently, Alibaba's AI research team made a notable advance in document understanding with mPLUG-DocOwl1.5, a cutting-edge model that handles these tasks without OCR (Optical Character Recognition).
Traditionally, document understanding tasks often relied on OCR technology to extract text from images, which was frequently hampered by complex layouts and visual noise. In contrast, mPLUG-DocOwl1.5 bypasses this bottleneck by directly learning to understand documents from images through a novel unified structural learning framework.
The model analyzes document layouts and organizational structures across various domains, including general documents, tables, charts, web pages, and natural images. It not only accurately identifies text but also utilizes elements like spaces and line breaks when understanding document structures.
For tables, the model generates structured Markdown, and when parsing charts, it converts them into data tables by reasoning about the relationships among legends, axes, and values. Additionally, mPLUG-DocOwl1.5 can extract text from natural images.
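To illustrate the kind of structured output described above, here is a minimal sketch of turning parsed chart data into a Markdown data table. This is illustrative only: the function name, signature, and output layout are assumptions for the sake of the example, not the model's documented interface.

```python
def chart_to_markdown(x_name, categories, series):
    """Render parsed chart data as a Markdown data table.

    x_name:     name of the x-axis (e.g. "Year")
    categories: x-axis tick labels (e.g. ["2021", "2022"])
    series:     mapping of legend entry -> list of values per category
    (All names here are hypothetical, chosen for illustration.)
    """
    names = list(series)
    header = "| " + " | ".join([x_name] + names) + " |"
    divider = "|" + " --- |" * (len(names) + 1)
    rows = [
        "| " + " | ".join([cat] + [str(series[n][i]) for n in names]) + " |"
        for i, cat in enumerate(categories)
    ]
    return "\n".join([header, divider] + rows)

table = chart_to_markdown("Year", ["2021", "2022"],
                          {"Revenue": [10, 14], "Profit": [3, 5]})
print(table)
```

The point of such a representation is that legends become column headers and axis ticks become rows, so the chart's quantitative content survives as plain text a language model can reason over.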
In terms of text localization, mPLUG-DocOwl1.5 can identify and locate words, phrases, lines, and blocks, ensuring precise alignment between text and image regions. Its underlying H-Reducer architecture enhances processing efficiency by horizontally merging visual features through convolutional operations, maintaining spatial layout while reducing sequence length.
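The horizontal-merging idea behind H-Reducer can be sketched as follows. This is a simplified NumPy illustration under the assumption of a merge ratio of 4; the actual architecture uses a learned convolution inside the vision-to-language adapter, which this projection matrix only stands in for.

```python
import numpy as np

def h_reduce(features, weight, ratio=4):
    """Merge each run of `ratio` horizontally adjacent visual features
    into one token: concatenate them along the channel axis, then apply
    a projection. Rows (the vertical layout) are preserved, while the
    sequence length shrinks by `ratio`.

    features: (height, width, dim) grid of patch features
    weight:   (ratio * dim, dim) projection matrix (stands in for the
              learned convolution; an assumption for this sketch)
    """
    h, w, d = features.shape
    assert w % ratio == 0, "width must be divisible by the merge ratio"
    # Row-major reshape concatenates each group of `ratio` horizontal
    # neighbours into a single (ratio * dim)-vector.
    merged = features.reshape(h, w // ratio, ratio * d)
    return merged @ weight

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 32, 64))    # 32x32 patch grid, 64-dim features
proj = rng.standard_normal((4 * 64, 64)) * 0.01
out = h_reduce(feats, proj)
print(out.shape)  # (32, 8, 64): 1024 visual tokens reduced to 256
```

Merging horizontally rather than in 2-D blocks suits documents, where text flows left to right: characters on the same line are fused, but the line structure itself is untouched.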
To train this model, the research team built two carefully curated datasets: DocStruct4M, a large-scale dataset for unified structural learning, and DocReason25K, which exercises the model's reasoning abilities through step-by-step question answering.
Results indicate that mPLUG-DocOwl1.5 sets new state-of-the-art results on ten benchmarks, outperforming comparable models by more than 10 points on half of them. It also demonstrates strong linguistic reasoning, generating detailed step-by-step explanations for its answers.
Although mPLUG-DocOwl1.5 has made significant progress in various aspects, researchers acknowledge that there is still room for improvement, particularly in handling inconsistent or erroneous statements. In the future, the team hopes to further expand the unified structural learning framework to include more document types and tasks, advancing the development of document AI.
Paper: https://arxiv.org/abs/2403.12895
Code: https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5
Key Points:
📄 mPLUG-DocOwl1.5 is an AI model that excels in document understanding tasks without the need for OCR.
🔍 The model can analyze document layouts across various types and learn to understand directly from images.
📈 mPLUG-DocOwl1.5 has set new records in ten benchmark tests, showcasing superior linguistic reasoning capabilities.