Alibaba DAMO Academy and Renmin University of China have recently jointly open-sourced mPLUG-DocOwl1.5, a model for OCR-free document understanding: it parses document content without a separate OCR step and achieves leading results on multiple visual document understanding benchmarks.

Structural information is crucial for understanding the semantics of rich text images, such as documents, tables, and charts. While existing multimodal large language models (MLLMs) possess text recognition capabilities, they lack a general structural understanding of rich text document images. To address this issue, mPLUG-DocOwl1.5 emphasizes the importance of structural information in visual document understanding and proposes "unified structural learning" to enhance the performance of MLLMs.


The model's "unified structural learning" covers five domains: documents, web pages, tables, charts, and natural images, including structural-aware parsing tasks and multi-granularity text localization tasks. To better encode structural information, researchers have designed a simple yet effective visual-to-text module called H-Reducer, which not only preserves layout information but also reduces the length of visual features by merging horizontally adjacent image patches through convolution, enabling large language models to more effectively understand high-resolution images.


Additionally, to support structural learning, the research team built DocStruct4M, a comprehensive training set of 4 million samples drawn from publicly available datasets, containing structure-aware text sequences and multi-granularity pairs of text and bounding boxes. To further strengthen the reasoning capabilities of MLLMs in the document domain, they also constructed DocReason25K, a reasoning fine-tuning dataset with 25,000 high-quality samples.

mPLUG-DocOwl1.5 follows a two-stage training framework: unified structural learning first, followed by multi-task fine-tuning across downstream tasks. With this recipe, mPLUG-DocOwl1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the previous best results among models with a 7B LLM by more than 10 points on 5 of them.

Currently, the code, models, and datasets for mPLUG-DocOwl1.5 have been publicly released on GitHub.

Project Address: https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5

Paper Address: https://arxiv.org/pdf/2403.12895