Recently, Alibaba's AI research team made a notable advance in document understanding with mPLUG-DocOwl1.5, a cutting-edge model that handles these tasks without OCR (Optical Character Recognition).
Traditionally, document understanding tasks often relied on OCR technology to extract text from images, which was frequently hampered by complex layouts and visual noise. In contrast, mPLUG-DocOwl1.5 bypasses this bottleneck by directly learning to understand documents from images through a novel unified structural learning framework.
The model analyzes document layouts and organizational structures across various domains, including general documents, tables, charts, web pages, and natural images. It not only accurately identifies text but also utilizes elements like spaces and line breaks when understanding document structures.
For tables, the model generates structured Markdown, and when parsing charts, it converts them into data tables by reasoning about the relationships among legends, axes, and values. Additionally, mPLUG-DocOwl1.5 can extract text from natural images.
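To illustrate the kind of structured output described above, here is a minimal sketch of turning parsed chart data into a Markdown data table. This is illustrative only: the function name, signature, and output layout are assumptions for the sake of the example, not the model's documented interface.

```python
def chart_to_markdown(x_name, categories, series):
    """Render parsed chart data as a Markdown data table.

    x_name:     name of the x-axis (e.g. "Year")
    categories: x-axis tick labels (e.g. ["2021", "2022"])
    series:     mapping of legend entry -> list of values per category
    (All names here are hypothetical, chosen for illustration.)
    """
    names = list(series)
    header = "| " + " | ".join([x_name] + names) + " |"
    divider = "|" + " --- |" * (len(names) + 1)
    rows = [
        "| " + " | ".join([cat] + [str(series[n][i]) for n in names]) + " |"
        for i, cat in enumerate(categories)
    ]
    return "\n".join([header, divider] + rows)

table = chart_to_markdown("Year", ["2021", "2022"],
                          {"Revenue": [10, 14], "Profit": [3, 5]})
print(table)
```

The point of such a representation is that legends become column headers and axis ticks become rows, so the chart's quantitative content survives as plain text a language model can reason over.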
In terms of text localization, mPLUG-DocOwl1.5 can identify and locate words, phrases, lines, and blocks, ensuring precise alignment between text and image regions. Its underlying H-Reducer architecture enhances processing efficiency by horizontally merging visual features through convolutional operations, maintaining spatial layout while reducing sequence length.
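The horizontal-merging idea behind H-Reducer can be sketched as follows. This is a simplified NumPy illustration under the assumption of a merge ratio of 4; the actual architecture uses a learned convolution inside the vision-to-language adapter, which this projection matrix only stands in for.

```python
import numpy as np

def h_reduce(features, weight, ratio=4):
    """Merge each run of `ratio` horizontally adjacent visual features
    into one token: concatenate them along the channel axis, then apply
    a projection. Rows (the vertical layout) are preserved, while the
    sequence length shrinks by `ratio`.

    features: (height, width, dim) grid of patch features
    weight:   (ratio * dim, dim) projection matrix (stands in for the
              learned convolution; an assumption for this sketch)
    """
    h, w, d = features.shape
    assert w % ratio == 0, "width must be divisible by the merge ratio"
    # Row-major reshape concatenates each group of `ratio` horizontal
    # neighbours into a single (ratio * dim)-vector.
    merged = features.reshape(h, w // ratio, ratio * d)
    return merged @ weight

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 32, 64))    # 32x32 patch grid, 64-dim features
proj = rng.standard_normal((4 * 64, 64)) * 0.01
out = h_reduce(feats, proj)
print(out.shape)  # (32, 8, 64): 1024 visual tokens reduced to 256
```

Merging horizontally rather than in 2-D blocks suits documents, where text flows left to right: characters on the same line are fused, but the line structure itself is untouched.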
To train this model, the research team built two carefully curated datasets: DocStruct4M, a large-scale dataset for unified structural learning, and DocReason25K, which exercises the model's reasoning abilities through step-by-step question answering.
Results indicate that mPLUG-DocOwl1.5 sets new state-of-the-art results on ten benchmarks, outperforming comparable models by more than 10 points on half of them. It also demonstrates strong linguistic reasoning, generating detailed step-by-step explanations for its answers.
Although mPLUG-DocOwl1.5 has made significant progress in various aspects, researchers acknowledge that there is still room for improvement, particularly in handling inconsistent or erroneous statements. In the future, the team hopes to further expand the unified structural learning framework to include more document types and tasks, advancing the development of document AI.
Paper: https://arxiv.org/abs/2403.12895
Code: https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5
Key Points:
📄 mPLUG-DocOwl1.5 is an AI model that excels in document understanding tasks without the need for OCR.
🔍 The model can analyze document layouts across various types and learn to understand directly from images.
📈 mPLUG-DocOwl1.5 has set new records in ten benchmark tests, showcasing superior linguistic reasoning capabilities.