Remember the much-hyped "image-to-text" marvel, GPT-4V? It can understand what's in a picture and even carry out tasks based on images, a real boon for the lazy! But it has a fatal flaw: poor eyesight.
Imagine asking GPT-4V to press a button for you, only to watch it tap around the screen like it can't see a thing. Infuriating, right?
Today I'd like to introduce a tool that sharpens GPT-4V's eyesight: OmniParser, a new model released by Microsoft designed to tackle the challenge of automatic interaction with graphical user interfaces (GUIs).
What does OmniParser do?
Simply put, OmniParser acts as a "screen translator," converting screenshots into a "structured language" that GPT-4V can understand. It combines the outputs of a fine-tuned interactive icon detection model, a fine-tuned icon description model, and an OCR module.
Together these produce a structured, DOM-like representation of the UI, plus a screenshot overlaid with bounding boxes of candidate interactive elements. The researchers first built an interactive icon detection dataset from popular web pages, along with an icon description dataset, and used them to fine-tune two specialized models: one that detects interactable regions on the screen and one that extracts the functional semantics of the detected elements.
Specifically, OmniParser will:
Identify all interactive icons and buttons on the screen, mark each with a bounding box, and assign every box a unique ID.
Describe the function of each icon in text, such as "settings" or "minimize."
Recognize the text on the screen and extract it.
This way, GPT-4V knows exactly what is on the screen and what each element does; to press a button, it only needs to reply with the right ID (see the sketch below).
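To make that concrete, here is a minimal, hypothetical sketch of what the "structured language" handed to GPT-4V might look like. The ScreenElement fields and the build_prompt helper are illustrative assumptions, not OmniParser's actual output format or API.

```python
# Hypothetical sketch: a parsed-screen data structure and a prompt builder.
# Field names and the helper are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    element_id: int    # unique ID drawn on the annotated screenshot
    kind: str          # "icon", "button", or "text"
    bbox: tuple        # (x_min, y_min, x_max, y_max) in pixels
    description: str   # functional semantics, e.g. "settings", "minimize"

def build_prompt(elements: list[ScreenElement], task: str) -> str:
    """Serialize the parsed screen into text a vision-language model can
    reason over; the agent then answers with an element ID to act on."""
    lines = [f"[{e.element_id}] {e.kind} at {e.bbox}: {e.description}"
             for e in elements]
    return (
        "Here is a structured description of the current screen:\n"
        + "\n".join(lines)
        + f"\n\nTask: {task}\nReply with the ID of the element to click."
    )

# Example usage with made-up detections:
screen = [
    ScreenElement(0, "icon", (12, 8, 44, 40), "open the settings menu"),
    ScreenElement(1, "button", (300, 8, 332, 40), "minimize the window"),
    ScreenElement(2, "text", (60, 120, 400, 150), "Sign in to continue"),
]
print(build_prompt(screen, "Minimize the current window"))
```

The point of this hand-off is that the downstream model never has to guess pixel coordinates itself: it reasons over IDs and descriptions, and the agent maps the chosen ID back to a bounding box before clicking.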
How impressive is OmniParser?
Researchers have put OmniParser through various tests, and it truly improves GPT-4V's "eyesight"!
In the ScreenSpot benchmark, OmniParser significantly boosted GPT-4V's accuracy, even surpassing models trained specifically for graphical interfaces: on ScreenSpot its accuracy improved by 73%, outperforming models that rely on underlying HTML parsing. Notably, adding the local semantics of UI elements brought a large gain in prediction accuracy; with OmniParser's output, GPT-4V's rate of correctly labeling icons rose from 70.5% to 93.8%.
In the Mind2Web benchmark, OmniParser improved GPT-4V's performance on web browsing tasks, with accuracy surpassing even a GPT-4V agent given access to the underlying HTML.
In the AITW benchmark, OmniParser also significantly improved GPT-4V's performance on mobile navigation tasks.
What are OmniParser's shortcomings?
Although OmniParser is powerful, it also has some minor flaws, such as:
It can get confused by repeated icons or duplicated text, and needs finer-grained descriptions to tell them apart.
Sometimes the bounding boxes are imprecise, so GPT-4V ends up clicking the wrong spot.
Occasionally it misinterprets an icon's meaning and needs more context to describe it accurately.
However, researchers are working hard to improve OmniParser, and it is expected to become increasingly powerful, eventually becoming GPT-4V's best partner!
Model experience: https://huggingface.co/microsoft/OmniParser
Paper link: https://arxiv.org/pdf/2408.00203
Official introduction: https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
Key points:
✨OmniParser helps GPT-4V better understand screen content, enabling more accurate task execution.
🔍OmniParser has performed exceptionally well in various tests, proving its effectiveness.
🛠️OmniParser still has areas for improvement, but the future looks promising.