Microsoft recently released OmniParser V2.0, a parsing tool that converts user interface (UI) screenshots into structured formats. OmniParser is intended to improve the performance of UI agents built on large language models (LLMs) by helping them understand and interact with what is shown on the screen.

The tool's training dataset includes an interactive icon detection dataset, carefully selected and automatically annotated from popular web pages to highlight clickable and actionable areas. Additionally, there is an icon description dataset aimed at linking each UI element with its corresponding functionality.

In V2.0, OmniParser has been improved significantly. The icon caption and grounding dataset is larger and cleaner, and latency is about 60% lower than in V1: tests show an average of roughly 0.6 seconds per frame on an A100 and 0.8 seconds per frame on a single RTX 4090. On the ScreenSpot Pro benchmark, OmniParser reached an average accuracy of 39.6.
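To put the per-frame latency figures in context, the short sketch below converts them to throughput. It assumes screenshots are processed one at a time with no batching or other overhead, which the source does not state; the dictionary and function names are illustrative only.

```python
# Average latencies reported for OmniParser V2.0, in seconds per frame.
LATENCY_S_PER_FRAME = {"A100": 0.6, "RTX 4090": 0.8}


def frames_per_second(latency_s: float) -> float:
    """Throughput implied by a per-frame latency (assumes sequential processing)."""
    return 1.0 / latency_s


def seconds_for(n_frames: int, latency_s: float) -> float:
    """Total time to parse n_frames screenshots one after another."""
    return n_frames * latency_s


for device, latency in LATENCY_S_PER_FRAME.items():
    fps = frames_per_second(latency)
    total = seconds_for(100, latency)
    print(f"{device}: {fps:.2f} frames/s, about {total:.0f} s per 100 screenshots")
```

Under these assumptions, an agent that parses one screenshot per step spends well under a second per step on parsing on either device.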

OmniParser works together with OmniTool, which lets users control a Windows 11 virtual machine using OmniParser plus a vision model of their choice. OmniTool currently supports several large language models out of the box, including OpenAI models, DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use.

OmniParser converts unstructured screenshot images into a structured list of elements that includes the locations of interactive regions and descriptions of the likely function of each icon. The tool only extracts information, so users still need to apply their own judgment and critical thinking to its output. It works on many kinds of screenshots, including PC and mobile interfaces.
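As a rough illustration of what such a structured element list might look like to downstream code, here is a minimal Python sketch. The `ParsedElement` record, its field names, and the sample data are hypothetical and are not OmniParser's actual output schema; the sketch only shows how locations and descriptions could be turned into text lines for an LLM agent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """Hypothetical record for one detected UI element (not OmniParser's real schema)."""
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to 0..1
    interactive: bool                        # True if the region is clickable/actionable
    description: str                         # short guess at what the element does


def center(el: ParsedElement) -> Tuple[float, float]:
    """Midpoint of the bounding box, e.g. a candidate click target."""
    x_min, y_min, x_max, y_max = el.bbox
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def to_prompt_lines(elements: List[ParsedElement]) -> List[str]:
    """Render only the interactive elements as numbered text lines."""
    lines = []
    for idx, el in enumerate(elements):
        if not el.interactive:
            continue
        cx, cy = center(el)
        lines.append(f"[{idx}] {el.description} @ ({cx:.2f}, {cy:.2f})")
    return lines


# Made-up sample data standing in for a parsed screenshot.
elements = [
    ParsedElement((0.10, 0.10, 0.20, 0.20), True, "Settings gear icon"),
    ParsedElement((0.30, 0.40, 0.90, 0.50), False, "Static heading text"),
    ParsedElement((0.50, 0.80, 0.70, 0.90), True, "Submit button"),
]
for line in to_prompt_lines(elements):
    print(line)
```

Keeping the original index in each line lets an agent refer back to a specific element when it chooses an action, while the final decision about what to do with that element stays with the user or the calling model.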

OmniParser also has limitations. It does not detect harmful content in its input, so users should make sure the screenshots they provide do not contain harmful information. In addition, although OmniParser only converts screenshots to text, it can be used to build GUI agents that take real actions, so developers who build and operate such agents must follow safety standards and ethical guidelines.

Model: https://huggingface.co/microsoft/OmniParser-v2.0

Project: https://github.com/microsoft/OmniParser/tree/master

Highlights:  

🔍 OmniParser V2.0 is a parsing tool that converts UI screenshots into structured information for LLM-based UI agents.  

⚡ The new version brings significant improvements, with average latency reduced to about 0.6 seconds per frame on an A100 and an average accuracy of 39.6 on ScreenSpot Pro.  

🔐 Users should be mindful of the safety of the input content, and developers should follow safety standards and ethical guidelines.