Microsoft Releases OmniParser V2.0: Converting Screenshots into Structured Formats for LLM Processing

AIbase基地

Published in AI News · 5 min read · Feb 17, 2025

Microsoft recently released OmniParser V2.0, a parsing tool that converts user interface (UI) screenshots into structured formats. OmniParser can improve the performance of UI agents built on large language models (LLMs) by giving them a structured view of what is on the screen and which elements can be interacted with.

The tool's training data has two parts. The first is an interactive icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions. The second is an icon description dataset that links each UI element to its corresponding function.

In V2.0, OmniParser has been improved significantly. The icon description and localization dataset is larger and cleaner, and according to Microsoft's model card, latency is about 60% lower than in V1: roughly 0.6 seconds per frame on an A100 and 0.8 seconds per frame on a single RTX 4090. On the ScreenSpot Pro benchmark, OmniParser achieved an average accuracy of 39.6.
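To put the latency figures in context, the short sketch below converts seconds per frame into frames per second and estimates the parsing time for a multi-step agent run. The 20-screenshot run length and the helper names are illustrative assumptions, not values from Microsoft, and the calculation assumes frames are parsed one after another with no batching.

```python
# Per-frame latencies reported in the article (seconds per frame).
LATENCY_S = {"A100": 0.6, "RTX 4090": 0.8}


def frames_per_second(latency_s: float) -> float:
    """Throughput when frames are parsed strictly one after another."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


def parse_time(steps: int, latency_s: float) -> float:
    """Total parsing time for a hypothetical run of `steps` screenshots."""
    return steps * latency_s


for device, latency in LATENCY_S.items():
    # 20 screenshots is an assumed run length, chosen only for illustration.
    print(
        f"{device}: {frames_per_second(latency):.2f} frames/s, "
        f"{parse_time(20, latency):.1f} s for 20 screenshots"
    )
```

At these rates, parsing accounts for roughly 12 to 16 seconds of a 20-step run; the time the LLM spends deciding each action comes on top of that.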

OmniParser works together with OmniTool, which lets users control a Windows 11 virtual machine using OmniParser and a vision model of their choice. OmniTool currently supports a range of large language models, including several OpenAI models, DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use.

OmniParser is designed to convert unstructured screenshot images into a structured list of elements, including the locations of interactive regions and descriptions of each icon's likely function. It is intended for users who already apply sound analytical practice and critical thinking: OmniParser can extract information, but the final judgment on its output must still be made by a human. The tool works on various types of screenshots, including PC and mobile interfaces.
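As a rough illustration of what a "structured list of elements" might look like to a downstream agent, here is a minimal Python sketch. The `UIElement` schema, its field names, and the normalized bounding-box convention are assumptions made for this example; OmniParser's actual output format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    # Hypothetical schema for illustration; not OmniParser's real output format.
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to 0..1
    interactive: bool                        # whether the region is clickable/actionable
    description: str                         # likely function of the element


def center_px(el: UIElement, width: int, height: int) -> tuple[int, int]:
    """Pixel coordinates of the element's center, e.g. as a click target."""
    x0, y0, x1, y1 = el.bbox
    return (round((x0 + x1) / 2 * width), round((y0 + y1) / 2 * height))


def to_prompt(elements: list[UIElement]) -> str:
    """Render the element list as numbered text lines an LLM can refer to by index."""
    lines = []
    for i, el in enumerate(elements):
        kind = "interactive" if el.interactive else "static"
        lines.append(f"[{i}] {kind}: {el.description}")
    return "\n".join(lines)


# Made-up elements standing in for a parsed screenshot.
elements = [
    UIElement((0.10, 0.05, 0.20, 0.10), True, "Search button"),
    UIElement((0.25, 0.05, 0.75, 0.10), False, "Page title text"),
]

print(to_prompt(elements))
print(center_px(elements[0], 1920, 1080))  # (288, 81)
```

With a list like this, an agent can pick an element by index from the text rendering and translate it back into a screen coordinate, instead of guessing pixel positions from the raw image.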

OmniParser also has limitations. The tool does not detect harmful content in its input, so users should check that the screenshots they provide do not contain harmful information. Furthermore, although OmniParser itself only converts screenshots to text, it can be used to build graphical user interface (GUI) agents that take real actions. Developers who build and operate such agents must adhere to safety standards and ethical guidelines.

Model: https://huggingface.co/microsoft/OmniParser-v2.0

Project: https://github.com/microsoft/OmniParser/tree/master

Highlights:

🔍 OmniParser V2.0 is a parsing tool that converts UI screenshots into structured information for LLM-based UI agents.

⚡ The new version brings significant improvements, with average latency reduced to about 0.6 seconds per frame on an A100 and an average accuracy of 39.6 on ScreenSpot Pro.

🔐 Users should make sure their input is free of harmful content, and developers should follow safety standards and ethical guidelines.
