ScreenSpot-Pro: A Multimodal LLM Benchmark Tool Designed for High-Resolution Environments!

AIbase基地

Published in AI News · 6 min read · Jan 6, 2025

In professional environments, graphical user interface (GUI) agents face three major challenges. First, the complexity of professional applications is significantly higher than that of general software, requiring a deep understanding of intricate layouts. Second, professional tools often have higher resolutions, resulting in smaller target sizes, which decreases positioning accuracy. Finally, workflows often rely on additional tools and documentation, increasing operational complexity. These challenges highlight the necessity of developing more advanced benchmarks and solutions to enhance the performance of GUI agents in these demanding scenarios.

Current GUI localization models and benchmarks do not meet the demands of professional environments. For example, benchmarks like ScreenSpot are primarily designed for low-resolution tasks and lack the diversity needed to accurately simulate real-world scenarios. Additionally, models like OS-Atlas and UGround are computationally inefficient and often fail when targets are small or when the interface is rich in icons. Furthermore, the lack of multilingual support limits the applicability of these models in global workflows. These shortcomings further emphasize the need for more comprehensive and realistic benchmarks to drive progress in this field.

To address these issues, research teams from the National University of Singapore, East China Normal University, and Hong Kong Baptist University have launched ScreenSpot-Pro, a new benchmark specifically tailored for high-resolution professional environments. The benchmark includes a dataset of 1,581 tasks from 23 professional applications, covering development, creative tools, CAD, scientific platforms, and office suites. It features high-resolution full-screen visuals and ensures accuracy and realism through expert annotations. ScreenSpot-Pro also provides task instructions in multiple languages, including English and Chinese, to broaden the evaluation scope. Unlike previous efforts, ScreenSpot-Pro documents actual workflows to ensure high-quality annotations, providing an effective tool for the comprehensive evaluation and development of GUI localization models.

This dataset captures real and challenging scenarios based on high-resolution images, where the target areas average only 0.07% of the total screen, showcasing the subtlety and miniaturization of GUI elements. The data is collected by professional users with extensive experience in relevant applications, using specialized tools to ensure annotation accuracy. Moreover, this dataset supports multilingual functionality, facilitating the testing of bilingual capabilities, and includes multiple workflows to capture the nuances of professional tasks. These features make it particularly beneficial for assessing and enhancing the accuracy and flexibility of GUI agents.
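To give a sense of scale for the 0.07% figure, the short calculation below converts that share of the screen into pixels. The 3840 x 2160 resolution is an assumption chosen for illustration; the article does not state the resolution of the screenshots, and `target_size` is a hypothetical helper name.

```python
import math


def target_size(screen_w: int, screen_h: int, area_fraction: float) -> tuple[int, float]:
    """Return the target's pixel area and the side length of an equal-area square."""
    area_px = screen_w * screen_h * area_fraction
    return round(area_px), math.sqrt(area_px)


# Assumed 4K screenshot (hypothetical); 0.07% is the average share cited above.
area, side = target_size(3840, 2160, 0.0007)
print(f"{area} px^2, about {side:.0f} x {side:.0f} px")
```

Under that assumption, an average target is comparable to a square roughly 76 pixels on a side within a screenshot of more than 8 million pixels.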

Analysis of existing GUI localization models on ScreenSpot-Pro reveals that they struggle in high-resolution professional environments. The best-performing model, OS-Atlas-7B, reaches an accuracy of only 18.9%. The ReGround method, which takes an iterative approach and refines its prediction over multiple steps, raises accuracy to 40.2%. Small components such as icons remain especially hard to recognize, and bilingual tasks further expose the models' limitations. These findings emphasize the need for improved techniques to enhance contextual understanding and adaptability in complex GUI environments.
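The sketch below illustrates, under stated assumptions, how such an accuracy figure could be computed and why an iterative approach can help. It assumes a prediction counts as correct when the predicted click point falls inside the target's bounding box, and it shows one plausible iterative scheme: predict on the full screen, then predict again inside a crop centered on the first guess. It is not the ReGround implementation; `reground`, `toy_ground`, and all coordinates are hypothetical, and the toy grounder simply stands in for a model whose error shrinks as the search region shrinks.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def is_hit(point: Point, box: Box) -> bool:
    """Assumed metric: a prediction is correct if it lies inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their target boxes."""
    hits = sum(is_hit(p, b) for p, b in zip(preds, boxes))
    return hits / len(boxes) if boxes else 0.0


def reground(ground: Callable[[Box], Point], screen: Box, crop_size: float) -> Point:
    """Two-pass grounding: guess on the full screen, then guess again in a crop.

    Assumes crop_size fits within the screen and that `ground` returns
    full-screen coordinates for any region it is given.
    """
    x, y = ground(screen)
    half = crop_size / 2
    # Center the crop on the first guess, clamped to stay inside the screen.
    left = min(max(x - half, screen[0]), screen[2] - crop_size)
    top = min(max(y - half, screen[1]), screen[3] - crop_size)
    return ground((left, top, left + crop_size, top + crop_size))


# Toy stand-in for a model: its error is 2% of the width of the region it sees.
TARGET_CENTER = (3000.0, 500.0)
TARGET_BOX: Box = (2970.0, 470.0, 3030.0, 530.0)
SCREEN: Box = (0.0, 0.0, 3840.0, 2160.0)


def toy_ground(region: Box) -> Point:
    err = 0.02 * (region[2] - region[0])
    return (TARGET_CENTER[0] + err, TARGET_CENTER[1] + err)


first = toy_ground(SCREEN)
second = reground(toy_ground, SCREEN, 1024)
print(is_hit(first, TARGET_BOX), is_hit(second, TARGET_BOX))
```

In this toy setup the full-screen guess misses the 60-pixel target while the second, cropped guess lands inside it, which is the intuition behind narrowing the search area on very large screenshots.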

ScreenSpot-Pro sets a transformative benchmark for the evaluation of GUI agents in high-resolution professional environments. It addresses specific challenges in complex workflows and provides a diverse and precise dataset to guide innovation in GUI localization. This contribution will lay the groundwork for smarter and more efficient agents, supporting seamless execution of professional tasks and significantly enhancing productivity and innovation across various industries.

Paper: https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf

Data: https://huggingface.co/datasets/likaixin/ScreenSpot-Pro

Key Points:
🌟 **Complexity of Professional Applications**: GUI agents must handle high complexity and high resolution in professional software interfaces.
🛠️ **ScreenSpot-Pro Dataset**: Contains 1,581 tasks covering 23 professional applications, supporting multilingual evaluation.
📈 **Model Performance Improvement**: Iterative, multi-step methods such as ReGround raise GUI localization accuracy in high-resolution environments from 18.9% to 40.2%.

This article is from AIbase Daily

—— Created by the AIbase Daily Team
