In professional environments, graphical user interface (GUI) agents face three major challenges. First, professional applications are significantly more complex than general-purpose software and require a deep understanding of intricate layouts. Second, professional tools are often run at higher resolutions, so each target occupies a smaller share of the screen, which lowers positioning accuracy. Finally, workflows often rely on additional tools and documentation, which increases operational complexity. These challenges call for more advanced benchmarks and solutions to improve the performance of GUI agents in such demanding scenarios.
Current GUI localization (grounding) models and benchmarks do not meet the demands of professional environments. For example, benchmarks like ScreenSpot are primarily designed for low-resolution tasks and lack the diversity needed to simulate real-world scenarios accurately. Models like OS-Atlas and UGround are computationally inefficient and often fail when targets are small or when the interface is rich in icons. In addition, the lack of multilingual support limits the applicability of these models in global workflows. These shortcomings underline the need for more comprehensive and realistic benchmarks to drive progress in this field.
To address these issues, research teams from the National University of Singapore, East China Normal University, and Hong Kong Baptist University have introduced ScreenSpot-Pro, a new benchmark tailored to high-resolution professional environments. The benchmark contains 1,581 tasks drawn from 23 professional applications, covering development, creative tools, CAD, scientific platforms, and office suites. It uses high-resolution full-screen screenshots and relies on expert annotations for accuracy and realism. ScreenSpot-Pro also provides instructions in both English and Chinese to broaden the evaluation scope. Unlike previous efforts, ScreenSpot-Pro records actual workflows to produce high-quality annotations, making it an effective tool for the comprehensive evaluation and development of GUI localization models.
The dataset captures real, challenging scenarios in high-resolution images, where target areas average only 0.07% of the total screen, which shows how small and subtle GUI elements can be. The data was collected by professional users with extensive experience in the relevant applications, using specialized tools to ensure annotation accuracy. The dataset is also bilingual, which makes it possible to test models on both English and Chinese instructions, and it includes multiple workflows to capture the nuances of professional tasks. These features make it particularly useful for assessing and improving the accuracy and flexibility of GUI agents.
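To give a sense of scale for the 0.07% figure, the short sketch below computes the share of the screen that a single target covers. The bounding box and the 2560x1440 screenshot size are hypothetical values chosen for illustration; they are not taken from the dataset.

```python
def target_area_percent(bbox, screen_size):
    """Return the share of the screen covered by bbox, in percent.

    bbox is (left, top, right, bottom) in pixels.
    screen_size is (width, height) in pixels.
    """
    left, top, right, bottom = bbox
    width, height = screen_size
    # Clamp to zero so a malformed (inverted) box does not yield a negative area.
    target_area = max(0, right - left) * max(0, bottom - top)
    return 100.0 * target_area / (width * height)


# Hypothetical example: a 64x40 px toolbar icon on a 2560x1440 screenshot.
share = target_area_percent((1200, 20, 1264, 60), (2560, 1440))
print(f"{share:.3f}% of the screen")  # prints "0.069% of the screen"
```

Under these assumed numbers, a 64x40 pixel icon already sits at roughly the average target share reported for the dataset, so a prediction that is off by a few dozen pixels misses the target entirely.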
Evaluating existing GUI localization models on ScreenSpot-Pro reveals that they handle high-resolution professional environments poorly. The best-performing model, OS-Atlas-7B, reaches only 18.9% accuracy. ReGround, which takes an iterative approach and refines its prediction over multiple steps, improves on this and reaches 40.2%. Small components such as icons are especially difficult to recognize, and bilingual tasks expose further limitations of the models. These findings point to the need for better techniques that improve contextual understanding and adaptability in complex GUI environments.
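The article does not spell out how accuracy is computed. A common convention for GUI localization benchmarks, assumed here, is to count a prediction as correct when the predicted click point falls inside the annotated bounding box of the target. The function names and sample coordinates below are illustrative only.

```python
def point_in_box(point, bbox):
    """Return True if point (x, y) lies inside bbox (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def localization_accuracy(predictions, targets):
    """Percentage of predicted points that land inside their target boxes.

    Assumed metric: one predicted click point per task, one target box per task.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Hypothetical predictions for three tasks; only the first lands inside its box.
predicted_points = [(1230, 40), (300, 300), (90, 15)]
target_boxes = [(1200, 20, 1264, 60), (500, 500, 540, 530), (100, 10, 130, 30)]
print(f"{localization_accuracy(predicted_points, target_boxes):.1f}%")  # prints "33.3%"
```

With targets this small, the third prediction misses by only 10 pixels yet still counts as a failure, which helps explain why reported accuracies on high-resolution screens are low.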
ScreenSpot-Pro sets a new benchmark for evaluating GUI agents in high-resolution professional environments. It addresses the specific challenges of complex workflows and provides a diverse, precisely annotated dataset to guide innovation in GUI localization. This contribution lays the groundwork for smarter and more efficient agents that can carry out professional tasks and improve productivity across industries.
Paper: https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf
Data: https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
Key Points:
🌟 **Complexity of Professional Applications**: GUI agents must handle high complexity and high resolution in professional software interfaces.
🛠️ **ScreenSpot-Pro Dataset**: Contains 1,581 tasks covering 23 professional applications and supports bilingual (English and Chinese) evaluation.
📈 **Model Performance Improvement**: An iterative, multi-step approach (ReGround) raises GUI localization accuracy in high-resolution environments to 40.2%, compared with 18.9% for OS-Atlas-7B.