Phones, tablets, computers, TVs: as screens multiply and interfaces grow more complex, it is easy to feel overwhelmed. Apple has recently unveiled Ferret-UI2, a UI understanding model designed to work across platforms rather than on a single type of device.
This isn't just hype. Ferret-UI2 aims to be a true all-rounder, understanding user interfaces on iPhone, Android, iPad, the web, and Apple TV.
One of the standout features of Ferret-UI2 is its multi-platform support. Unlike the original Ferret-UI, which is limited to mobile platforms, Ferret-UI2 can also understand UI screens from tablets, web pages, and smart TVs. This lets it adapt to today's diverse device ecosystems and opens up a wider range of application scenarios.
To improve UI perception, Ferret-UI2 introduces dynamic high-resolution image encoding built on an enhanced method called "adaptive gridding." Instead of squeezing every screenshot into one fixed input size, the method divides each screenshot into a grid of sub-images chosen to fit its shape, so Ferret-UI2 can perceive the UI at close to its original resolution and identify visual elements and their relationships more accurately.
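To make the grid idea concrete, here is a minimal Python sketch of how a grid could be chosen for a screenshot. It is an illustration under stated assumptions, not the paper's algorithm: the names `Grid`, `choose_grid`, and `tile_boxes`, the 336-pixel tile size, the 9-cell budget, and the cost function (log aspect-ratio change plus log pixel-count change) are all hypothetical choices made for this example.

```python
import math
from typing import NamedTuple


class Grid(NamedTuple):
    cols: int
    rows: int


def choose_grid(width: int, height: int, max_cells: int = 9, tile: int = 336) -> Grid:
    """Pick the cols x rows grid (cols * rows <= max_cells) that distorts
    the screenshot least. The cost function is an assumption for this sketch."""
    best, best_cost = Grid(1, 1), math.inf
    for cols in range(1, max_cells + 1):
        for rows in range(1, max_cells // cols + 1):
            # How far the grid's aspect ratio is from the screenshot's.
            aspect_cost = abs(math.log((cols / rows) / (width / height)))
            # How far the grid's total pixel count is from the screenshot's.
            pixel_cost = abs(math.log((cols * rows * tile * tile) / (width * height)))
            cost = aspect_cost + pixel_cost
            if cost < best_cost:
                best, best_cost = Grid(cols, rows), cost
    return best


def tile_boxes(grid: Grid, tile: int = 336) -> list:
    """Crop boxes (left, top, right, bottom) on the resized canvas, row by row."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(grid.rows)
        for c in range(grid.cols)
    ]


portrait = choose_grid(1170, 2532)   # a tall phone screenshot
landscape = choose_grid(1920, 1080)  # a wide TV or desktop screenshot
print(portrait, landscape)           # Grid(cols=2, rows=4) Grid(cols=4, rows=2)
print(len(tile_boxes(portrait)))     # 8
```

Under these assumptions, a tall phone screenshot gets a tall 2 x 4 grid and a wide TV screenshot gets a wide 4 x 2 grid, which is the point of adapting the grid to each platform's screen shape.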
Ferret-UI2 also relies on high-quality training data for both basic and advanced tasks. For basic tasks, it converts simple referring and grounding annotations (describing the element at a given location, or locating a described element) into conversational form, giving the model a foundational understanding of various UI screens. For advanced tasks, which focus more on user experience, it generates training data with **GPT-4o and set-of-mark visual prompting**, replacing the simple click instructions of earlier methods with single-step, user-centric interactions.
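The basic-task conversion can be pictured as turning each annotated widget into a question-and-answer pair. The Python sketch below is only an illustration: the `Widget` fields, the function names, and the prompt wording are hypothetical and are not taken from the paper's actual templates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Widget:
    """One annotated UI element (hypothetical schema for this sketch)."""
    label: str                       # visible text or accessibility label
    kind: str                        # e.g. "button", "icon", "text field"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def referring_turn(w: Widget) -> dict:
    """Referring: the question gives a region, the answer describes it."""
    return {
        "user": f"What is the UI element at {list(w.box)}?",
        "assistant": f"A {w.kind} labeled '{w.label}'.",
    }


def grounding_turn(w: Widget) -> dict:
    """Grounding: the question describes an element, the answer locates it."""
    return {
        "user": f"Where is the {w.kind} labeled '{w.label}'?",
        "assistant": f"It is at {list(w.box)}.",
    }


login = Widget(label="Log in", kind="button", box=(40, 900, 340, 960))
print(referring_turn(login))
print(grounding_turn(login))
```

Both turns come from the same annotation; only the direction of the question changes, which is how a single labeled screenshot can yield several training conversations.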
To evaluate Ferret-UI2, the researchers built 45 benchmarks covering five platforms, with six basic tasks and three advanced tasks per platform (nine tasks on each of five platforms). They also used public benchmarks such as GUIDE and GUI-World. The results show that Ferret-UI2 outperforms Ferret-UI on every benchmark tested, with the largest gains on advanced tasks, demonstrating its versatility in cross-platform UI understanding.
Ablation studies further indicate that both the architectural and the dataset improvements in Ferret-UI2 contribute to its performance gains, with the new dataset mattering more on the harder tasks. Ferret-UI2 also does well in cross-platform transfer, generalizing particularly well among the iPhone, iPad, and Android platforms.
Model: https://huggingface.co/jadechoghari/Ferret-UI-Llama8b
Paper: https://arxiv.org/pdf/2410.18967