Apple has unveiled Ferret-UI 2, the new generation of its AI system for understanding user interfaces. The cross-platform system scored 89.73 in UI element recognition tests, well ahead of GPT-4o's 77.73.
The system's standout feature is its understanding of user intent. Rather than relying on coordinate-based click operations, Ferret-UI 2 takes a natural language command from the user, locates the relevant interface element itself, and carries out the action. The research team used GPT-4o's visual capabilities to generate training data, which helps the system understand the spatial relationships between interface elements.
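To make the contrast with coordinate-based clicking concrete, here is a minimal Python sketch of resolving a natural language command to an on-screen element. It is an illustration only, not Apple's method: the `UIElement` type, the `resolve_target` function, and the word-overlap scoring are all hypothetical stand-ins for what Ferret-UI 2 does with a multimodal model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UIElement:
    """A detected interface element (hypothetical structure for illustration)."""
    label: str
    role: str
    bbox: tuple  # (left, top, right, bottom) in pixels

    def center(self) -> tuple:
        left, top, right, bottom = self.bbox
        return ((left + right) // 2, (top + bottom) // 2)


def resolve_target(command: str, elements: list) -> Optional[UIElement]:
    """Pick the element whose label shares the most words with the command.

    A toy heuristic: a real system would use a learned model, not word overlap.
    """
    command_words = set(command.lower().split())
    best, best_score = None, 0
    for element in elements:
        score = len(command_words & set(element.label.lower().split()))
        if score > best_score:
            best, best_score = element, score
    return best


# Made-up screen contents, for demonstration only.
screen = [
    UIElement("Search", "text_field", (20, 60, 340, 100)),
    UIElement("Add to cart", "button", (20, 500, 340, 550)),
    UIElement("Back", "button", (10, 10, 60, 50)),
]

# The caller states an intent; the tap coordinates are derived, not supplied.
target = resolve_target("please add this item to my cart", screen)
if target is not None:
    print(f"Tap '{target.label}' at {target.center()}")
```

The point of the design is that the caller never supplies pixel coordinates: they fall out of whichever element best matches the stated intent.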
Architecturally, Ferret-UI 2 uses an adaptive design that recognizes UI elements across iPhone, iPad, Android devices, web browsers, and Apple TV. It automatically balances image resolution against processing requirements for each platform, keeping local computation efficient while preserving on-screen detail.
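One way to picture this resolution trade-off is a tile-grid selection: split each screenshot into fixed-size tiles, cap the number of tiles to bound compute, and pick the grid shape that distorts the screen's aspect ratio least. The Python sketch below works under those assumptions; the `choose_grid` function, the budget of 8 tiles, and the example screen sizes are illustrative choices, not details taken from Apple's system.

```python
import math


def choose_grid(width: int, height: int, max_tiles: int = 8) -> tuple:
    """Return (cols, rows) with cols * rows <= max_tiles that best fits the screen.

    Assumed criterion: minimise aspect-ratio distortion, then prefer more tiles.
    """
    target = width / height
    best, best_cost = (1, 1), (float("inf"), 0)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Compare ratios in log space so 2:1 and 1:2 are penalised equally.
            distortion = abs(math.log((cols / rows) / target))
            # On equal distortion, the grid with more tiles (more detail) wins.
            cost = (distortion, -(cols * rows))
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best


# Example screen sizes in pixels (illustrative values).
screens = {
    "phone (portrait)": (1170, 2532),
    "tablet (portrait)": (2048, 2732),
    "TV (landscape)": (1920, 1080),
}

for name, (w, h) in screens.items():
    cols, rows = choose_grid(w, h)
    print(f"{name}: {cols} x {rows} tiles")
```

A tall phone screen ends up with a tall grid and a widescreen TV with a wide one, while the tile budget keeps the processing cost bounded on every platform.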
Test results show that the system transfers well between some platforms: models trained on iPhone data reached 68% accuracy on iPad and 71% on Android devices. Challenges remain when moving between mobile devices and TV or web interfaces, primarily because interface layouts differ so much across those platforms.
Competition in UI-interaction AI is intensifying. Anthropic recently upgraded Claude 3.5 Sonnet's UI interaction capabilities, while Microsoft has open-sourced OmniParser, a tool for converting screen content into structured data.
Apple has also introduced the CAMPHOR framework, in which specialized AI agents are coordinated by a master reasoning agent, further improving the handling of complex tasks. This could let future voice assistants such as Siri complete complex tasks like restaurant reservations without manual intervention from the user.
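The coordination pattern described here can be sketched in a few lines of Python. Everything in the example is hypothetical: the `ReasoningAgent` class, the specialist functions, and the hard-coded plan are placeholders meant to show the shape of the idea, not CAMPHOR's actual design or API.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Agent = Callable[[Context], Context]


# Specialized agents: each handles one narrow sub-task (placeholder logic).
def find_restaurant(context: Context) -> Context:
    return {**context, "restaurant": f"{context['cuisine']} restaurant"}


def check_calendar(context: Context) -> Context:
    return {**context, "time": "19:00"}  # fixed placeholder time slot


def book_table(context: Context) -> Context:
    summary = f"booked {context['restaurant']} at {context['time']}"
    return {**context, "status": summary}


class ReasoningAgent:
    """Coordinator that breaks a request into steps and dispatches them."""

    def __init__(self, specialists: Dict[str, Agent]) -> None:
        self.specialists = specialists

    def plan(self, request: str) -> List[str]:
        # Hard-coded plan for illustration; a real coordinator would
        # derive the steps with a language model.
        text = request.lower()
        if "book" in text or "reserve" in text:
            return ["search", "calendar", "booking"]
        return []

    def run(self, request: str, context: Context) -> Context:
        for step in self.plan(request):
            context = self.specialists[step](context)
        return context


coordinator = ReasoningAgent({
    "search": find_restaurant,
    "calendar": check_calendar,
    "booking": book_table,
})
result = coordinator.run("Book a table for dinner", {"cuisine": "Italian"})
print(result["status"])
```

Keeping each specialist narrow means the coordinator only has to decide which steps to run and in what order, which is what lets a single spoken request fan out into several actions.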
These advances raise the level of intelligence in cross-device operation and point the way for the next generation of human-computer interaction. As the technology matures, smarter and more natural interaction is coming within reach.