A multimodal model for visual grounding of GUI instructions.
Aria-UI is a large multimodal model designed for visual grounding of GUI instructions: given a screenshot and an instruction, it locates the on-screen element the instruction refers to. It takes a purely visual approach and does not rely on auxiliary inputs such as an accessibility tree (AXTree). To handle the variety of planning instructions it may receive, it is trained on diverse, high-quality instruction samples synthesized for the task. Aria-UI reports new state-of-the-art results on both offline and online agent benchmarks, outperforming vision-only and AXTree-reliant baselines.
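To make the purely visual setup concrete, the sketch below shows one step an agent typically needs after such a model answers: turning a predicted point into pixel coordinates for a click. It assumes the model returns a point on a normalized grid from 0 to `scale`; the function name `to_pixels` and the default `scale=1000` are illustrative assumptions, not Aria-UI's documented interface.

```python
def to_pixels(norm_x: float, norm_y: float, width: int, height: int,
              scale: int = 1000) -> tuple[int, int]:
    """Map a point on a normalized 0..scale grid to pixel coordinates.

    Hypothetical helper: the normalized output convention and the default
    scale are assumptions made for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("screenshot dimensions must be positive")
    if not (0 <= norm_x <= scale and 0 <= norm_y <= scale):
        raise ValueError("normalized point lies outside the grid")
    # Scale to the screenshot size, then clamp so the point stays on screen.
    x = min(width - 1, int(norm_x * width / scale))
    y = min(height - 1, int(norm_y * height / scale))
    return x, y


# Example: a predicted point (500, 250) on a 1920x1080 screenshot.
click_x, click_y = to_pixels(500, 250, width=1920, height=1080)
```

Because only the screenshot size is needed, the same conversion works on any platform without access to the page structure, which is the practical benefit of a vision-only approach.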