Mobile-Agent

Autonomous Multi-Modal Mobile Device Agent

Tags: Common Product, Productivity, Autonomous, Multi-Modal
Mobile-Agent is an autonomous multi-modal mobile device agent built on Multi-Modal Large Language Model (MLLM) technology. It first uses visual perception tools to accurately recognize and locate the visual and textual elements on an application's front-end interface. Based on this perceived visual environment, it then autonomously plans and decomposes complex operational tasks and navigates mobile applications step by step. Unlike previous solutions that rely on application-specific XML files or mobile system metadata, Mobile-Agent's vision-centric approach adapts to a wide range of mobile operating environments without system-specific customization.

To evaluate Mobile-Agent's performance, the authors introduced Mobile-Eval, a benchmark for mobile device operations, and conducted a comprehensive evaluation on it. Experimental results show that Mobile-Agent achieves high accuracy and task-completion rates, and it can fulfill even challenging instructions such as multi-app operations.
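To make the perceive-plan-act loop described above concrete, here is a minimal sketch in Python. It is an illustration of the general technique, not Mobile-Agent's actual implementation: it assumes the device is reachable via adb, and `query_mllm` is a hypothetical placeholder for whatever vision-language model endpoint is used; the real system additionally relies on dedicated visual perception tools to ground on-screen elements.

```python
# Minimal perceive-plan-act loop sketch (assumptions: adb is installed
# and a device is connected; query_mllm is a hypothetical stand-in for
# a real MLLM call, not Mobile-Agent's actual API).
import base64
import subprocess


def capture_screenshot() -> bytes:
    """Grab the current device screen as PNG bytes via adb."""
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout


def query_mllm(instruction: str, screenshot_png: bytes) -> dict:
    """Placeholder for the multi-modal LLM call.

    A real implementation would send the instruction plus the
    base64-encoded screenshot to a vision-language model and ask it
    to return the next action as JSON, for example
    {"action": "tap", "x": 540, "y": 960} or {"action": "done"}.
    """
    _ = base64.b64encode(screenshot_png).decode()  # payload you would send
    return {"action": "done"}  # stub so the sketch runs end to end


def execute(action: dict) -> None:
    """Translate a model-proposed action into an adb input event."""
    if action["action"] == "tap":
        subprocess.run(
            ["adb", "shell", "input", "tap",
             str(action["x"]), str(action["y"])],
            check=True,
        )


def run_agent(instruction: str, max_steps: int = 10) -> None:
    """Perceive the screen, ask the model for the next step, act, repeat."""
    for _ in range(max_steps):
        action = query_mllm(instruction, capture_screenshot())
        if action["action"] == "done":
            break
        execute(action)


if __name__ == "__main__":
    run_agent("Open the clock app and set an alarm for 7 AM")
```

Because the loop only consumes screenshots and emits generic input events, it needs no application XML or system metadata, which is the adaptability advantage the vision-centric design claims.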
