CogAgent-9B, the foundational model behind GLM-PC from Zhipu AI, has now been open-sourced to promote the development of the large-model Agent ecosystem. CogAgent-9B is a specialized Agent task model trained on top of GLM-4V-9B: given any task specified by the user, a screenshot, and the history of previous actions, it predicts the next GUI operation. This generality allows the model to be applied across a wide range of GUI interaction scenarios, including personal computers, mobile phones, and in-car devices.
Compared to the first version of CogAgent open-sourced in December 2023, CogAgent-9B-20241220 shows significant improvements in GUI perception, reasoning and prediction accuracy, completeness of the action space, task generality, and generalization, and it supports bilingual (Chinese and English) interaction through screenshots. The model's input consists solely of the user's natural-language instruction, the record of previously executed actions, and GUI screenshots; no textual layout information or additional element tags are required. Its output comprises four parts: the thought process, a natural-language description of the next action, a structured description of the next action, and a sensitivity assessment of that action.
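To make this input/output contract concrete, below is a minimal, hypothetical inference sketch. It assumes CogAgent-9B-20241220 can be loaded through transformers' trust_remote_code path in the same way as GLM-4V-9B; the exact prompt template (task, platform tag, action history, and answer-format directive) is defined in the official repository and may differ from this simplified example.

```python
# Hypothetical sketch of single-step CogAgent inference, assuming the
# GLM-4V-9B-style loading pattern; consult the official repo for the
# authoritative prompt template and generation settings.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# Inputs: the user's task in natural language, the executed action history,
# and a GUI screenshot -- no layout metadata or element tags are needed.
task = "Open the settings page and enable dark mode."
history = ""  # previously executed actions; empty on the first step
screenshot = Image.open("screenshot.png").convert("RGB")

# Assumed prompt assembly; the repo documents the exact template.
query = (
    f"Task: {task}\n"
    f"History: {history}\n"
    "(Platform: WIN)\n"
    "(Answer in Action-Operation-Sensitive format.)"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": screenshot, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# The decoded response carries the four output parts described above:
# thought process, natural-language action, structured action, sensitivity.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```

In a multi-step setting, the structured action from each response would be executed on the device, appended to the history string, and a fresh screenshot would be captured before the next call.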
In performance testing, CogAgent-9B-20241220 achieved leading results across multiple benchmarks, demonstrating its strengths in GUI grounding, single-step operations, Chinese step-wise rankings, and multi-step operations. This release by Zhipu AI not only advances large-model technology but also offers new tools and possibilities for visually impaired IT practitioners.
Code:
https://github.com/THUDM/CogAgent
Model:
Huggingface: https://huggingface.co/THUDM/cogagent-9b-20241220
ModelScope Community: https://modelscope.cn/models/ZhipuAI/cogagent-9b-20241220