CogAgent-9B-20241220 is built on GLM-4V-9B, a bilingual open-source visual language model. Through data collection and optimization, multi-stage training, and strategy improvements, it achieves significant advances in GUI perception, inference prediction accuracy, action-space completeness, and task generalization. The model supports bilingual (Chinese and English) interaction and accepts both screenshots and natural-language input. The current version has been deployed in ZhipuAI's GLM-PC product, with the aim of helping researchers and developers advance the study and application of GUI agents based on visual language models.