Tsinghua University, Zhejiang University, and other prestigious institutions have driven the development of open-source alternatives to GPT-4V, producing a series of high-performance open-source visual models in China. Among these, LLaVA, CogAgent, and BakLLaVA have attracted significant attention. LLaVA demonstrates capabilities approaching GPT-4 in visual chat and reasoning-based question answering; CogAgent is an improved open-source visual-language model built on CogVLM; and BakLLaVA augments the Mistral 7B base model with the LLaVA 1.5 architecture, offering better performance and commercial viability. These open-source visual models hold immense potential in the field of visual processing.