ZhipuAI recently released its latest foundation model, GLM-4-Plus, alongside new visual models, showcasing capabilities comparable to OpenAI's GPT-4, and announced availability starting August 30th. The release marks a significant leap for Chinese AI technology and brings users a new level of intelligent experience.

Key Update Highlights:

Language foundation model GLM-4-Plus: a qualitative leap in language understanding, instruction following, and long-text processing, maintaining a leading position among international peers.

Text-to-image model CogView-3-Plus: performance on par with top-tier industry models such as MJ-V6 (Midjourney V6) and FLUX.

Image/video understanding model GLM-4V-Plus: excels not only at image understanding but also at video understanding based on temporal-sequence analysis. The model will be launched on the open platform bigmodel.cn, becoming the first general-purpose video understanding model API in China.

Video generation model CogVideoX: following the release and open-sourcing of the 2B version, the 5B version has now been officially open-sourced as well, with significantly enhanced performance, making it a leading choice among open-source video generation models.

Zhipu's open-source models have now been downloaded more than 20 million times in total, a significant contribution to the flourishing of the open-source community.

GLM-4-Plus excels in several key areas. On language, the model has reached an internationally leading level in understanding, instruction following, and long-text processing, comparable to GPT-4 and the 405B-parameter Llama 3.1. Notably, GLM-4-Plus significantly strengthens long-text inference through a carefully calibrated mix of short- and long-text training data.

In visual intelligence, GLM-4V-Plus demonstrates excellent image and video understanding. It not only has temporal awareness but can also process and understand complex video content. Notably, the model will be launched on Zhipu's open platform as the first general-purpose video understanding model API in China, giving developers and researchers a powerful tool.

For example, given a video like the one below, if you ask what the player in the green jersey did throughout the clip, the model can accurately describe the player's actions and even pinpoint the second at which the highlight moments occur:

(Screenshot from the official announcement)
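As a rough sketch of what querying this model could look like once the API is live on bigmodel.cn: the model identifier, the `video_url` content type, and the exact request shape below are assumptions modeled on Zhipu's existing `zhipuai` Python SDK, not confirmed details of the release.

```python
# Hypothetical sketch: asking GLM-4V-Plus about a video via Zhipu's open
# platform. The model ID and "video_url" content type are assumptions.
from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key="your-api-key")  # key issued on bigmodel.cn

response = client.chat.completions.create(
    model="glm-4v-plus",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url",
                 "video_url": {"url": "https://example.com/match.mp4"}},
                {"type": "text",
                 "text": "What does the player in the green jersey do, "
                         "and at which second is the highlight moment?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```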

ZhipuAI has also made breakthroughs on the generative side. In text-to-image performance, CogView-3-Plus approaches the best models such as MJ-V6 and FLUX. Meanwhile, the video generation model CogVideoX has launched a more powerful 5B version, regarded as the strongest choice among current open-source video generation models.
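For the text-to-image side, a minimal sketch of generating an image through the same open-platform SDK might look like the following; the `cogview-3-plus` model identifier is an assumption, since the SDK's documented image call predates this release.

```python
# Hypothetical sketch: text-to-image with CogView-3-Plus through Zhipu's
# open platform SDK. The model identifier is an assumption.
from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key="your-api-key")

response = client.images.generations(
    model="cogview-3-plus",  # assumed model identifier
    prompt="A watercolor painting of a lighthouse at dawn",
)
print(response.data[0].url)  # URL of the generated image
```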

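Because CogVideoX-5B is open-sourced, it can also be run locally. Below is a minimal sketch using Hugging Face diffusers, which ships a `CogVideoXPipeline`; the `THUDM/CogVideoX-5b` weights ID and the generation parameters here are assumptions that may need tuning for your hardware.

```python
# Minimal sketch: text-to-video with the open-source CogVideoX-5B weights
# via Hugging Face diffusers (requires a release that includes
# CogVideoXPipeline). Parameters are illustrative, not tuned.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU to save VRAM

video = pipe(
    prompt="A panda playing guitar in a bamboo forest, cinematic lighting",
    num_inference_steps=50,
    num_frames=49,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```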
Most anticipated is the upcoming "video call" feature in Zhipu's Qingyan app, the first AI video call feature open to consumer (C-end) users in China. It spans text, audio, and video modalities and supports real-time inference: users can hold smooth conversations with the AI, and even with frequent interruptions the AI reacts quickly.

Even more striking, simply turning on the camera lets the AI see and understand what the user sees and accurately execute voice commands.

This revolutionary video call feature will launch on August 30th, initially available to a subset of Qingyan users, with external applications accepted. The innovation not only showcases ZhipuAI's technical strength but also opens new possibilities for deeply integrating artificial intelligence into daily life.

Reference: https://mp.weixin.qq.com/s/Ww8njI4NiyH7arxML0nh8w