2025 Artificial Intelligence (AI) Events Timeline

A comprehensive chronicle of 2025's key milestones, technological breakthroughs, product launches, and industry developments in Artificial Intelligence (AI)

March

🔥 gpt-4o-transcribe

OpenAI

gpt-4o-transcribe is OpenAI's new in-house speech-to-text model, effectively an upgraded successor to Whisper, the open-source speech-to-text model OpenAI released two years earlier. It aims to deliver a lower word error rate and stronger overall performance. In tests across 33 industry-standard languages, gpt-4o-transcribe showed a markedly lower error rate than Whisper; in English, its word error rate drops as low as 2.46%. OpenAI provides a demo site, OpenAI.fm, where individual users can try it out.
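
As a reference, a minimal sketch of a transcription call through the OpenAI Python SDK; the call mirrors the existing Whisper transcription endpoint, and the audio file name is a placeholder:

```python
# Minimal sketch of calling gpt-4o-transcribe through the OpenAI Python SDK.
# "meeting.mp3" is a placeholder file; the model name comes from the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```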

Audio
Mar 21

🔥 Step-Video-TI2V

StepFun

Step-Video-TI2V is an advanced image-to-video model developed by Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd. (StepFun). Built on the 30B-parameter Step-Video-T2V, it generates videos up to 102 frames long from text and image inputs. Its core advantages are controllable motion amplitude and controllable camera movement, balancing dynamism and stability in the generated videos. It also excels at anime-style videos, making it well suited to animation creation, short-video production, and similar applications.

Video
Mar 20

🔥 Mistral Small 3.1

Mistral AI

Mistral AI, a French AI startup, has released its latest open-source model, Mistral Small 3.1. Mistral-Small-3.1-24B-Base-2503 is an advanced open-source model with 24 billion parameters, supporting multilingual capabilities and long-context processing for text and vision tasks. It's the base model of Mistral Small 3.1, boasting strong multi-modal capabilities suitable for enterprise needs.

Multimodal
Mar 18

🔥 ERNIE 4.5 and X1

Baidu

Baidu released its ERNIE (Wenxin) 4.5 and X1 large language models, with significantly reduced prices.

Language
Mar 16

🔥 Gemma 3

Google

Gemma 3 is a family of lightweight, state-of-the-art open models built upon Gemini 2.0 technology and designed for on-device execution. It demonstrates superior performance among similarly sized models, supporting over 140 languages and boasting advanced text and visual reasoning capabilities. Gemma 3 offers a 128k-token context window, supports function calling for handling complex tasks, and includes quantized versions for enhanced performance and reduced computational demands. Developed with a strong emphasis on safety, it aligns with rigorous data governance and security policies to ensure responsible development and use. The launch of Gemma 3 further promotes the accessibility and application of AI technology, providing developers with powerful tools to create diverse AI applications.
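
As a rough illustration, a Gemma 3 instruction-tuned checkpoint can be run locally with Hugging Face Transformers; the model id google/gemma-3-1b-it (assumed to be the text-only 1B variant) and the chat-style call follow the family's published conventions:

```python
# Hedged sketch: local text generation with a Gemma 3 instruct checkpoint.
# "google/gemma-3-1b-it" is assumed to be the text-only 1B variant's id.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "In one sentence, what does a 128k-token context window enable?"}
]
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])  # assistant reply
```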

Multimodal
Mar 12

🔥 Gemini Robotics

Google DeepMind

Gemini Robotics is an advanced Vision-Language-Action (VLA) model built upon Gemini 2.0 and designed specifically for robotics. It brings AI into the physical world through multimodal reasoning, enabling robots to perform a wider range of real-world tasks. The model is versatile, adapting to different situations and solving diverse tasks; interactive, understanding and responding quickly to everyday language instructions; and dexterous, capable of precise manipulations such as origami or snack packaging.

Multimodal
Mar 12

🔥 OpenAI Agents SDK

OpenAI

The OpenAI Agents SDK is a lightweight, easy-to-use toolkit for building agent-based AI applications, and a production-ready upgrade of OpenAI's earlier agent experiment, Swarm. The SDK provides a small set of fundamental building blocks: agents (LLMs equipped with instructions and tools), handoffs for delegating tasks between agents, and guardrails for validating agent inputs. Combined with Python, these primitives can express complex relationships between tools and agents, enabling practical applications without a steep learning curve. The SDK also includes built-in tracing to help users visualize and debug agent workflows, evaluate them, and even fine-tune models for the application. Its key advantages are a minimal set of building blocks that is quick to learn, and out-of-the-box functionality that still allows customization of specific behaviors. It represents a significant advance in OpenAI's agent work, giving developers an efficient and flexible tool for building agent-based AI applications.
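
A minimal sketch of these building blocks using the openai-agents Python package; the triage/support pair is an invented example of the handoff primitive:

```python
# Minimal sketch using the OpenAI Agents SDK (pip install openai-agents).
# The triage/support split is a made-up example demonstrating handoffs.
from agents import Agent, Runner

support_agent = Agent(
    name="Support",
    instructions="Resolve billing questions politely.",
)

triage_agent = Agent(
    name="Triage",
    instructions="Route billing questions to the Support agent.",
    handoffs=[support_agent],  # inter-agent task delegation
)

result = Runner.run_sync(triage_agent, "I was charged twice this month.")
print(result.final_output)
```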

Language
Mar 11

Mistral OCR

Mistral AI

Mistral OCR is an Optical Character Recognition (OCR) API focused on document understanding. It recognizes every element of a document, including text, images, tables, and equations, with high accuracy, extracting structured text and image content from image and PDF inputs. It supports multimodal document processing and leads the industry in complex document understanding. Its significance lies in unlocking the collective intelligence of digital information: transforming the vast amounts of organizational data stored as documents into actionable knowledge, and driving innovation.
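
As a hedged sketch, an OCR request with the mistralai Python client might look like the following; client.ocr.process, the mistral-ocr-latest model name, and the page-level markdown output follow Mistral's documented pattern, but treat them as assumptions:

```python
# Hedged sketch of an OCR call with the mistralai Python client.
# Model name and call shape follow Mistral's docs at launch; verify before use.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/sample.pdf",  # placeholder URL
    },
)

# Each page comes back as structured markdown with extracted tables and images.
for page in response.pages:
    print(page.markdown)
```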

Multimodal
Mar 6

🔥 QwQ-32B

Alibaba

QwQ-32B is a 32-billion-parameter reasoning model enhanced through large-scale reinforcement learning (RL) to enable deep thinking and complex reasoning. It integrates agent-related capabilities, allowing it to think critically while using tools and to adjust its reasoning based on environmental feedback. The model performs exceptionally well in mathematical reasoning, programming, and general tasks, rivaling the 671-billion-parameter DeepSeek-R1. This showcases the potential of reinforcement learning for strengthening large language models and suggests a possible path toward artificial general intelligence.
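
A rough local-inference sketch with Hugging Face Transformers; Qwen/QwQ-32B is the published checkpoint name, and the generation settings are illustrative rather than the team's recommended configuration:

```python
# Rough sketch: local inference with QwQ-32B via Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the model's reasoning and answer).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```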

Language
Mar 6

🔥 Manus

Manus

Manus is a general-purpose AI agent that connects thought and action: it not only thinks but also delivers results. Manus excels at handling a wide variety of tasks in both work and life, enabling you to get things done while you rest. It provides efficient and convenient services to users by integrating information and generating customized solutions. The importance of Manus lies in its ability to save users time and effort through automation and intelligence, while simultaneously providing high-quality analysis and decision support.

Multimodal
Mar 5

CogView4

THUDM

CogView4 is a text-to-image generation system based on diffusion models, supporting Chinese input and Chinese text-to-image generation. It utilizes a cascaded diffusion framework and Diffusion Transformer technology to generate high-quality images. This model has demonstrated excellent performance in multiple benchmark tests, particularly exhibiting unique advantages in Chinese text generation.

Image
Mar 4

February

🔥 Claude 3.7 Sonnet

Anthropic

Claude 3.7 Sonnet is Anthropic's latest hybrid reasoning model, offering both rapid responses and deep thinking. API users can adjust how long the model deliberates before answering. Claude 3.7 Sonnet excels in coding and front-end development, and its extended reasoning mode brings significant gains in mathematics, physics, instruction following, and programming. It performs well in both standard and extended reasoning modes, letting users trade response speed against quality as needed. Anthropic aims to provide a seamless user experience through a single unified reasoning model, and Claude 3.7 Sonnet embodies that philosophy, optimizing the LLM capabilities most used in real-world applications rather than focusing solely on benchmark tasks.

Multimodal
Feb 25

🔥 Claude Code

Anthropic

Claude Code is an intelligent programming tool integrated directly into your terminal, enabling developers to write code faster using natural language commands. It integrates seamlessly with your development environment, requiring no additional servers or complex setup. Capabilities include editing files, debugging code, answering questions about code architecture and logic, running tests, and performing code reviews. The significance of Claude Code lies in its ability to dramatically improve developer efficiency while lowering the barrier to entry through natural language interaction. The product is powered by Anthropic's claude-3-7-sonnet-20250219 model, offering robust code understanding and generation capabilities.

Language
Feb 25

🔥 QwQ-Max-Preview

Alibaba

QwQ-Max-Preview is a preview version built upon Qwen2.5-Max, belonging to the Tongyi Qianwen family. It excels in deep reasoning, mathematics, programming, and agent-related tasks. This product is planned for open-source release under the Apache 2.0 license in the near future, aiming to advance intelligent reasoning technology and foster community-driven innovation through open-source collaboration. Future releases will include a Qwen Chat app and smaller reasoning models (such as QwQ-32B) to cater to diverse user needs.

Language
Feb 25

🔥 Wan AI

Alibaba

Wan AI is an advanced and powerful visual generation model developed by Alibaba Group's Tongyi Lab. It can generate videos based on text, images, and other control signals. The Wan 2.1 series models are now fully open-source. This product represents the cutting edge of AI in visual content generation, boasting significant innovation and application value. Its key advantages include powerful visual generation capabilities, support for diverse input signals, and its open-source nature, enabling developers and creators to leverage the platform for flexible creative development and content creation.

Video
Feb 25

🔥 PaliGemma 2 mix

Google

PaliGemma 2 mix is a versatile vision-language model developed by Google, representing an upgraded version within the Gemma family. This model excels in handling a wide array of vision-language tasks, including image segmentation, video captioning, scientific question answering, and text-related tasks. It provides pre-trained checkpoints in various sizes (3B, 10B, and 28B parameters) and supports multiple resolutions (224px and 448px), empowering developers to select the most suitable model based on their specific needs. Furthermore, PaliGemma 2 mix offers broad framework support, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp. Its versatility and ease of use make it a powerful tool for a diverse range of vision-language applications.

Multimodal
Feb 19

🔥 Mistral Saba

Mistral AI

Mistral Saba is Mistral AI's first regional language model specifically tailored for the Middle East and South Asia. Boasting 24B parameters, it's trained on a carefully curated dataset of Middle Eastern and South Asian languages. This allows it to deliver more accurate and relevant responses than models five times its size, while also being faster and more cost-effective. The model supports Arabic and various languages of Indian origin, with particular proficiency in South Indian languages such as Tamil. Available both through API and for on-premise deployment within secure customer environments, it's suitable for single-GPU systems and offers response speeds exceeding 150 tokens per second.

Language
Feb 17

🔥 Grok 3

xAI

Grok 3 is the latest flagship AI model developed by xAI, designed for image analysis and question answering, and supporting various features of xAI's social network, X. It's a family of models, comprising versions like Grok 3 mini, Grok 3 Reasoning, and Grok 3 mini Reasoning. Grok 3 excels in several benchmark tests, surpassing GPT-4o in areas such as AIME (mathematics problems) and GPQA (graduate-level physics, biology, and chemistry questions). Its reasoning model is capable of fact-checking to avoid common errors, similar to OpenAI's o3-mini and DeepSeek's R1. Furthermore, Grok 3 supports AI-driven research through the DeepSearch feature in the Grok application, scanning the internet and the X social network to provide information summaries. The development of Grok 3 involved significant computational resources, including approximately 200,000 GPUs in a Memphis-based data center, and its training dataset includes legal documents, among other sources.

Multimodal
Feb 17

Goku

ByteDance

Goku is a flow-based foundation model for video generation, focused on generating videos from text. Using advanced generative techniques, it produces high-quality video content from text prompts across a variety of scenes and styles. Its significance lies in providing efficient content-generation solutions for video creation, advertising, and other fields, reducing production costs and enhancing content diversity. Goku+ is a derivative version optimized specifically for advertising scenarios, generating video content better suited to advertising needs.

Video
Feb 10

🔥 Gemini 2.0

Google

Gemini 2.0 is Google's latest generation of generative AI models and a significant advance for the company in the field. With powerful language and multimodal generation capabilities, it offers developers efficient, flexible solutions suitable for a wide range of complex scenarios.

Multimodal
Feb 5

🔥 OpenAI Deep Research

OpenAI

Deep Research is an intelligent agent feature developed by OpenAI that can accomplish complex, multi-step research tasks in a short amount of time. It searches the internet and analyzes large volumes of information to provide users with comprehensive reports comparable to those of professional analysts. This tool is optimized based on the upcoming OpenAI o3 model and can process text, images, and PDF files. It is designed for users requiring in-depth research, such as professionals in finance, science, policy, and engineering, as well as consumers seeking personalized recommendations.

Multimodal
Feb 2

January

🔥 OpenAI o3-mini

OpenAI

OpenAI o3-mini is the latest reasoning model released by OpenAI, optimized specifically for science, technology, engineering, and mathematics (STEM). It delivers strong reasoning capabilities while maintaining low cost and low latency, excelling particularly in mathematics, science, and programming. The model supports developer features such as function calling and structured outputs, and lets users select different reasoning-effort levels according to their needs.

Language
Jan 31

🔥 Mistral Small 3

Mistral AI

Mistral Small 3 is an open-source language model introduced by Mistral AI, featuring 24 billion parameters and licensed under Apache 2.0. The model is designed for low latency and high performance, making it suitable for generative AI tasks that require rapid responses. It achieves 81% accuracy on the Massive Multitask Language Understanding (MMLU) benchmark and generates text at around 150 tokens per second.

Language
Jan 30

🔥 ChatGPT Gov

OpenAI

ChatGPT Gov is a tailored version of the AI model developed by OpenAI for U.S. government agencies. It aims to assist these organizations in effectively leveraging AI technology to tackle complex challenges. Built on OpenAI's cutting-edge technology, it supports government efforts in areas like public health, infrastructure, and national security, while adhering to stringent cybersecurity and compliance standards.

Multimodal
Jan 28

🔥 Janus-Pro

DeepSeek

Janus-Pro is an advanced multimodal model developed by the DeepSeek team, focusing on unified multimodal understanding and generation tasks. It addresses the conflicts encountered in traditional models for understanding and generation tasks by decoupling the visual encoding path. Based on a powerful Transformer architecture, this model can manage complex multimodal tasks, such as visual question answering and image generation.

Multimodal
Jan 27

Anthropic API Citations

Anthropic

The Citations feature of the Anthropic API enables the Claude model to reference the exact sentences and paragraphs from source documents that it used while generating a response. This not only improves the verifiability and credibility of answers but also reduces the model's tendency to hallucinate.
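
A hedged sketch of enabling citations with the Anthropic Python SDK; the document content block with citations enabled follows Anthropic's announced API shape, and the document text and model alias are placeholders:

```python
# Hedged sketch of the Citations feature via the Anthropic Python SDK.
# The document block shape follows Anthropic's announcement; verify before use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": "The grass is green. The sky is blue.",  # placeholder
                },
                "citations": {"enabled": True},
            },
            {"type": "text", "text": "What color is the grass?"},
        ],
    }],
)

# Cited responses interleave text blocks with citation metadata.
for block in response.content:
    print(block.text, getattr(block, "citations", None))
```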

Language
Jan 24

FireRedASR

Xiaohongshu

FireRedASR is an open-source, industrial-grade Mandarin automatic speech recognition (ASR) model family designed to meet the diverse needs for outstanding performance and optimal efficiency across various applications. It includes two variants: FireRedASR-LLM and FireRedASR-AED. The significance of this technology lies in advancing the field of speech recognition, providing efficient and accurate solutions for industrial-grade applications.

Audio
Jan 24

🔥 Operator

OpenAI

Operator is an intelligent agent product launched by OpenAI. By combining the visual capabilities of GPT-4o with advanced reasoning through reinforcement learning, it can interact with graphical user interfaces like a human. It is capable of handling various repetitive browser tasks, such as filling out forms and ordering groceries, helping users save time.

Multimodal
Jan 23

🔥 CUA

OpenAI

The Computer-Using Agent (CUA) is an advanced artificial intelligence model developed by OpenAI, integrating the visual capabilities of GPT-4o with sophisticated reasoning skills acquired through reinforcement learning. It can interact with graphical user interfaces (GUIs) like a human, without relying on specific operating system APIs or web interfaces. The flexibility of the CUA allows it to perform tasks across various digital environments, such as filling out forms and browsing the web.

Multimodal
Jan 23

🔥 Doubao-1.5-pro

ByteDance

Doubao-1.5-pro is a high-performance sparse MoE (Mixture of Experts) large language model developed by the Doubao team. This model achieves an optimal balance between performance and inference efficiency through an integrated design for training and inference. It excels in various public benchmarking tests, particularly showing significant advantages in inference efficiency and multimodal capabilities. This model is well-suited for scenarios that require efficient inference and multimodal interaction, such as natural language processing, image recognition, and voice interaction.

Multimodal
Jan 22

UI-TARS

ByteDance

UI-TARS is a new kind of GUI agent model developed by ByteDance, focused on seamless interaction with graphical user interfaces through human-like perception, reasoning, and action. It integrates key components such as perception, reasoning, grounding, and memory into a single vision-language model, enabling end-to-end task automation without predefined workflows or manual rules.

Multimodal
Jan 22

Hunyuan3D 2.0

Tencent

Hunyuan3D 2.0 is an advanced large-scale 3D synthesis system developed by Tencent, focusing on generating high-resolution textured 3D assets. The system consists of two core components: the large-scale shape generation model Hunyuan3D-DiT and the large-scale texture synthesis model Hunyuan3D-Paint. By decoupling the challenges of shape and texture generation, it provides users with a flexible platform for creating 3D assets.

Image
Jan 21

🔥 DeepSeek-R1

DeepSeek

DeepSeek-R1 is the first-generation reasoning model from the DeepSeek team. Trained through large-scale reinforcement learning, it demonstrates exceptional reasoning capabilities without supervised fine-tuning, performing on par with OpenAI's o1 in mathematics, coding, and reasoning tasks. DeepSeek-R1 also ships with a range of distilled models for different scale and performance requirements.
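
As a hedged sketch, R1 can be called through DeepSeek's OpenAI-compatible chat API; the base URL, the deepseek-reasoner model name, and the separate reasoning_content field follow DeepSeek's published documentation, but verify them before relying on this:

```python
# Hedged sketch: calling DeepSeek-R1 via DeepSeek's OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# R1 exposes its chain of thought separately from the final answer.
print(response.choices[0].message.reasoning_content)  # reasoning trace
print(response.choices[0].message.content)            # final answer
```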

Language
Jan 20

🔥 Kimi k1.5

Moonshot AI

The Kimi k1.5 is a multimodal language model developed by MoonshotAI. It significantly enhances performance in complex reasoning tasks through reinforcement learning and long-context extension techniques. The model has achieved industry-leading results across multiple benchmarks, surpassing GPT-4o and Claude Sonnet 3.5 in mathematical reasoning tasks, such as AIME and MATH-500.

Language
Jan 20

🔥 Trae

ByteDance

Trae is an AI-driven Integrated Development Environment (IDE) designed for developers. It enhances coding efficiency through features like intelligent code completion, multimodal interaction, and contextual analysis of the entire codebase.

Language
Jan 20

🔥 Ray2

Luma AI

Luma AI has launched the Ray2 video generation model, achieving faster and more natural motion effects. It primarily supports text-to-video functionality, allowing users to input descriptions to generate short videos ranging from 5 to 10 seconds.

Video
Jan 16

FLUX Pro Finetuning API

Black Forest Labs

The FLUX Pro Finetuning API, developed by Black Forest Labs, is a customizable tool for generative text-to-image modeling. It allows users to fine-tune the FLUX Pro model using a small set of example images (1 to 5) to produce high-quality image content that aligns with specific branding, style, or visual requirements.

Image
Jan 16

🔥 moonshot-v1-vision-preview

Moonshot AI

The Kimi Visual Model is an advanced image understanding technology provided by the Moonshot AI open platform. It accurately recognizes and comprehends elements within images, such as text, colors, and object shapes, offering users powerful visual analysis capabilities.

Image
Jan 15

🔥 MiniMax-01 series

MiniMax

The MiniMax-01 series is an open-source model family released by MiniMax, comprising MiniMax-Text-01 and MiniMax-VL-01. The series is the first to implement a lightning attention mechanism at large scale, delivering performance comparable to the world's leading models. It efficiently handles ultra-long contexts of up to 4 million tokens, positioning it as a pioneer for the era of AI agents.

Multimodal
Jan 15

ReaderLM v2

Jina AI

ReaderLM v2, developed by Jina AI, is a lightweight language model with 1.5 billion parameters, specifically designed for HTML to Markdown conversion and HTML to JSON extraction, boasting exceptional accuracy. This model supports 29 languages and can handle input and output combinations of up to 512K tokens in length.
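
A rough sketch of the HTML-to-Markdown task with Hugging Face Transformers; the checkpoint id jinaai/ReaderLM-v2 matches the published model, while the instruction phrasing is a simplification of Jina's examples:

```python
# Rough sketch: HTML-to-Markdown conversion with ReaderLM v2 via Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
messages = [{
    "role": "user",
    "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n\n{html}",
}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```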

Language
Jan 15

🔥 Codestral 25.01

Mistral AI

Codestral 25.01 is an advanced coding model from Mistral AI, representing the state of the art in programming-focused models. Lightweight and fast, it is proficient in more than 80 programming languages and optimized for low-latency, high-frequency use cases, supporting tasks such as fill-in-the-middle (FIM) code completion, code correction, and test generation.
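
For illustration, a hedged sketch of a FIM completion with the mistralai Python client; client.fim.complete and the codestral-latest alias follow Mistral's documented pattern but should be treated as assumptions:

```python
# Hedged sketch of fill-in-the-middle (FIM) completion with the mistralai client.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.fim.complete(
    model="codestral-latest",
    prompt="def fibonacci(n):\n    ",    # code before the cursor
    suffix="\nprint(fibonacci(10))",      # code after the cursor
    max_tokens=64,
)

print(response.choices[0].message.content)  # the generated middle section
```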

Language
Jan 14

🔥 Wanxiang Yingzao

Alimama

Wanxiang Yingzao is an AI creative design tool launched by Alimama, aimed at helping merchants quickly generate high-quality creative materials to enhance marketing effectiveness. It leverages advanced AI technology to offer various functions, including image-to-video conversion, smart fitting, and copywriting generation, catering to the needs of e-commerce merchants in different marketing scenarios.

Video
Jan 14

🔥 DeepSeek App

DeepSeek

The DeepSeek app has officially launched on both iOS and Android.

Language
Jan 13

🔥 SenseNova Unified Large Model

SenseTime

SenseTime has launched its SenseNova ('Riri Xin') unified large model, significantly enhancing deep reasoning and multimodal processing capabilities.

Language
Jan 10

🔥 Tongyi Wanxiang 2.1

Alibaba

Alibaba's Tongyi Wanxiang video generation model has released an all-new version 2.1.

Video
Jan 9

🔥 Moondream2

Moondream

Moondream2 is a compact vision-language model designed to run efficiently on edge devices.

Multimodal
Jan 9

🔥 OpenBMB PRIME

OpenBMB

Eurus-2-7B-PRIME is similar in spirit to o1 and is trained with the PRIME (Process Reinforcement via Implicit Rewards) method, an open-source online reinforcement learning (RL) recipe that incorporates process rewards to push language models' reasoning beyond mere imitation or distillation. Training starts from Eurus-2-7B-SFT and uses the Eurus-2-RL-Data dataset.

Language
Jan 7

🔥 Nvidia Cosmos

Nvidia

NVIDIA Cosmos™ is a platform comprising state-of-the-art Generative World Foundation Models (WFM), advanced labeling tools, safety measures, and accelerated data processing and management pipelines. It is designed to expedite the development of physical AI systems such as autonomous vehicles (AVs) and robotics.

Video
Jan 6

🔥 J1 Assistant

Jarvis

Jarvis, an artificial intelligence startup founded by Luo Yonghao, has quietly launched an AI assistant app called 'J1 Assistant'. The software is currently available overseas only, as an Android version.

Language
Jan 6