Sundar Pichai, CEO of Google and its parent company Alphabet, has announced the launch of the company's latest artificial intelligence model, Gemini 2.0, an important step forward in Google's development of a general AI assistant. Gemini 2.0 brings significant advances in processing multimodal inputs and using native tools, enabling AI agents to understand the world around them more deeply and to take action on a user's behalf, under their supervision.

Gemini 2.0 builds on its predecessors, Gemini 1.0 and 1.5: Gemini 1.0 was the first model built to be natively multimodal, able to understand information across text, video, images, audio, and code, and Gemini 1.5 extended this with long-context understanding. Today, millions of developers are building with Gemini, prompting Google to rethink its own products, including seven products that each serve 2 billion users, and to create new ones. NotebookLM, which has gained widespread popularity, is one example of what these multimodal and long-context capabilities make possible.


The launch of Gemini 2.0 marks the start of a new agentic era for Google: the model can natively generate image and audio output and natively use tools. Google has begun providing Gemini 2.0 to developers and trusted testers and plans to quickly integrate it into its products, starting with Gemini and Search. Effective immediately, the experimental Gemini 2.0 Flash model is available to all Gemini users. Google has also introduced a new feature called Deep Research, which uses advanced reasoning and long-context capabilities to act as a research assistant, exploring complex topics and compiling reports on a user's behalf. Deep Research is currently available in Gemini Advanced.

Search is one of the products most transformed by AI. Google's AI Overviews now reach 1 billion people, letting them ask entirely new kinds of questions, and have quickly become one of Search's most popular features. As a next step, Google will bring the advanced reasoning capabilities of Gemini 2.0 to AI Overviews to tackle more complex topics and multi-step questions, including advanced mathematical equations, multimodal queries, and coding. Limited testing began this week, with a broader rollout planned for early next year. Google will also continue to expand AI Overviews to more countries and languages over the next year.

Google has also showcased cutting-edge results from its agent research, built on Gemini 2.0's native multimodal capabilities. Gemini 2.0 Flash improves on 1.5 Flash, so far the most popular model among developers, while keeping similarly fast response times. Notably, 2.0 Flash even outperforms 1.5 Pro on key benchmarks at twice the speed. 2.0 Flash also introduces new capabilities: in addition to multimodal inputs such as images, video, and audio, it now supports multimodal outputs, including natively generated images mixed with text and controllable multilingual text-to-speech (TTS) audio. It can also natively call tools such as Google Search, code execution, and third-party user-defined functions.
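To make the tool-calling idea concrete, here is a minimal sketch of how a developer might register a user-defined function with the experimental 2.0 Flash model through the google-generativeai Python SDK. The model name gemini-2.0-flash-exp and the get_exchange_rate helper are illustrative assumptions, not details from Google's announcement.

```python
import google.generativeai as genai

def get_exchange_rate(currency_from: str, currency_to: str) -> float:
    """Hypothetical helper: return the exchange rate from one currency to another.

    A real application would query a rates service; a fixed value keeps the sketch runnable.
    """
    return 0.95

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio

# Register the function as a tool; the SDK derives a schema from its signature and docstring.
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash-exp",  # assumed experimental model name
    tools=[get_exchange_rate],
)

# With automatic function calling, the SDK executes the tool when the model requests it,
# feeds the result back, and returns the final natural-language answer.
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message("How many euros would I get for 100 US dollars?")
print(response.text)
```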


Gemini 2.0 Flash is now available to developers as an experimental model: all developers can use multimodal input and text output through the Gemini API in Google AI Studio and Vertex AI, while text-to-speech and native image generation are limited to early-access partners. General availability will follow in January, along with additional model sizes.
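For the multimodal-input, text-output path that is open to all developers, a request could look roughly like the sketch below, again using the google-generativeai Python SDK; the gemini-2.0-flash-exp model name and the local chart.png file are assumptions for illustration.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio
model = genai.GenerativeModel("gemini-2.0-flash-exp")  # assumed experimental model name

# Combine an image and a text instruction in a single multimodal request;
# the experimental tier returns a text response.
image = Image.open("chart.png")
response = model.generate_content([image, "Describe the main trend shown in this chart."])
print(response.text)
```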

To help developers build dynamic and interactive applications, Google has also released a new Multimodal Live API that supports real-time streaming audio and video input and the use of multiple tools in combination.
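A streaming session against this real-time API might look something like the following sketch, based on the early google-genai Python SDK; the client.aio.live.connect entry point, the v1alpha API version, and the exact send/receive method names are assumptions about the experimental surface and may differ from the released documentation.

```python
import asyncio
from google import genai

# NOTE: the Live API surface below reflects the early experimental google-genai SDK
# and is an assumption; method names may differ in current releases.
client = genai.Client(api_key="YOUR_API_KEY", http_options={"api_version": "v1alpha"})

async def main() -> None:
    config = {"response_modalities": ["TEXT"]}  # audio output can also be requested
    # Open a bidirectional streaming session with the experimental 2.0 Flash model.
    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
        await session.send(input="Give me a one-sentence summary of Gemini 2.0.", end_of_turn=True)
        # Print the model's incremental responses as they stream back.
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```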

Starting today, Gemini users worldwide can access the chat-optimized version of 2.0 Flash by selecting it from the model dropdown menu on desktop and mobile web. It will soon be available in the Gemini mobile app. Early next year, Google plans to expand Gemini 2.0 to more Google products.