In a recent podcast, Demis Hassabis, CEO of Google DeepMind, revealed Google's plan to eventually integrate its Gemini AI model with its video generation model, Veo, to enhance Gemini's understanding of the physical world. He emphasized that Gemini was designed from the outset to be multimodal, aiming for a "universal digital assistant" that genuinely helps users in the real world.

*Image: Google's Gemini large language model*

Hassabis pointed to the AI industry's broader shift toward "omni" models, which can understand and synthesize many forms of media. Google's latest Gemini models, for instance, can generate text, images, and audio, and OpenAI's default model in ChatGPT natively creates images. Amazon, for its part, has announced plans to launch an "any-to-any" model this year.

Developing these omni models requires vast amounts of training data: images, videos, audio, and text. Hassabis hinted that Veo's training data comes primarily from Google's YouTube platform, saying that by watching a large number of YouTube videos, Veo can figure out the physics of the world.

Google has previously said its models "may" be trained on "some" YouTube content, in accordance with its agreements with YouTube creators. Reports indicate that Google broadened its terms of service last year, in part to tap more data for training its AI models, underscoring how aggressively the company is gathering data for its AI efforts.

Google's plan highlights the industry's focus on multimodal AI and hints at where the field is headed. Combining Gemini and Veo could give users richer interactive experiences and help AI assistants integrate more naturally into daily life.

Key Takeaways:

- 🤖 Google plans to integrate Gemini and Veo AI models to improve understanding of the physical world.

- 🎥 Veo's training data comes primarily from YouTube; by watching vast numbers of videos, it learns the physics of the world.

- 🌐 The AI industry is moving toward multimodal "omni" models that can understand and synthesize many forms of media.