Welcome to the [AI Daily] column! This is your daily guide to the world of artificial intelligence. Every day, we bring you the hottest content in the AI field, with a focus on developers, to help you understand technical trends and innovative AI product applications.

1、Google hits back: its "ultimate weapon" takes on GPT-4o, and the Veo video model takes on Sora

Google has recently released a series of powerful AI tools, including Project Astra, the Veo video model, and Gemini 1.5 Pro, aimed at overhauling Google Search and challenging OpenAI. Among them, the Veo video model is seen as a direct challenge to OpenAI's Sora, producing cinematic, professional-grade results. Google says it combined several earlier research advances to improve the consistency, quality, and resolution of video generation. The release of these tools marks Google's continued progress in artificial intelligence, and competition is set to keep escalating.

【AiBase Summary:】

🔸 Project Astra, Google's "ultimate weapon," offers visual recognition and voice interaction on par with GPT-4o.

🔸 Gemini 1.5 Pro has a very long context window of up to 2 million tokens and is open for personal use.

🔸 The Veo video model takes on Sora, generating videos that are not only realistic but also cinematic in their lighting and composition.

Veo video generation application portal: https://aitestkitchen.withgoogle.com/tools/video-fx

Try Gemini: https://aistudio.google.com/app/prompts/new_chat

2、Microsoft announces GPT-4o model available on Azure OpenAI

Microsoft has announced that GPT-4o, the latest multi-modal model, is now available on Azure OpenAI. It supports multi-modal reasoning across text, video, and audio, with powerful multi-modal interpretation and output capabilities. GPT-4o has broad application prospects in education, language learning, image evaluation, and other fields.

【AiBase Summary:】

🔸 GPT-4o supports multi-modal reasoning across text, video, and audio, demonstrating powerful multi-modal interpretation and output capabilities.

🔸 In education, it can serve as an AI tutoring assistant, helping students answer questions and providing real-time language translation.

🔸 In language learning, it can help users learn Spanish through video, for example; it also has broad application prospects in image evaluation.

3、ByteDance officially releases the self-developed Doubao large model series

ByteDance launched the Doubao large model series at the 2024 Spring Volcano Engine FORCE Original Power Conference, showcasing its deep accumulation and innovative capabilities in artificial intelligence. The Doubao large model is already widely used internally and will now be offered as an external service to support the industry's intelligent upgrading. The release reflects ByteDance's technical accumulation and its view of where AI is heading.

【AiBase Summary:】

✨ ByteDance introduces the Doubao large model series, including nine models, demonstrating deep technical accumulation and innovative capabilities.

🚀 The Doubao large model has been widely applied internally, with external services set to assist in the intelligent upgrading of the industry.

💡 The innovative achievement reflects ByteDance's technical accumulation and insight into the future development of AI.

Details link: https://www.chinaz.com/2024/0515/1616629.shtml

4、Alibaba International launches AI virtual try-on tool that delivers results in 1 minute

Alibaba International's Pic has launched an AI virtual try-on tool that offers clothing merchants substantial cost savings and efficiency gains. Merchants only need to upload clothing images and select models to generate studio-quality product images in a short time, with average costs kept within 0.2-0.3 RMB. The tool not only simplifies the shooting process but also ensures that the models are properly licensed, and it was warmly received by North American customers at the Canton Fair.

【AiBase Summary:】

👗 The AI virtual try-on tool helps clothing merchants save on shooting costs by placing their products on models to generate on-model images.

📸 The virtual try-on function supports uploading images of tops and bottoms, recognizing one-piece clothing, and generating display images with different effects.

💰 Merchants using the virtual try-on function can keep average costs within 0.2-0.3 RMB, significantly reducing shooting costs and helping their products sell in global markets.

5、Tencent open sources the HunyuanDiT image generation model that can generate and refine images based on conversational context

Tencent has open-sourced HunyuanDiT, an image generation model that has a fine-grained understanding of both Chinese and English and can generate and refine images based on conversational context. HunyuanDiT combines the Transformer structure, text encoding, and positional encoding, and its developers also trained a multi-modal large language model that notably improves results on image generation tasks. The model has broad application prospects in natural language processing, image generation, and other fields.

【AiBase Summary:】

🔑 HunyuanDiT adopts the Transformer structure, which has been successful in the field of text processing.

🔑 Through text encoding and positional encoding, HunyuanDiT achieves fine-grained understanding of Chinese.

🔑 Training a multi-modal large language model enables HunyuanDiT to produce accurate, detailed image descriptions.

Details link: https://github.com/Tencent/HunyuanDiT

6、ElevenLabs releases dubbing API, allowing developers to add audio or video translation features to their products

ElevenLabs has recently released a dubbing API that makes it easy for developers to add audio or video translation features to their products. The API supports translation into 29 languages while retaining the original speaker's voice characteristics. Developers can get started quickly with the Python tutorials and API reference, and can integrate the API from any major programming language. ElevenLabs has also launched ElevenLabs Music, a product that generates songs from text and shows strong musical and creative capabilities.

【AiBase Summary:】

🔊 The dubbing API allows translation of audio or video into 29 languages while retaining the original voice characteristics.

🎶 ElevenLabs Music performs well musically, including in rhythm, harmony, and creativity.

🎤 ElevenLabs' main products include voice cloning, text-to-speech, and AI dubbing solutions.

Details link: https://elevenlabs.io/docs/api-reference/create-dub
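
For developers curious about what a call to the dubbing API might look like, here is a minimal Python sketch that only builds the HTTP request and does not send it. The endpoint URL, the `xi-api-key` header, and the `source_url` and `target_lang` form fields are assumptions based on our reading of the public API reference linked above, so check them against that page before use. The helper name `build_dub_request` is hypothetical.

```python
import uuid
import urllib.request

# Assumed endpoint; confirm against the API reference linked above.
DUBBING_URL = "https://api.elevenlabs.io/v1/dubbing"


def build_dub_request(api_key: str, source_url: str, target_lang: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for a dubbing job.

    Hypothetical helper: the field names below are assumptions, not confirmed here.
    """
    boundary = uuid.uuid4().hex
    fields = {
        "source_url": source_url,    # assumed field: URL of the audio/video to dub
        "target_lang": target_lang,  # assumed field: language code such as "es"
    }
    # Encode each field as one multipart section.
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")  # closing boundary
    body = "".join(parts).encode("utf-8")

    headers = {
        "xi-api-key": api_key,  # assumed authentication header
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(DUBBING_URL, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_dub_request("YOUR_API_KEY", "https://example.com/clip.mp4", "es")
    print(req.get_method(), req.full_url)
    # To actually submit the job, pass `req` to urllib.request.urlopen(req).
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending any API quota.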

7、MiniMax launches life partner "Conch AI"

MiniMax has launched a product called "Conch AI" that serves as an external brain and life partner for students, people new to the workforce, freelancers, creators, and other groups, helping to ease the pressure of information overload and a fast-paced life. Conch AI is intelligent and efficient: it can process long-form content, understands emotions, listens patiently, and supports several interaction methods. It is already widely used, answers user questions 24/7, and aims to accompany users through different stages of life.

【AiBase Summary:】

🧠 Intelligent and efficient: Conch AI runs on MiniMax's self-developed multi-modal large model and supports processing long-form content.

💬 Humanized interaction: The product is warm in tone, understands emotions, and listens patiently, supporting interaction methods such as text input, file upload, and voice communication.

🌟 Multi-group application: It is used by a wide range of people, from students preparing for exams to operations staff at large companies, across diverse scenarios.

8、Android is set to introduce an AI-based scam call detection feature

Google is developing a new protection feature that uses Gemini Nano to identify the fraudulent language and conversation patterns typical of scam calls. Users will receive real-time alerts and be encouraged to end suspicious calls. Detection runs on-device, which keeps conversations private while helping to prevent fraud.

【AiBase Summary:】

🔍 The feature uses Gemini Nano to identify fraudulent language and conversation patterns in scam calls and provides real-time alerts.

🚫 Users will receive alerts prompting them to end suspicious calls, avoiding the disclosure of personal information or being scammed.

💡 Gemini Nano is currently only supported on Google Pixel 8 Pro and Samsung S24 series phones, limiting the feature's applicability.

9、Google plans to integrate Gemini Nano AI directly into the Chrome browser

Google plans to integrate Gemini Nano AI directly into the Chrome browser. Users will be able to generate content such as social media posts and product reviews within the browser, and developers will get explanations of error messages and suggestions for code fixes. Gemini Nano runs locally on devices, providing a faster and more privacy-protecting AI experience.

【AiBase Summary:】

✨ Gemini Nano will be directly embedded in the Chrome browser, allowing users to generate content such as social media posts and product reviews.

🔧 As part of Chrome DevTools, Gemini Nano provides developers with explanations of error messages and suggestions for code fixes.

⚡ Gemini Nano runs locally on devices, providing a faster and more privacy-protecting AI experience.

10、Google introduces new AI model LearnLM, focusing on the education sector

Google's new AI model LearnLM is designed to help students solve homework problems and improve learning outcomes by integrating with other Google products to provide various learning assistance functions, such as simplifying lesson plans, answering math and physics questions, etc.

【AiBase Summary:】

🤖 LearnLM is an AI model developed by Google based on Gemini, designed to help students solve homework problems and improve learning outcomes.

📚 LearnLM can find and display examples in various ways to tutor students and stimulate learning interest.

💡 LearnLM has been integrated with Google Search, Android, YouTube, and Gem chatbot, simplifying lesson plans, answering video questions, providing personal experts, and more.

Details link: https://blog.google/outreach-initiatives/education/google-learnlm-gemini-generative-ai/

11、Google expands AI content watermarking technology to video and text domains

Google has announced that it is expanding its AI content watermarking to video and text through SynthID, a digital watermarking technology used to mark AI-generated content. The move matters for countering the spread of political misinformation and harmful content.

【AiBase Summary:】

🔍 SynthID is a digital watermarking technology that can now mark AI-generated video and text.

🛡️ Digital watermarks cannot be visually discerned by humans but can be detected by systems, helping to combat the spread of political misinformation and harmful content.

🌐 Digital watermarks for AI-generated content are becoming increasingly important, especially as AI is misused, and Google's SynthID is one such technology.