Google's New Voice Cloning Technology Needs Only a Few Seconds of Sample Audio

AIbase基地

Published in AI News · 5 min read · Sep 25, 2024

Voice synthesis is advancing quickly, particularly in the field of restoring lost voices. Researchers at Google recently introduced a technique called "zero-shot voice transfer" that can be integrated directly with state-of-the-art text-to-speech (TTS) systems, helping people who have lost their voices to illness or accident regain them.

The core of the technology is its "zero-shot" capability: it does not need extensive recordings of the target speaker. A few seconds of reference audio are enough to clone a voice, and the cloned voice can also be used to synthesize speech in other languages.
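To make the "zero-shot" idea concrete, here is a minimal Python sketch. It rests on an assumption: the article does not describe how Google's system works internally, so this only illustrates a pattern common to zero-shot voice systems in general, where a short reference clip is reduced to a fixed-size "speaker embedding" that a TTS model can be conditioned on, with no training on the new speaker. The function `toy_speaker_embedding` and its hand-made features are hypothetical stand-ins for a learned speaker encoder, and the sine-wave clips stand in for real recordings.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed value for this toy example


def toy_speaker_embedding(samples, frame_size=200):
    """Reduce audio of any length to a fixed 3-value vector.

    Hypothetical stand-in for a learned speaker encoder. The features are
    mean frame energy, spread of frame energy, and zero-crossing rate.
    """
    if len(samples) < frame_size:
        raise ValueError("reference audio is shorter than one frame")
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    mean_e = sum(energies) / len(energies)
    std_e = math.sqrt(sum((e - mean_e) ** 2 for e in energies) / len(energies))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(samples) - 1)
    return [mean_e, std_e, zcr]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sine_clip(freq_hz, seconds):
    """Synthetic 'voice': a pure tone used in place of a real recording."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# Two clips of different lengths from the same "speaker", and one from another.
ref_12s = toy_speaker_embedding(sine_clip(120, 12))
ref_14s = toy_speaker_embedding(sine_clip(120, 14))
other = toy_speaker_embedding(sine_clip(1200, 12))

same_voice = cosine_similarity(ref_12s, ref_14s)
different_voice = cosine_similarity(ref_12s, other)
print(f"same speaker: {same_voice:.4f}, different speaker: {different_voice:.4f}")
```

The point of the sketch is only the shape of the data flow: clips of different lengths map to vectors of the same size, and no per-speaker training step is involved. A real system would replace the hand-made features with a neural speaker encoder.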

"Zero-shot" voice cloning capability

The research team demonstrated the technology using audio samples from the VCTK speech corpus. For instance, given pre-recorded Mandarin, English, and Spanish audio, the system can mimic the vocal characteristics in those recordings and generate synthesized speech that closely resembles the original voices.

Project entry: https://google.github.io/tacotron/publications/zero_shot_voice_transfer/

The conversion is not limited to a single language: using English voice samples, the researchers also synthesized speech in languages such as French, German, and even Arabic.

To validate the technology, the researchers ran numerous experiments, including collaborations with speakers with unique pronunciations. They generated similar-sounding voices from reference samples of only 12 and 14 seconds, which demonstrates how well the technology adapts.

In testing, researchers extended this technology to six different languages, further showcasing its flexibility and practicality.

Support for multilingual examples:

Wider adoption of this technology could help people who have lost their voices regain them, and it opens up new possibilities for cross-language communication by making barrier-free communication more efficient and convenient. As zero-shot voice transfer matures, it may let more people communicate freely across languages.

Key points
🎤 **Zero-shot voice conversion technology**: A voice synthesis technique that needs only a few seconds of reference audio rather than extensive samples, helping people who have lost their voices regain them.
🌍 **Language capabilities**: The technology can transfer a voice across languages, broadening the possibilities for voice communication.
🗣️ **Application for speakers with unique pronunciations**: Using audio samples of only 12 and 14 seconds, the team synthesized speech for speakers with unique pronunciations, demonstrating the technology's adaptability.
