Voice synthesis is advancing quickly, particularly in the field of restoring lost voices. Researchers at Google recently introduced a technique called "zero-shot voice transfer," which can be integrated directly with state-of-the-art text-to-speech (TTS) systems to help people who have lost their voices to illness or accident recover a voice that sounds like their own.

The core of the technique is its "zero-shot" capability: it does not need an extensive set of recordings from the target speaker. A few seconds of reference audio are enough to clone a voice, and the cloned voice can also be used to synthesize speech in other languages.

"Zero-shot" voice cloning capability

The research team demonstrated the technique using audio samples from the VCTK speech corpus. For instance, given pre-recorded Mandarin, English, and Spanish audio, the system mimics the vocal characteristics of the reference recordings, generating synthesized speech that closely resembles the original voices.

Project page: https://google.github.io/tacotron/publications/zero_shot_voice_transfer/

Notably, the transfer is not limited to a single language: the research also demonstrated synthesizing speech in French, German, and even Arabic from English voice samples.

To validate the technique, the researchers ran a range of experiments, including collaborations with speakers who have unique pronunciations. Using reference samples of only 12 and 14 seconds, they generated voices similar to the originals, demonstrating the technique's adaptability.

In testing, the researchers extended the technique to six different languages, further demonstrating its flexibility and practicality.

Multilingual support

Wider adoption of this technique would not only help people who have lost their voices regain them but also open new possibilities for cross-language communication, making barrier-free communication more efficient and convenient. Zero-shot voice transfer could let more people move freely between languages and enjoy communicating.

Key points  

🎤 **Zero-shot voice transfer**: A voice synthesis technique that needs only a few seconds of reference audio rather than extensive samples, helping people who have lost their voices regain them.  

🌍 **Cross-language capability**: The technique can transfer a voice across languages, for example synthesizing French, German, or Arabic speech from English voice samples.

🗣️ **Application for speakers with unique pronunciations**: Using audio samples of only 12 and 14 seconds, the team synthesized similar-sounding speech for these speakers, demonstrating the technique's adaptability.