Remember the movie scenes where the protagonist waves a magic wand and bends sound to their will? That ability is no longer pure fantasy. NVIDIA's latest AI model, Fugatto, acts like a "sound magic wand": users can generate and transform music, sounds, and voices using text prompts alone.

Fugatto, short for "Foundational Generative Audio Transformer Opus 1," is a generative AI model for audio. Where other AI models can only compose music or modify speech, Fugatto can generate or transform any mix of music, speech, and sounds. It takes its instructions as any combination of text prompts and audio files.

Fugatto's features could serve users in many fields, including music producers, advertising agencies, language learning tool developers, and game developers. Music producers could quickly try out different musical styles, vocals, and instruments, add effects, or improve the quality of existing songs. Advertising agencies could give voiceovers different accents and emotions, adapting a campaign for different regions and target audiences. Language learning tool developers could use Fugatto to deliver course content in any voice the user chooses, such as that of a family member or friend, making learning more personal. Game developers could use Fugatto to modify existing sound assets in real time as the game progresses, or to create entirely new sound effects from text instructions and audio inputs.

The magic of Fugatto lies in how it interprets instructions and turns them into sound. It can carry out specific user commands and create sounds that have never been heard before. For instance, it can make a trumpet bark like a dog or a saxophone meow like a cat; as long as the user can describe it, Fugatto can create it.

Another notable ability of Fugatto is combining instructions that it saw only separately during training to produce more complex effects. For example, users can ask it to generate speech that sounds sad and carries a French accent. Fugatto also lets users fine-tune each instruction, such as how heavy the accent is or how strong the sadness is, giving them an artist's level of control over the result.
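To make the idea of adjustable, combined instructions concrete, here is a minimal Python sketch. It is not Fugatto's implementation or API, which this article does not describe; the function name `blend_instructions`, the instruction labels, and the toy three-number "guidance" vectors are all assumptions made for illustration. It simply shows one plausible way per-instruction strengths could be expressed: as weights in a weighted sum.

```python
from typing import Dict, List


def blend_instructions(
    guidance: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Return the weighted sum of per-instruction guidance vectors.

    Hypothetical helper: each instruction (e.g. "sad") is represented by a
    vector, and its weight controls how strongly it shapes the result.
    Instructions without a weight contribute nothing.
    """
    if not guidance:
        raise ValueError("at least one instruction is required")
    dim = len(next(iter(guidance.values())))
    blended = [0.0] * dim
    for name, vector in guidance.items():
        if len(vector) != dim:
            raise ValueError(f"vector for {name!r} has the wrong length")
        weight = weights.get(name, 0.0)
        for i, value in enumerate(vector):
            blended[i] += weight * value
    return blended


# Toy vectors standing in for whatever a real model would compute.
guidance = {
    "sad": [1.0, 0.0, 0.5],
    "french_accent": [0.0, 2.0, 0.5],
}

# A full-strength accent with only a hint of sadness.
mild_sadness = blend_instructions(guidance, {"sad": 0.25, "french_accent": 1.0})
print(mild_sadness)
```

Raising the `"sad"` weight toward 1.0 would pull the blended vector further toward the sad instruction while leaving the accent contribution unchanged.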

Fugatto can also generate sounds that change over time, such as a storm approaching from afar, with thunder that gradually intensifies and then slowly fades away. Users can control how the sound evolves, which makes for vivid, dynamic sound effects.
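The storm example can be pictured as an intensity curve over time. The sketch below is an illustration only, not how Fugatto works internally: `storm_envelope` and `thunder_mix` are hypothetical names, and the piecewise-linear rise and fall is an assumed shape chosen for simplicity.

```python
from typing import Dict


def storm_envelope(t: float, peak_time: float, duration: float) -> float:
    """Thunder intensity in [0, 1] at time t (seconds).

    Rises linearly from 0 to 1 at peak_time, then falls linearly back
    to 0 at duration. Outside (0, duration) the intensity is 0.
    """
    if not 0.0 < peak_time < duration:
        raise ValueError("peak_time must lie strictly between 0 and duration")
    if t <= 0.0 or t >= duration:
        return 0.0
    if t <= peak_time:
        return t / peak_time
    return (duration - t) / (duration - peak_time)


def thunder_mix(t: float, peak_time: float, duration: float) -> Dict[str, float]:
    """Weights for two hypothetical sound layers at time t; they sum to 1."""
    thunder = storm_envelope(t, peak_time, duration)
    return {"distant_rain": 1.0 - thunder, "thunder": thunder}


# Sample a 30-second storm whose thunder peaks at the 10-second mark.
for second in (0, 5, 10, 20, 30):
    print(second, thunder_mix(float(second), peak_time=10.0, duration=30.0))
```

Moving `peak_time` earlier or later changes how quickly the storm arrives relative to how slowly it recedes.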

Fugatto was built by researchers from around the world, with team members from countries including India, Brazil, China, Jordan, and South Korea. Their collaboration strengthened Fugatto's ability to handle multiple accents and languages.

Fugatto is the culmination of years of NVIDIA research in fields such as speech modeling, audio coding, and audio understanding. The full version of the model has 2.5 billion parameters and was trained on a cluster of NVIDIA DGX systems with 32 NVIDIA H100 Tensor Core GPUs.

Fugatto points to a new stage in audio processing technology, opening up possibilities in music, film, gaming, and education. We look forward to hearing what it creates next!

Official Blog: https://blogs.nvidia.com/blog/fugatto-gen-ai-sound-model/