In music and sound design, combining technology with creativity remains difficult. Existing AI models often excel at a single task but adapt poorly to others, which limits how much AI can support music production. Audio production would be better served by a versatile model that can respond flexibly to a range of creative demands. To this end, NVIDIA has introduced Fugatto, an audio generation and processing model with 2.5 billion parameters.

Fugatto is designed for flexible sound input and creative experimentation, combining text prompts with advanced audio synthesis. For example, it can transform a piano melody into a vocal performance or make a trumpet produce sounds it would not normally make.

Fugatto accepts optional audio input in addition to text, which removes a limitation of traditional audio generation models. Artists and developers can create and modify audio in real time and smoothly generate new types of sounds.

Technically, Fugatto uses a data generation method that goes beyond traditional supervised learning. Its training draws on conventional datasets and on specially generated datasets, which together cover a wide variety of audio generation and transformation tasks. Fugatto also uses large language models (LLMs) to generate and enrich instructions, which helps it learn the relationship between audio and text prompts.

A significant innovation is "Composable Audio Representation Transformation" (ComposableART), an inference-time technique that allows different audio generation instructions to be flexibly combined, interpolated, or negated. ComposableART gives users finer control over audio synthesis, letting them navigate Fugatto's sound palette precisely and create sounds that no single instruction would produce.
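The idea of combining, interpolating, and negating instructions can be illustrated with a small sketch. This is not Fugatto's implementation: it assumes a guidance-style formulation in which the model produces one prediction without any instruction and one prediction per instruction, and the final output is a weighted sum of the differences. The function name `compose_guidance`, the instruction labels, and the toy vectors are all hypothetical.

```python
from typing import Dict, List, Sequence


def compose_guidance(
    unconditional: Sequence[float],
    conditional: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Combine per-instruction predictions around an unconditional baseline.

    Computes: result = u + sum_k w_k * (c_k - u)
    A positive weight pulls the result toward an instruction, a fractional
    weight interpolates, and a negative weight pushes away from it (negation).
    """
    result = list(unconditional)
    for name, weight in weights.items():
        prediction = conditional[name]
        if len(prediction) != len(unconditional):
            raise ValueError(f"prediction for {name!r} has the wrong length")
        for i, (c, u) in enumerate(zip(prediction, unconditional)):
            result[i] += weight * (c - u)
    return result


# Toy stand-ins for model outputs (assumed values, not real audio features).
baseline = [0.0, 0.0]
predictions = {"piano": [1.0, 0.0], "rain": [0.0, 2.0]}

# Interpolation: an even blend of the two instructions.
blend = compose_guidance(baseline, predictions, {"piano": 0.5, "rain": 0.5})

# Negation: follow "piano" while steering away from "rain".
no_rain = compose_guidance(baseline, predictions, {"piano": 1.0, "rain": -1.0})

print(blend)    # [0.5, 1.0]
print(no_rain)  # [1.0, -2.0]
```

In a real system the vectors would be the model's intermediate predictions at each generation step, and the weights would be the controls exposed to the user.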

Fugatto's architecture is based on an enhanced Transformer model with specific modifications such as adaptive layer normalization, which helps keep behavior consistent across different input conditions and supports complex compositions of instructions. Preliminary tests indicate that Fugatto performs well on common benchmarks, particularly in sound synthesis and transformation, where it compares favorably with specialized models.
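For readers unfamiliar with adaptive layer normalization, the sketch below shows the general idea in its commonly used form: the input is normalized as usual, and the scale and shift are computed from a conditioning vector instead of being fixed learned parameters. The article does not describe Fugatto's exact formulation, so the formula, the helper `linear`, and all weights here are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(cond: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain linear projection of the conditioning vector (hypothetical helper)."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weights, bias)]


def adaptive_layer_norm(x: Sequence[float], scale: Sequence[float],
                        shift: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Apply layer norm, then a condition-dependent scale and shift.

    Computes: out_i = layer_norm(x)_i * (1 + scale_i) + shift_i
    """
    normed = layer_norm(x, eps)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


# Toy example: a 3-dimensional feature and a 2-dimensional conditioning vector.
x = [1.0, 2.0, 3.0]
cond = [1.0, 0.0]

# Assumed projection weights; in a trained model these would be learned.
scale = linear(cond, [[0.0, 0.0]] * 3, [0.0] * 3)  # all zeros: no rescaling
shift = linear(cond, [[0.5, 0.0]] * 3, [0.0] * 3)  # shifts every feature by 0.5

out = adaptive_layer_norm(x, scale, shift)
print(out)
```

Because the scale and shift depend on the conditioning vector, the same layer can behave differently for different instructions or input types, which is the property the paragraph above attributes to this modification.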

The launch of Fugatto marks a significant advancement in audio generation AI, breaking traditional limitations and providing powerful and flexible tools for creative audio production. Its potential applications in music, gaming, entertainment, and education suggest that AI technology will continue to play an important role in enhancing human creativity.

Official Blog: https://blogs.nvidia.com/blog/fugatto-gen-ai-sound-model/

Paper: https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.pdf

Highlights:

🎵 Fugatto is an audio AI model from NVIDIA with 2.5 billion parameters. It accepts both text and audio input to support music and sound creation.

💻 It combines a new data generation method with Composable Audio Representation Transformation (ComposableART), allowing users to flexibly generate and modify sounds.

🌟 Preliminary tests show that Fugatto compares favorably with specialized models in audio synthesis and transformation, pointing to strong creative potential.