Generating high-quality music or sound effects from nothing more than a simple hum or a rhythmic tap is no longer a fantasy. An innovative research project called Sketch2Sound demonstrates a new AI model that creates high-quality audio from sound imitations and text prompts, marking a significant breakthrough for the field of sound creation.


The core of Sketch2Sound lies in its ability to extract three key time-varying control signals from any sound imitation (such as a vocal imitation or a reference sound): loudness, brightness (spectral centroid), and pitch. Once encoded, these control signals are injected into the latent diffusion model used for text-to-sound generation, guiding the AI to produce sounds that meet the specified requirements.
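
To make the three control signals concrete, here is a minimal sketch (not the authors' code) of how frame-wise loudness, brightness, and pitch curves could be extracted from a recorded imitation using the librosa library; the sample rate, hop size, and pitch range are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_control_signals(path, sr=44100, hop=512):
    """Sketch: derive frame-wise loudness, brightness, and pitch curves
    from a sound imitation. Parameter choices are illustrative, not the
    paper's exact settings."""
    y, sr = librosa.load(path, sr=sr)

    # Loudness: RMS energy per frame, converted to dB.
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

    # Brightness: spectral centroid per frame, in Hz.
    centroid_hz = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)[0]

    # Pitch: fundamental frequency via probabilistic YIN; NaN where unvoiced.
    f0_hz, _, _ = librosa.pyin(y, sr=sr, hop_length=hop,
                               fmin=librosa.note_to_hz("C2"),
                               fmax=librosa.note_to_hz("C7"))

    return loudness_db, centroid_hz, f0_hz
```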

What stands out most about this technology is how lightweight and efficient it is. Sketch2Sound builds upon an existing text-to-audio latent diffusion model, requiring only 40,000 fine-tuning steps and a single linear layer per control signal, making it simpler and more efficient than other methods (like ControlNet). To enable the model to synthesize from "sketch-like" sound imitations, the researchers applied random median filters to the control signals during training, so the model learns to follow control signals with flexible temporal characteristics. Experimental results show that Sketch2Sound synthesizes sounds that follow the input control signals while also adhering to the text prompts, with audio quality comparable to a text-only baseline.
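
The random median filtering applied during training can be sketched roughly as follows; the window-size range is an assumption for illustration, and scipy's median filter stands in for whatever smoothing the authors actually use.

```python
import numpy as np
from scipy.signal import medfilt

def randomly_smooth_control(signal, rng, max_window=51):
    """Sketch: blur a control curve with a median filter of random (odd)
    window size, so the model learns to follow both precise and
    sketch-like control signals. The window range is illustrative."""
    window = int(rng.integers(0, max_window // 2 + 1)) * 2 + 1  # odd size in [1, max_window]
    return medfilt(signal, kernel_size=window)

# Toy usage on a synthetic loudness-like curve.
rng = np.random.default_rng(0)
curve = np.abs(np.sin(np.linspace(0, 10, 500))) + 0.1 * rng.standard_normal(500)
smoothed = randomly_smooth_control(curve, rng)
```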

Sketch2Sound gives sound artists a new way to create. They can combine the semantic flexibility of text prompts with the expressiveness and precision of a sonic gesture or imitation to create sound works that were previously out of reach. Much as traditional Foley artists create sound effects by manipulating physical objects, Sketch2Sound guides sound generation through sound imitation, bringing a "human touch" to sound creation and enhancing the artistic value of the result.

Sketch2Sound also overcomes the limitations of traditional text-to-audio interaction. Previously, sound designers had to spend considerable time adjusting the temporal characteristics of generated sounds to sync them with visuals, whereas Sketch2Sound achieves this synchronization naturally through sound imitation. Nor is it limited to vocal imitation: any type of sound imitation can drive the generative model.

The researchers also developed a technique for adjusting how much temporal detail the control signals carry: median filters with different window sizes are applied during training, which later lets sound artists choose how strictly the generative model follows the timing of the control signals and improves quality for sounds that are hard to imitate precisely. In practical use, adjusting the size of the median filter window lets users balance strict adherence to the imitation against audio quality.
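
At inference time this becomes a user-facing knob: smoothing the imitation's control curves more heavily relaxes how tightly the model must follow them. A toy illustration, using scipy's median filter on a synthetic loudness curve (window sizes chosen only for demonstration):

```python
import numpy as np
from scipy.signal import medfilt

# Synthetic loudness curve standing in for a real imitation's control signal.
loudness = np.abs(np.sin(np.linspace(0, 20, 500)))

# Small window: keeps fine temporal detail, so generation follows the imitation closely.
strict_curve = medfilt(loudness, kernel_size=3)

# Large window: keeps only the broad shape, relaxing adherence when the
# imitation itself is hard to perform precisely.
relaxed_curve = medfilt(loudness, kernel_size=31)
```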

Sketch2Sound works by first extracting the three control signals, loudness, spectral centroid, and pitch, from the input audio. These control signals are then aligned with the latent sequence of the text-to-sound model and fed in through simple linear projection layers that condition the latent diffusion model to generate the desired sound. Experimental results show that conditioning the model on time-varying control signals significantly improves adherence to those signals while having minimal impact on audio quality and text adherence.
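
As a mental model of this conditioning pathway, one can picture a small linear layer per control signal whose output is added to the latent sequence that the diffusion model denoises. The PyTorch sketch below is an illustration under assumed tensor shapes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ControlConditioner(nn.Module):
    """Sketch: project each 1-D control curve (loudness, centroid, pitch)
    to the latent channel width and add it to the latent sequence of the
    text-to-sound diffusion model. Shapes and sizes are assumptions."""
    def __init__(self, latent_channels=64, n_controls=3):
        super().__init__()
        # One linear layer per control signal, as described in the paper.
        self.projections = nn.ModuleList(
            [nn.Linear(1, latent_channels) for _ in range(n_controls)]
        )

    def forward(self, latents, controls):
        # latents:  (batch, time, latent_channels)
        # controls: list of (batch, time) curves, resampled to the latent frame rate
        for proj, ctrl in zip(self.projections, controls):
            latents = latents + proj(ctrl.unsqueeze(-1))
        return latents

# Toy usage with random tensors standing in for real latents and control curves.
cond = ControlConditioner()
latents = torch.randn(2, 216, 64)
controls = [torch.randn(2, 216) for _ in range(3)]
conditioned = cond(latents, controls)
```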

It is also worth noting that the researchers found the control signals can steer the semantics of the generated audio. For example, with the text prompt "forest ambiance", if the sound imitation contains random loudness bursts, the model synthesizes bird calls at those bursts without any additional "birds" prompt, indicating that the model has learned an association between loudness bursts and the presence of birds.

Of course, Sketch2Sound also has some limitations, such as the centroid control potentially incorporating the room tone of the input sound imitation into the generated audio, possibly because the room tone is encoded by the centroid when there are no sound events in the input audio.

In summary, Sketch2Sound is a powerful generative sound model that produces audio from text prompts and time-varying controls for loudness, brightness, and pitch. It generates sound from sound imitations and "sketched" control curves, is lightweight and efficient, and gives sound artists a controllable, expressive, and nuanced tool that can generate any sound with flexible temporal characteristics, with promising applications in music creation, game sound design, and beyond.

Paper link: https://arxiv.org/pdf/2412.08550