Text-to-audio generation is an increasingly active area of artificial intelligence research. Researchers recently introduced a new model called TANGOFLUX, which performs well on both output quality and efficiency.
TANGOFLUX is an efficient text-to-audio generation model with 515 million parameters. It can generate up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU.
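To put those figures in perspective, the short calculation below derives the speed relative to real-time playback and the number of samples in a full-length clip. It uses only the numbers quoted above; the function and variable names are illustrative, not part of any TANGOFLUX code.

```python
# Figures quoted in the article; everything else here is illustrative.
AUDIO_SECONDS = 30.0      # maximum clip length
WALL_SECONDS = 3.7        # reported generation time on a single A40 GPU
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz output


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / wall_seconds


def samples_per_channel(audio_seconds: float, sample_rate_hz: int) -> int:
    """Number of audio samples in one channel of the clip."""
    return round(audio_seconds * sample_rate_hz)


rtf = real_time_factor(AUDIO_SECONDS, WALL_SECONDS)
n_samples = samples_per_channel(AUDIO_SECONDS, SAMPLE_RATE_HZ)

print(f"{rtf:.1f}x faster than real time")   # about 8.1x
print(f"{n_samples:,} samples per channel")  # 1,323,000
```

In other words, the reported speed corresponds to roughly eight seconds of audio for every second of GPU time.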
TANGOFLUX can generate a wide range of sound effects, such as bird calls, whistles, and explosions. It also supports music generation, although the results there are less convincing.
A major challenge in aligning text-to-audio generation models is creating preference pairs, that is, pairs of outputs for the same prompt in which one is judged better than the other. Unlike large language models (LLMs), text-to-audio generation models lack a verifiable reward mechanism or gold-standard answers. To address this issue, the research team proposed a new framework called CLAP-Ranked Preference Optimization (CRPO). This framework iteratively generates and optimizes preference data to improve the alignment of text-to-audio generation models. According to the researchers, audio preference data generated with CRPO outperforms existing alternatives.
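The ranking step can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes several candidate clips are generated per prompt, that each clip is scored by the cosine similarity between a text embedding and an audio embedding (the kind of score a CLAP model provides), and that the highest- and lowest-scoring candidates form the preference pair. The toy two-dimensional vectors stand in for real CLAP embeddings, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_preference_pair(
    prompt_embedding: Vector,
    candidate_embeddings: List[Vector],
) -> Tuple[int, int, List[float]]:
    """Rank candidates against the prompt and pick a (winner, loser) pair.

    Returns the index of the best-scoring candidate, the index of the
    worst-scoring candidate, and the full list of scores.
    """
    if len(candidate_embeddings) < 2:
        raise ValueError("need at least two candidates to form a pair")
    scores = [cosine_similarity(prompt_embedding, c) for c in candidate_embeddings]
    indices = range(len(scores))
    winner = max(indices, key=scores.__getitem__)
    loser = min(indices, key=scores.__getitem__)
    return winner, loser, scores


# Toy stand-ins for embeddings (hypothetical values, not real CLAP output).
prompt = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

winner, loser, scores = build_preference_pair(prompt, candidates)
print(f"winner: candidate {winner}, loser: candidate {loser}")
```

In an iterative scheme of this kind, the resulting pairs would be used for a round of preference optimization, and the updated model would then generate fresh candidates for the next round.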
With this framework, TANGOFLUX achieves leading performance on multiple objective and subjective benchmarks. The research team has also decided to open-source all code and models to support further research on text-to-audio generation. For applications that require audio generation, TANGOFLUX represents a notable step forward.
In practice, TANGOFLUX surpasses other models in generation quality, producing clearer event sounds, more faithful event ordering, and higher overall audio quality. Side-by-side examples make these differences easy to hear.
Example prompt: "The harmonious coexistence of human whistling and natural bird calls."
With models like this, the range of applications for text-to-audio generation continues to grow, and the technology could play an important role in film production, game sound effects, and more.
Project page: https://tangoflux.github.io/
Key points:
🎧 TANGOFLUX is an efficient text-to-audio generation model that can produce 30 seconds of high-quality audio in just 3.7 seconds.
🔧 The proposed CLAP-Ranked Preference Optimization (CRPO) framework iteratively generates and optimizes audio preference data to improve model alignment.
🌍 All code and models have been open-sourced to promote research and application in text-to-audio generation.