TangoFlux
An efficient text-to-audio generation model
CommonProductMusicText-to-audioAudio generation
TangoFlux is an efficient text-to-audio (TTA) generation model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. The model introduces the CLAP-Ranked Preference Optimization (CRPO) framework to address the alignment challenges of TTA models, enhancing TTA alignment through iterative generation and optimization of preference data. TangoFlux achieves state-of-the-art performance in both objective and subjective benchmark tests, and all code and models are open-source to support further research in TTA generation.