AudioSep
AudioSep is an open-domain audio source separation model based on natural language queries. It consists of two key components: a text encoder and a separation model. We trained AudioSep on a large-scale multimodal dataset and extensively evaluated its capabilities on many tasks, including audio event separation, instrument separation, and voice enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability, significantly outperforming previous audio query and language query sound separation models when using audio titles or text labels as queries.
CommonProductMusicAudio SeparationNatural Language Queries
AudioSep is an open-domain audio source separation model based on natural language queries. It consists of two key components: a text encoder and a separation model. We trained AudioSep on a large-scale multimodal dataset and extensively evaluated its capabilities on many tasks, including audio event separation, instrument separation, and voice enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability, significantly outperforming previous audio query and language query sound separation models when using audio titles or text labels as queries. To ensure the reproducibility of this work, we will release the source code, evaluation benchmark, and pre-trained models.
AudioSep Visit Over Time
Monthly Visits
18167391
Bounce Rate
44.37%
Page per Visit
3.2
Visit Duration
00:04:16