A research team from Tsinghua University recently unveiled SonicSim, a simulation platform for moving sound sources, designed to address the scarcity of data for speech processing in scenarios where the sound source moves.
Built on the Habitat-sim simulation framework, the platform reproduces real-world acoustic environments with high realism, supplying better data for training and evaluating speech separation and enhancement models.
Most existing datasets for speech separation and enhancement are built on static sound sources, so they do not meet the needs of moving sound source scenarios.
Some datasets are recorded in real-world environments, but they are limited in scale and costly to collect. Synthetic datasets are larger, but their acoustic simulation often lacks realism and fails to reflect the acoustic characteristics of real environments.
SonicSim is designed to address these issues. It can simulate complex acoustic environments, accounting for obstacles, room geometry, and the absorption, reflection, and scattering properties of different materials. It also supports user-defined scene layouts, sound source and microphone positions, and microphone types.
Using SonicSim, the research team also built SonicSet, a large multi-scene dataset of moving sound sources.
The dataset draws its speech and noise from LibriSpeech, Freesound Dataset 50k (FSD50K), and Free Music Archive, and uses 90 real scenes from the Matterport3D dataset. It contains speech, environmental noise, and music noise.
The construction of SonicSet is highly automated: sound source positions, microphone positions, and sound source movement trajectories are all generated randomly, which helps ensure the realism and diversity of the data.
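To make the idea of randomized placement concrete, here is a minimal Python sketch of how positions and a movement trajectory could be drawn at random. It is not SonicSim's code or API: the function names `sample_point` and `sample_trajectory`, the box-shaped room, the wall margin, and the straight-line interpolation between waypoints are all assumptions made for illustration, whereas the platform itself works with Matterport3D scene geometry.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def sample_point(room: Point, rng: random.Random, margin: float = 0.5) -> Point:
    """Draw a uniform random point inside an axis-aligned box room.

    `room` holds the side lengths in metres (hypothetical simplification);
    the point stays at least `margin` metres away from every wall.
    """
    if any(side <= 2 * margin for side in room):
        raise ValueError("room is too small for the requested margin")
    return tuple(rng.uniform(margin, side - margin) for side in room)


def sample_trajectory(room: Point, rng: random.Random,
                      n_waypoints: int = 3,
                      steps_per_segment: int = 10) -> List[Point]:
    """Build a trajectory by linear interpolation between random waypoints."""
    waypoints = [sample_point(room, rng) for _ in range(n_waypoints)]
    path: List[Point] = []
    for start, end in zip(waypoints, waypoints[1:]):
        for k in range(steps_per_segment):
            t = k / steps_per_segment  # fraction of the way along the segment
            path.append(tuple(a + t * (b - a) for a, b in zip(start, end)))
    path.append(waypoints[-1])  # close the path on the final waypoint
    return path


# Usage: one microphone position and one moving-source path in a 5 x 4 x 3 m room.
rng = random.Random(0)           # fixed seed so the draw is reproducible
room = (5.0, 4.0, 3.0)
mic_position = sample_point(room, rng)
source_path = sample_trajectory(room, rng)
```

In a real scene the sampled points would also have to be checked against obstacles and walkable areas, which this sketch leaves out.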
To validate the effectiveness of the SonicSim platform and SonicSet dataset, the research team conducted extensive experiments on speech separation and speech enhancement tasks.
The results show that models trained on SonicSet achieved better performance on datasets recorded in the real world, which indicates that SonicSim simulates real-world acoustic environments effectively and can support research in speech processing.
The release of SonicSim and SonicSet gives the field of speech processing a new resource. As simulation tools improve and model algorithms are optimized, the application of speech processing technology in complex environments is expected to advance further.
The realism of SonicSim is still limited, however, by the level of detail in the 3D scene models. When an imported 3D scene has missing or incomplete structures, the platform cannot accurately simulate reverberation in that environment.
Paper link: https://arxiv.org/pdf/2410.01481