
Audio-Visual Source Separation System

PixelPlayer is a system that, by watching a large number of unlabeled videos, learns to locate the image regions that produce sound and to separate the input audio into a set of components representing the sound of each pixel. The method exploits the natural synchronization of the visual and audio modalities to learn a joint model for parsing sound and images, with no additional human labeling. The system is trained on a large collection of videos featuring solo and duet performances with different combinations of instruments. No supervision is provided on which instruments appear in each video, where they are, or what they sound like. At test time, the input is a video of instruments being played, together with its monaural audio track. The system performs audio-visual source separation and localization, splitting the input audio signal into N sound channels, each corresponding to a different instrument category. It can also localize the sounds and assign a separate audio waveform to each pixel of the input video.
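
The description above does not say how the per-pixel separation is computed. As a rough illustration only, the sketch below assumes a mask-based design: an audio network is assumed to produce K spectrogram components, a video network is assumed to produce a K-dimensional feature for each pixel, and the two are combined into a soft mask that is applied to the mixture spectrogram. The function name `separate_per_pixel`, the array shapes, and the random stand-in data are all hypothetical and are not taken from the system itself.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def separate_per_pixel(mix_spec, audio_components, pixel_features):
    """Hypothetical per-pixel separation step (an assumption, not the real model).

    mix_spec:         (F, T) magnitude spectrogram of the mono mixture
    audio_components: (K, F, T) assumed output of an audio network
    pixel_features:   (H, W, K) assumed output of a video network
    returns:          (H, W, F, T) one masked spectrogram per pixel
    """
    # Weight the K audio components by each pixel's K-dimensional feature.
    logits = np.einsum("hwk,kft->hwft", pixel_features, audio_components)
    # Squash to a soft mask in (0, 1) for every pixel and time-frequency bin.
    masks = sigmoid(logits)
    # Broadcasting applies the (F, T) mixture to every pixel's mask.
    return masks * mix_spec


# Random stand-ins for real network outputs (assumed shapes).
rng = np.random.default_rng(0)
F, T, K, H, W = 8, 10, 4, 3, 5
mix_spec = np.abs(rng.normal(size=(F, T)))
audio_components = rng.normal(size=(K, F, T))
pixel_features = rng.normal(size=(H, W, K))

per_pixel = separate_per_pixel(mix_spec, audio_components, pixel_features)
print(per_pixel.shape)  # (3, 5, 8, 10)
```

In a design like this, a pixel whose visual feature matches none of the audio components would receive a mask near zero and so stay nearly silent, which is one way localization and separation could fall out of the same computation. Converting each masked spectrogram back to a waveform would additionally need the mixture's phase and an inverse short-time Fourier transform, which the sketch omits.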