Video-Foley

A system for synchronized generation of sound from video

CommonProductProductivityVideo Sound SynthesisSelf-Supervised Learning
Video-Foley is an innovative system for generating sound from video. It employs root mean square (RMS) as a temporal event condition, combined with semantic tonal prompts (audio or text), to achieve high control and synchronization in video sound synthesis. The system utilizes an unsupervised learning framework that requires no manual labeling, consisting of two stages: Video2RMS and RMS2Sound, incorporating novel concepts such as RMS discretization and RMS-ControlNet, in conjunction with a pre-trained text-to-audio model. Video-Foley achieves state-of-the-art performance in aligning and controlling sound timing, intensity, timbre, and detail.
Visit