MaskVAT
A video-to-audio generation model that enhances synchronization
Common Product, Video, Video-to-audio, Synchronization
MaskVAT is a video-to-audio (V2A) generation model that uses visual features extracted from a video to generate realistic sounds matching the scene. It places particular emphasis on aligning the onset of each sound with the visual action that causes it, so that the generated audio does not feel out of sync with the picture. MaskVAT combines a high-quality, full-band general-purpose audio codec with a sequence-to-sequence masked generative model, achieving performance competitive with non-codec audio generation models while maintaining high audio quality, semantic matching, and temporal synchronization.
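To make the idea of masked generation over codec tokens more concrete, here is a minimal sketch of iterative masked decoding. It is not MaskVAT's implementation: the description above only says that a masked generative model is paired with an audio codec. The confidence-based unmasking, the cosine schedule, and every name here (`MASK`, `toy_predictor`, `masked_decode`, the integer `video_features`) are assumptions chosen for illustration, and the predictor is a toy stand-in for a real video-conditioned network.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a codec token that is not yet generated


def toy_predictor(tokens, video_features, vocab_size, rng):
    """Stand-in for a video-conditioned model (an assumption, not MaskVAT itself).

    Returns one (token, confidence) guess per position.
    """
    guesses = []
    for i in range(len(tokens)):
        feature = video_features[i % len(video_features)]
        guesses.append(((feature + i) % vocab_size, rng.random()))
    return guesses


def masked_decode(video_features, seq_len, vocab_size, steps=4, seed=0):
    """Fill a fully masked token sequence over a fixed number of refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_predictor(tokens, video_features, vocab_size, rng)
        # Cosine schedule: how many positions should stay masked after this step.
        target_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / steps))
        n_unmask = min(len(masked), max(1, len(masked) - target_masked))
        # Commit the most confident guesses among the still-masked positions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_unmask]:
            tokens[i] = guesses[i][0]
    return tokens


codes = masked_decode(video_features=[3, 7, 1, 4], seq_len=12, vocab_size=16)
```

In a real system the resulting token sequence would be passed to the codec's decoder to produce a waveform. Because each token position covers a fixed time span, aligning positions with video frames is one plausible way such a model could keep sound onsets in step with the picture.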