DenseAV
A self-supervised audio-visual feature alignment model.
Common Product | Video | Self-Supervised Learning | Audio-Visual Alignment
DenseAV is a dual-encoder localization architecture that learns high-resolution, semantically meaningful, aligned audio-visual features by watching videos. It discovers both the "meaning" of words and the "location" of sounds without explicit localization supervision, and it automatically distinguishes between these two types of association. Its localization ability comes from a new multi-head feature aggregation operator that directly compares dense image and audio representations during contrastive learning. DenseAV also significantly outperforms prior work on semantic segmentation tasks and surpasses ImageBind in cross-modal retrieval while using fewer than half the parameters.
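To make the idea of comparing dense audio and image features more concrete, here is a minimal NumPy sketch. It is not DenseAV's actual code: the function names, tensor shapes, pooling choices (max over heads and pixels, mean over audio frames), and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def multi_head_similarity(audio, image):
    """Score one audio clip against one image (hypothetical helper).

    audio: (K, C, T) array, K heads, C channels, T audio frames.
    image: (K, C, H, W) array, K heads, C channels, H x W pixels.
    """
    # Dense similarity volume: one score per head, audio frame, and pixel.
    volume = np.einsum("kct,kchw->kthw", audio, image)
    # Assumed pooling: best head/pixel match per frame, averaged over time.
    return volume.max(axis=(0, 2, 3)).mean()

def pairwise_similarity(audios, images):
    """Build a (B, B) matrix of scores for every audio/image pairing."""
    return np.array([[multi_head_similarity(a, v) for v in images]
                     for a in audios])

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss; the diagonal of `sim` holds the true pairs."""
    logits = sim / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-image and image-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
audios = rng.normal(size=(4, 2, 8, 10))     # B=4, K=2, C=8, T=10
images = rng.normal(size=(4, 2, 8, 6, 6))   # B=4, K=2, C=8, H=W=6
loss = info_nce(pairwise_similarity(audios, images))
```

Because the score is pooled from a full frame-by-pixel similarity volume, that volume can also be inspected directly to see which pixels respond to which moments of audio. This is the intuition behind localization without explicit localization labels.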
DenseAV Visit Over Time
Monthly Visits: 7200
Bounce Rate: 61.43%
Pages per Visit: 2.5
Visit Duration: 00:06:00