DenseAV

A self-supervised audio-visual feature alignment model.

Tags: Common Product, Video, Self-Supervised Learning, Audio-Visual Alignment
DenseAV is a dual-encoder localization architecture that learns high-resolution, semantically meaningful, audio-visually aligned features solely by watching videos. Without explicit localization supervision, it discovers both the "meaning" of words and the "location" of sounds, and it automatically distinguishes between these two types of association. Its localization capability comes from a new multi-head feature aggregation operator that directly compares dense image and audio representations during contrastive learning. DenseAV also significantly outperforms prior art on semantic segmentation tasks and surpasses ImageBind on cross-modal retrieval while using fewer than half the parameters.
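To make the idea of comparing dense audio and image features more concrete, here is a minimal sketch in plain Python. It is not DenseAV's actual implementation. The function names (`aggregate_similarity`, `info_nce`), the pooling order (max over heads and image locations, then mean over audio frames), and the temperature value are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def aggregate_similarity(audio, image):
    """Collapse dense features into one clip-level score (hypothetical pooling).

    audio[t][k]    : feature vector for audio frame t, head k
    image[h][w][k] : feature vector for image location (h, w), head k
    """
    num_heads = len(audio[0])
    per_frame = []
    for frame in audio:
        # Best match for this audio frame over all heads and image locations.
        best = max(
            dot(frame[k], pixel[k])
            for row in image
            for pixel in row
            for k in range(num_heads)
        )
        per_frame.append(best)
    # Average the per-frame best matches over time.
    return sum(per_frame) / len(per_frame)


def info_nce(scores, temperature=0.1):
    """Contrastive loss over a square score matrix.

    scores[i][j] is the aggregated similarity of audio clip i and image j;
    matching pairs sit on the diagonal.
    """
    losses = []
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy data: one audio frame, one head, and a 1x2 image grid.
toy_audio = [[[1.0, 0.0]]]
toy_image = [[[[1.0, 0.0]], [[0.0, 1.0]]]]
toy_score = aggregate_similarity(toy_audio, toy_image)
```

In this sketch the dense similarities that exist before pooling are what would give a localization map: the image location with the highest per-frame score is where the model "hears" the sound or word. Only the pooled score enters the contrastive loss, so no location labels are needed.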

DenseAV Visit Over Time

Monthly Visits: 7200
Bounce Rate: 61.43%
Pages per Visit: 2.5
Visit Duration: 00:06:00
