DenseAV

A self-supervised audio-visual feature alignment model.

Tags: Common Product, Video, Self-Supervised Learning, Audio-Visual Alignment

DenseAV is a dual-encoder localization architecture that learns high-resolution, semantically meaningful audio-visual alignment features solely by watching videos. Without any explicit localization supervision, it can discover the "meaning" of words and the "location" of sounds, and it automatically distinguishes between these two types of association. This localization ability stems from a new multi-head feature aggregation operator that directly compares dense image and audio representations during contrastive learning. DenseAV also significantly outperforms prior work on semantic segmentation tasks and surpasses ImageBind on cross-modal retrieval while using fewer than half the parameters.
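The idea of comparing dense audio and image features head by head, then training contrastively, can be illustrated with a small sketch. This is not DenseAV's actual code: the function names, the feature shapes, the choice to max-pool over heads and image locations and then average over audio time steps, and the temperature value are all assumptions made for illustration. Plain Python lists stand in for tensors to keep the example self-contained.

```python
import math


def multi_head_similarity(audio, image, num_heads):
    """Collapse a dense audio-visual similarity volume into one score.

    audio: list of T feature vectors (one per audio time step), each of length C.
    image: list of P feature vectors (one per image location), each of length C.
    Assumed pooling: max over heads and image locations, mean over time steps.
    """
    channels = len(audio[0])
    assert channels % num_heads == 0, "channels must split evenly across heads"
    d = channels // num_heads

    per_step = []
    for a in audio:
        best = -math.inf
        for v in image:
            for k in range(num_heads):
                lo, hi = k * d, (k + 1) * d
                # Inner product restricted to the channels of head k.
                score = sum(x * y for x, y in zip(a[lo:hi], v[lo:hi]))
                best = max(best, score)
        per_step.append(best)
    return sum(per_step) / len(per_step)


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE loss over a B x B similarity matrix.

    sim[i][j] scores audio clip i against image j; matched pairs sit on the diagonal.
    """
    def row_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (row_loss(sim) + row_loss(cols))


# Toy example: 2 audio steps, 2 image locations, 4 channels split into 2 heads.
audio = [[1, 0, 0, 0], [0, 0, 1, 0]]
image = [[1, 0, 0, 0], [0, 0, 2, 0]]
score = multi_head_similarity(audio, image, num_heads=2)
print(score)  # 1.5
```

In this toy input, the first audio step matches the first image location through head 0 and the second matches the second location through head 1, which loosely mirrors how separate heads could specialize in different kinds of association. In a training loop, `multi_head_similarity` would fill the similarity matrix passed to `info_nce`.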

DenseAV Alternatives

MuVi — A video-to-music generation framework that achieves semantic alignment and rhythmic synchronization of audio and visual content.

Sparsh — Self-supervised tactile representation for vision-based tactile sensing.

SHMT — A self-supervised hierarchical makeup transfer technology based on latent diffusion models

PixelPlayer — Audio-Visual Source Separation System

ManiWAV — Robot manipulation learning from wild audio-visual data

AV-HuBERT — A self-supervised framework for learning audio-visual speech representations.

LuDe — AI-Powered Audio-Visual Generation Tool

ReSyncer — Unified audio-visual synchronization for facial performers

33 Subtitle — Accurately transcribes audio-visual content into text or SRT subtitles

Supervised AI — Build code-free supervised learning models.

Wanxiang Tianmu — An AI tool with powerful audio-visual multimedia material generation and understanding capabilities.

ELLA — An LLM-enhanced semantic alignment adapter for diffusion models

AniTalker — Transforms static portrait images and input audio into vibrant animated dialogue videos

Mikey Smart — An all-in-one AI-powered audio-visual service providing voice translation, voice customization, and voiceover.

A Vision Check-up — Examines what language models learn about the visual world from relationships between strings

vta-ldm — Video to Audio Generation Model

Segment Anything Model 2 — A foundational model for visual segmentation of images and videos.

miqu-1-70b — Miqu 1-70b is an open-source large language model.

prism-alignment — Explore the preferences and value alignment of large language models.

Image Matting — An online image segmentation tool based on deep learning.

ObjectDrop — A method for realistic object removal and insertion using counterfactual datasets and self-supervised learning.

Wikipedia Semantic Search — Explore the semantic search capabilities of Wikipedia.

1.58-bit FLUX — A state-of-the-art text-to-image generation model utilizing 1.58-bit quantization.

DiariZen — A toolkit for speaker diarization.

Video-LLaVA — Learns a unified visual representation by aligning image and video features before projection.

MimicBrush — Zero-shot image editing, mimic the style of reference images with one click

Segment Anything 2 for Surgical Video Segmentation — An advanced model for surgical video segmentation.

Semantic Search on Wikipedia with Upstash Vector — A semantic search tool for Wikipedia based on Upstash Vector.

NotebookLM Audio Overview — Transforms documents into AI-generated audio discussions for easier learning and retention.
