Wuhan University Collaborates with China Mobile's Jiutian AI Team to Release Open-Source Audio-Visual Speaker Recognition Dataset VoxBlink2

AIbase基地

Published in AI News · 3 min read · Jul 26, 2024


Wuhan University, together with China Mobile's Jiutian (Nine-Sky) Artificial Intelligence Team and Duke Kunshan University, has released VoxBlink2, an audio-visual speaker recognition dataset built from YouTube data and containing over 110,000 hours of content. It includes 9,904,382 high-quality audio clips with their corresponding video clips, drawn from 111,284 YouTube users, which makes it the largest publicly available audio-visual speaker recognition dataset to date. The release aims to enrich the open-source speech corpus and support the training of large-scale voiceprint models.


The VoxBlink2 dataset is mined through the following steps:

Candidate Preparation: Collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing.
Face Extraction & Detection: Extract video frames at a high frame rate and run a MobileNet-based face detector to ensure the video track contains only a single speaker.
Face Recognition: A pre-trained face recognizer checks the frames to ensure the audio and video clips come from the same person.
Active Speaker Detection: A multimodal active speaker detector takes lip movement sequences and audio as input and outputs speech segments; segments in which multiple speakers overlap are removed.
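The filtering logic in these steps can be illustrated with a minimal Python sketch. It is not the VoxBlink2 code: the `Frame` record, the thresholds, and the function names are all hypothetical, and the face embeddings and active-speaker scores are assumed to come from upstream models that are not shown. The sketch keeps only frames with exactly one face that matches a reference identity, then merges consecutive "speaking" frames into time segments.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class Frame:
    """Hypothetical per-frame record; not the VoxBlink2 data format."""
    face_embeddings: List[Sequence[float]]  # one embedding per detected face
    asd_score: float                        # assumed active-speaker score in [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def frame_is_valid(frame: Frame, reference: Sequence[float],
                   sim_threshold: float = 0.5) -> bool:
    """Keep frames with exactly one face that matches the reference identity."""
    if len(frame.face_embeddings) != 1:
        return False
    return cosine(frame.face_embeddings[0], reference) >= sim_threshold


def speech_segments(frames: List[Frame], reference: Sequence[float],
                    fps: float = 25.0, asd_threshold: float = 0.5,
                    min_seconds: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans where the target speaker is visible and talking."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current run, if any
    for i, frame in enumerate(frames):
        keep = frame_is_valid(frame, reference) and frame.asd_score >= asd_threshold
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            # Close the run; discard it if it is shorter than min_seconds.
            if (i - start) / fps >= min_seconds:
                segments.append((start / fps, i / fps))
            start = None
    if start is not None and (len(frames) - start) / fps >= min_seconds:
        segments.append((start / fps, len(frames) / fps))
    return segments
```

Frames with two faces, a non-matching face, or a low active-speaker score all break a segment, which mirrors the intent of the single-speaker, same-person, and active-speaker checks described above.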

To improve data accuracy, an additional bypass step using an in-set face recognizer was introduced. Through rough face extraction, face verification, face sampling, and training, this step raised accuracy from 72% to 92%.

Alongside the dataset, the VoxBlink2 team released voiceprint models of various sizes, including 2D convolutional models based on ResNet, temporal models based on ECAPA-TDNN, and an ultra-large ResNet293 model with a Simple Attention Module. After post-processing, these models can achieve an EER of 0.17% and a minDCF of 0.006 on the Vox1-O test set.
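For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them from verification scores. It is an illustration only, not the evaluation script used for VoxBlink2. The function names are hypothetical, and the minDCF parameters (`p_target=0.01`, unit costs) are assumptions, since the article does not state which operating point was used. EER is the error rate at the threshold where false acceptances and false rejections are equal. minDCF is the lowest normalized detection cost over all thresholds.

```python
from typing import List, Sequence, Tuple


def error_rates(target_scores: Sequence[float],
                nontarget_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false_reject_rate, false_accept_rate) for every observed threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    rates = []
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        rates.append((frr, far))
    return rates


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate the EER as the point where FRR and FAR are closest."""
    frr, far = min(error_rates(target_scores, nontarget_scores),
                   key=lambda r: abs(r[0] - r[1]))
    return (frr + far) / 2


def min_dcf(target_scores: Sequence[float],
            nontarget_scores: Sequence[float],
            p_target: float = 0.01,   # assumed prior, not given in the article
            c_miss: float = 1.0,      # assumed cost of a false rejection
            c_fa: float = 1.0) -> float:  # assumed cost of a false acceptance
    """Minimum normalized detection cost over the observed thresholds."""
    norm = min(c_miss * p_target, c_fa * (1 - p_target))
    costs = [c_miss * p_target * frr + c_fa * (1 - p_target) * far
             for frr, far in error_rates(target_scores, nontarget_scores)]
    return min(costs) / norm
```

With same-speaker scores `[0.9, 0.8, 0.7, 0.3]` and different-speaker scores `[0.6, 0.4, 0.2, 0.1]`, `equal_error_rate` returns 0.25, because at a threshold of 0.6 one of four target trials is rejected and one of four non-target trials is accepted.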

Dataset Website: https://VoxBlink2.github.io

Dataset Download Scripts: https://github.com/VoxBlink2/ScriptsForVoxBlink2

Metafiles and Models: https://drive.google.com/drive/folders/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP

Paper: https://arxiv.org/abs/2407.11510


This article is from AIbase Daily

