Wuhan University, in collaboration with the Nine-Sky Artificial Intelligence Team of China Mobile and Duke Kunshan University, has released VoxBlink2, an audiovisual speaker recognition dataset mined from YouTube that contains over 110,000 hours of content. The dataset comprises 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 YouTube users, making it the largest publicly available audiovisual speaker recognition dataset to date. The release aims to enrich the open-source speech corpus ecosystem and support the training of large-scale voiceprint models.
The VoxBlink2 dataset is mined through the following steps:
Candidate Preparation: Collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing.
Facial Extraction & Detection: Extract video frames at a high frame rate and run MobileNet face detection, keeping only video tracks that contain a single speaker.
Facial Recognition: A pre-trained face recognizer verifies the detected faces across frames, ensuring that the audio and video clips come from the same person.
Active Speaker Detection: A multimodal active speaker detector combines lip-movement sequences with the audio track to output speech segments, discarding segments where multiple speakers overlap.
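The four-stage filter above can be sketched as a simple cascade. This is a minimal illustration with hypothetical stand-in functions (the real pipeline uses MobileNet face detection, a pre-trained face recognizer, and a multimodal active speaker detector, none of which are reproduced here):

```python
def detect_faces(frame):
    # Hypothetical stand-in for MobileNet face detection:
    # returns the list of faces found in one video frame.
    return frame.get("faces", [])

def same_identity(face, reference_id):
    # Hypothetical stand-in for face verification against
    # the channel owner's reference identity.
    return face.get("id") == reference_id

def active_speech(segment):
    # Hypothetical stand-in for the multimodal active speaker
    # detector: keep segments with speech and no speaker overlap.
    return segment.get("speaking", False) and not segment.get("overlap", False)

def mine_clips(frames, segments, reference_id):
    # Keep the track only if every retained frame shows exactly one
    # face and that face matches the target identity; then keep the
    # speech segments where that person is actively speaking alone.
    single_face = [
        f for f in frames
        if len(detect_faces(f)) == 1
        and same_identity(detect_faces(f)[0], reference_id)
    ]
    if not single_face:
        return []
    return [s for s in segments if active_speech(s)]
```

The cascade mirrors the described ordering: frames with zero or multiple faces are dropped before identity verification, and overlapped-speech segments are dropped last.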
To enhance data accuracy, an additional bypass step with an in-set facial recognizer was introduced; through coarse face extraction, face verification, face sampling, and training, it raised labeling accuracy from 72% to 92%.
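At its core, the in-set verification step amounts to thresholding a similarity score between a candidate face embedding and a reference embedding for the channel owner. A minimal sketch (the cosine measure is standard for embedding comparison, but the 0.5 threshold is illustrative, not taken from the paper):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(embedding, enrolled, threshold=0.5):
    # Accept a candidate face only if it is close enough to the
    # in-set reference embedding for that user.
    # (threshold=0.5 is an illustrative value, not from the paper.)
    return cosine(embedding, enrolled) >= threshold
```

In practice the threshold trades precision against recall: raising it rejects more mislabeled faces at the cost of discarding some genuine clips.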
Alongside the dataset, the VoxBlink2 team released voiceprint models of various sizes, including 2D convolutional models based on ResNet, temporal models based on ECAPA-TDNN, and an ultra-large model based on ResNet293 with a Simple Attention Module. With score post-processing, these models achieve an EER as low as 0.17% and a minDCF of 0.006 on the Vox1-O trial set.
Dataset Website: https://VoxBlink2.github.io
Dataset Download Method: https://github.com/VoxBlink2/ScriptsForVoxBlink2
Metafiles and Models: https://drive.google.com/drive/folders/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP
Paper Address: https://arxiv.org/abs/2407.11510