whisper-diarization
Automatic speech recognition and speaker segmentation based on OpenAI Whisper
Common Product | Programming | Speech Recognition | Speaker Segmentation
whisper-diarization is an open-source project that combines Whisper's automatic speech recognition (ASR) with Voice Activity Detection (VAD) and speaker embeddings. The pipeline first isolates the vocal portion of the audio, which improves the accuracy of the speaker embeddings. Whisper then generates the transcription, and WhisperX corrects and aligns the timestamps to minimize segmentation errors caused by time shifts. Next, MarbleNet performs VAD and segmentation to exclude silence, and TitaNet extracts speaker embeddings to identify the speaker of each segment. Finally, the speaker segments are matched against the timestamps produced by WhisperX to determine the speaker of each word, and a punctuation model is used to realign the result and compensate for minor timing offsets.
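The final step, matching word timestamps to speaker segments, can be illustrated with a minimal sketch. This is not the project's actual code: the `Word` and `Turn` types, the `assign_speakers` function, and the "largest overlap, then nearest midpoint" rule are all assumptions made for illustration. The sketch only shows the general idea of labeling each word by comparing its time span with the speaker turns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with aligned timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class Turn:
    """A contiguous span attributed to one speaker (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each word with the speaker whose turn overlaps it the most.

    Assumed rule, not taken from the project: if no turn overlaps the word
    (for example, the word falls in a gap removed as silence), fall back to
    the turn whose midpoint is closest to the word's midpoint.
    """
    if not turns:
        raise ValueError("at least one speaker turn is required")

    labeled = []
    for w in words:
        word_mid = (w.start + w.end) / 2
        best = max(
            turns,
            key=lambda t: (
                overlap(w.start, w.end, t.start, t.end),   # primary: overlap
                -abs((t.start + t.end) / 2 - word_mid),    # tie-break: proximity
            ),
        )
        labeled.append((w.text, best.speaker))
    return labeled


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.3)]
    turns = [Turn("SPEAKER_00", 0.0, 1.0), Turn("SPEAKER_01", 1.0, 2.0)]
    for text, speaker in assign_speakers(words, turns):
        print(f"{speaker}: {text}")
```

A word that straddles a speaker change can still be mislabeled by a purely timestamp-based rule like this one, which is the kind of minor offset the punctuation-based realignment step is described as compensating for.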
whisper-diarization Visit Over Time
Monthly Visits: 515,580,771
Bounce Rate: 37.20%
Pages per Visit: 5.8
Visit Duration: 00:06:42