Alibaba's Tongyi Lab recently released a new digital human video generation model called "OmniTalker." Its core capability is precisely mimicking a person's expressions, voice, and speaking style from a single uploaded reference video. Compared with traditional digital human production pipelines, OmniTalker significantly reduces production costs while improving the realism and interactivity of the generated content.
OmniTalker is straightforward to use: users upload a reference video and supply the text to be spoken, and the model generates synchronized audio and video. The project is currently available for free on platforms such as the ModelScope community and Hugging Face, with various templates for customization. To showcase the technology, Tongyi Lab published several example videos in which viewers struggle to distinguish the AI-generated footage from real recordings.
The model's development is motivated by the rapid advances in large language models in recent years, which have made virtual anchors and virtual assistants increasingly common. Text-driven digital human generation, however, has received comparatively little research attention, and traditional methods built on cascaded pipelines frequently suffer from audio-visual desynchronization and inconsistent speaking styles. OmniTalker addresses these bottlenecks with a dual-branch DiT (Diffusion Transformer) architecture that generates synchronized speech and video simultaneously from the input text and the reference video.
Architecturally, OmniTalker comprises three core components. First, the model extracts audio and visual features, ensuring perfect temporal synchronization. Second, a multi-modal feature fusion module enhances the integration of audio and video. Finally, a pre-trained decoder efficiently converts the synthesized audio-video features into their original formats, guaranteeing high-quality output.
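The released paper, not this article, defines the actual architecture; but the synchronization idea described above can be illustrated with a toy sketch. In the snippet below (plain NumPy, all dimensions and the concatenation-based fusion rule are illustrative stand-ins, not the model's real layers), the audio and visual branches share one time axis and both decode from the same fused features, so their outputs are frame-aligned by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    """A bare linear layer: x @ w + b."""
    return x @ w + b

T, D = 8, 16  # shared number of time steps and feature width (illustrative)

# Branch features derived from the text and reference video
# (random stand-ins here; the real model extracts these with its encoders).
audio_feats = rng.standard_normal((T, D))
visual_feats = rng.standard_normal((T, D))

# Multi-modal fusion: join the two streams per time step, project back to D.
w_fuse = rng.standard_normal((2 * D, D)) * 0.1
fused = linear(np.concatenate([audio_feats, visual_feats], axis=1),
               w_fuse, np.zeros(D))

# Separate decoder heads read the SAME fused sequence, so audio and video
# outputs cover identical time steps -- synchronization by construction.
w_audio = rng.standard_normal((D, D)) * 0.1
w_video = rng.standard_normal((D, D)) * 0.1
audio_out = linear(fused, w_audio, np.zeros(D))
video_out = linear(fused, w_video, np.zeros(D))

print(audio_out.shape, video_out.shape)  # both (8, 16)
```

The design point the sketch captures is that a cascaded pipeline (text → speech → video) can drift out of sync, whereas decoding both modalities from one jointly fused representation ties them to the same timeline.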
Comparative experiments show that OmniTalker performs strongly in both audio generation and visual quality, achieving lower error rates and higher voice similarity than prior methods and demonstrating robust zero-shot capability.
Paper: https://arxiv.org/abs/2504.02433v1
Project Page: https://humanaigc.github.io/omnitalker
Demo Page: https://huggingface.co/spaces/Mrwrichard/OmniTalker