Recently, ByteDance announced an artificial intelligence system called INFP, which animates static portrait photos so that they appear to "speak" and react in response to audio input. Unlike earlier talking-head tools, INFP does not require speaking and listening roles to be specified manually; the system determines the roles automatically from the flow of the conversation.

The workflow of INFP consists of two main steps. The first step, known as "motion-based head imitation," involves the system analyzing facial expressions and head movements during conversations to extract details from videos. This motion data is then converted into a format that can be used for subsequent animation, allowing static photos to match the original person's movements.

The second step is "audio-guided motion generation," where the system generates natural movement patterns based on audio input. The research team developed a tool called the "motion guider," which analyzes the audio from both parties in a conversation to create speaking and listening motion patterns. A diffusion transformer then refines these patterns step by step to produce smooth, realistic movements that closely follow the audio content.
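The second step can be illustrated with a toy sketch. This is not ByteDance's implementation: the two-value motion prototypes, the energy-based blending in `motion_guider`, and the `refine` loop are all assumptions made for illustration, and `refine` is a simple interpolation standing in for the learned, iterative denoising that a real diffusion transformer performs.

```python
import random

# Hypothetical 2-D motion "prototypes"; a real system would learn these from data.
SPEAKING = [1.0, 0.2]   # e.g. strong lip motion, slight head motion
LISTENING = [0.0, 0.6]  # e.g. no lip motion, more nodding

def motion_guider(agent_energy, partner_energy):
    """Blend speaking/listening prototypes per frame from two audio tracks."""
    guide = []
    for a, p in zip(agent_energy, partner_energy):
        total = a + p
        # Share of the audio energy that belongs to the animated person.
        w = a / total if total > 0 else 0.0
        guide.append([w * s + (1 - w) * l for s, l in zip(SPEAKING, LISTENING)])
    return guide

def refine(guide, steps=20, rate=0.3, seed=0):
    """Toy stand-in for iterative denoising: start from noise, step toward the guide."""
    rng = random.Random(seed)
    motion = [[rng.gauss(0, 1) for _ in frame] for frame in guide]
    for _ in range(steps):
        motion = [[m + rate * (g - m) for m, g in zip(mf, gf)]
                  for mf, gf in zip(motion, guide)]
    return motion

agent = [0.9, 0.8, 0.0, 0.0]    # the animated person talks in frames 0-1
partner = [0.0, 0.1, 0.7, 0.9]  # the other party talks in frames 2-3
motion = refine(motion_guider(agent, partner))
```

Because both audio tracks feed the guide, no speaker or listener label is passed in: the blend shifts from the speaking prototype to the listening prototype as the other party takes over the conversation.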

To effectively train the system, the research team also established a dialogue dataset named DyConv, which compiles over 200 hours of real conversation videos. Compared to existing dialogue databases (such as ViCo and RealTalk), DyConv has unique advantages in emotional expression and video quality.

ByteDance stated that INFP outperforms existing tools in several key areas, particularly in matching lip movements to speech, preserving individual facial features, and creating diverse, natural motions. The system also performs well when generating videos that show only the listener in a conversation.

Although INFP currently supports only audio input, the research team is exploring the possibility of extending the system to image and text input, with the future goal of creating realistic animations of full-body characters. However, because such technology could be used to create fake videos and spread misinformation, the research team plans to restrict the core technology to research institutions, similar to how Microsoft manages its advanced voice cloning system.

This technology is part of ByteDance's broader AI strategy, which leverages its popular applications TikTok and CapCut as a vast platform for applying AI innovations.

Project link: https://grisoon.github.io/INFP/

Key Points:

🎤 INFP allows static portraits to "speak" through audio and automatically determines conversation roles.

🎥 The system works in two steps: first extracting motion details from human conversations, and then converting audio into natural motion patterns.

📊 ByteDance's DyConv dataset contains over 200 hours of high-quality conversation videos, helping to enhance system performance.