Recently, a new technology called INFP (Interactive, Natural, Flash and Person-generic) has garnered widespread attention. This technology aims to address the current lack of interaction in AI avatars during two-person conversations, allowing virtual characters to dynamically adjust their expressions and movements based on the dialogue, just like real people.
Goodbye to "Solo Comedy," Hello to "Duet Performance"
Previous AI avatars could either only talk at you, like a "solo comedian," or sit there passively like a mannequin, giving no feedback at all. Human conversation isn't like that! When we talk, we make eye contact, nod, frown, and even crack jokes; that's real interaction!
INFP sets out to change this awkward situation completely! Like the conductor of a "duet performance," it dynamically adjusts the AI avatar's expressions and movements based on the audio of your conversation, making you feel as if you were talking to a real person!
INFP's "Secret Weapons": Two Essential Skills!
INFP is so powerful mainly due to its two "secret weapons":
Motion-Based Head Imitation:
It first learns human expressions and movements from a vast amount of real conversation videos, like a "master of imitation," compressing these complex behaviors into "motion codes."
To make the movements more realistic, it pays special attention to the eyes and mouth, giving them a "close-up" treatment.
It also uses facial keypoints to assist in generating expressions, ensuring the accuracy and naturalness of movements.
Then, it applies these "motion codes" to a static avatar, bringing it to life instantly, almost like magic! (See the sketch right after this list.)
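To make this concrete, here is a minimal sketch of the imitation stage. This is not the authors' code: every name and size (MotionEncoder, Animator, 68 keypoints, code_dim=128) is an illustrative assumption. It only shows the core idea of compressing facial keypoints into a compact motion code, then using that code to drive a static portrait.

```python
# Toy sketch of the motion-imitation idea (illustrative assumptions only).
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Compresses per-frame facial keypoints into a compact 'motion code'."""
    def __init__(self, num_keypoints=68, code_dim=128):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, keypoints):                # (B, 68, 2) x/y keypoints
        return self.net(keypoints.flatten(1))    # (B, 128) motion code

class Animator(nn.Module):
    """Applies a motion code to a static portrait; a toy stand-in for the
    real warping/generation network."""
    def __init__(self, code_dim=128):
        super().__init__()
        self.modulate = nn.Linear(code_dim, 3)
        self.refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, portrait, code):           # portrait: (B, 3, H, W)
        shift = self.modulate(code)[:, :, None, None]
        return self.refine(portrait + shift)     # driven frame

portrait = torch.randn(1, 3, 256, 256)   # one static avatar image
keypoints = torch.randn(1, 68, 2)        # keypoints from a real conversation frame
frame = Animator()(portrait, MotionEncoder()(keypoints))
```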
Audio-Guided Motion Generation:
This "generator" is even more impressive: it understands the audio of your conversation with the AI, like an expert at telling whose voice is whose.
It analyzes who is speaking and who is listening in the audio, dynamically adjusting the AI avatar's state, allowing it to seamlessly switch between "speaking" and "listening" without any manual role switching.
It is equipped with two "memory banks" that store various actions for "speaking" and "listening," like two "treasure chests" ready to provide the most suitable actions at any time.
It can also adjust the AI avatar's emotions and attitudes based on your vocal style, making the conversation more lively and engaging.
Finally, it uses a diffusion model to turn these motions into smooth, natural animation with no stuttering. (A toy version of this stage follows below.)
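Here is a hedged toy sketch of this second stage, under assumed shapes and names rather than the paper's actual architecture: audio features from both speakers attend over two learnable memory banks, one holding "speaking" motions and one holding "listening" motions, and a small denoiser iteratively refines a noisy motion-code sequence, diffusion style.

```python
# Toy sketch of audio-guided motion generation (all names/sizes assumed).
import torch
import torch.nn as nn

class AudioToMotionDiffusion(nn.Module):
    def __init__(self, audio_dim=80, code_dim=128, bank_size=64):
        super().__init__()
        # Two learnable memory banks: speaking motions and listening motions.
        self.speak_bank = nn.Parameter(torch.randn(bank_size, code_dim))
        self.listen_bank = nn.Parameter(torch.randn(bank_size, code_dim))
        self.audio_proj = nn.Linear(audio_dim * 2, code_dim)  # both audio tracks
        self.attn = nn.MultiheadAttention(code_dim, num_heads=4, batch_first=True)
        self.denoise = nn.Sequential(
            nn.Linear(code_dim * 2, 256), nn.ReLU(), nn.Linear(256, code_dim),
        )

    def forward(self, noisy_codes, self_audio, other_audio):
        # noisy_codes: (B, T, code_dim); audio features: (B, T, audio_dim) each
        q = self.audio_proj(torch.cat([self_audio, other_audio], dim=-1))
        banks = torch.cat([self.speak_bank, self.listen_bank], dim=0)
        banks = banks.unsqueeze(0).expand(q.size(0), -1, -1)
        # The audio decides which bank entries (speak vs. listen) to retrieve.
        ctx, _ = self.attn(q, banks, banks)
        return self.denoise(torch.cat([noisy_codes, ctx], dim=-1))

model = AudioToMotionDiffusion()
B, T = 1, 25
codes = torch.randn(B, T, 128)        # start from pure noise
self_audio = torch.randn(B, T, 80)    # the avatar's own audio track
other_audio = torch.randn(B, T, 80)   # the human's audio track
for _ in range(10):                   # toy iterative denoising loop
    codes = model(codes, self_audio, other_audio)
```

The design point of this sketch is that the audio itself, not a manual switch, decides which memory-bank entries to retrieve, which is what would let the avatar flip between speaking and listening on its own.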
DyConv: A Huge Dialogue Dataset Full of "Gossip"!
To train the INFP "super AI," researchers specifically collected a massive dialogue dataset called DyConv!
This dataset contains over 200 hours of dialogue videos, featuring people from all walks of life discussing a wide range of topics, making it a "gossip haven."
The video quality of the DyConv dataset is very high, ensuring that everyone’s face is clearly visible.
Researchers also used advanced speech-separation models to pull out each person's voice as its own clean track, making it easier for the AI to learn. (A toy version of this step is sketched below.)
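As a rough illustration of that preprocessing, the sketch below runs an off-the-shelf source-separation model from torchaudio on a hypothetical two-person clip. The authors' actual separation model is not specified here, so this particular choice (and the file names) is purely an assumption.

```python
# Toy speaker-separation pass over a hypothetical dyadic clip.
import torch
import torchaudio

bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX  # stand-in separator
separator = bundle.get_model()

waveform, sr = torchaudio.load("dyadic_clip.wav")        # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
mix = waveform.mean(dim=0, keepdim=True).unsqueeze(0)    # (1, 1, time) mono mix

with torch.no_grad():
    sources = separator(mix)                             # (1, 2, time): two voices

# One clean track per speaker, ready to pair with the corresponding face.
torchaudio.save("speaker_a.wav", sources[0, 0:1], bundle.sample_rate)
torchaudio.save("speaker_b.wav", sources[0, 1:2], bundle.sample_rate)
```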
INFP's "Diverse Skills": Not Just Conversations, But Also...
INFP can shine not only in two-person conversations but also in other scenarios:
"Listening Mode": It can respond with appropriate expressions and movements based on what the other person is saying, just like a "dedicated student."
"Talking Head Mode": It can create realistic lip movements based on audio, like a "master of lip-syncing."
To prove INFP's power, researchers conducted numerous experiments, and the results showed:
On various metrics, INFP outperformed other similar methods, achieving outstanding results in video quality, lip synchronization, and motion diversity.
In user studies, participants consistently rated INFP's videos as more natural and lively, and as matching the audio more closely.
Researchers also conducted ablation experiments, demonstrating that each module in INFP is essential.
Project link: https://grisoon.github.io/INFP/