Fei-Fei Li's team has introduced a new multimodal model that can understand and generate human motion. By integrating a language model, it unifies the processing of verbal and non-verbal communication. The research enables machines not only to comprehend human instructions but also to read the emotions conveyed through body movements, paving the way for more natural human-computer interaction.
The core of the model is a multimodal language-model framework that accepts input in several forms, including audio, motion, and text, and produces output in whichever modality is requested. Combined with a generative pre-training strategy, the model performs strongly across multiple tasks. In co-speech gesture generation, for instance, it not only surpasses the prior state of the art but also requires far less training data. The model also unlocks new applications, such as editable gesture generation and predicting emotion from motion.
Human communication is inherently multimodal, encompassing verbal and non-verbal cues such as speech, facial expressions, and body posture. The ability to understand these multimodal behaviors is crucial for creating virtual characters that communicate naturally in games, movies, and virtual reality. However, existing motion generation models are often limited to specific input modalities (speech, text, or motion data) and fail to fully exploit the diversity of available data.
The researchers use a language model to unify verbal and non-verbal communication for three main reasons:
First, language models naturally connect different modalities.
Second, speech is highly semantic, and tasks such as modeling a person's reaction to a joke demand strong semantic reasoning.
Third, language models acquire powerful semantic understanding through extensive pre-training.
To achieve this, the research team first splits the body into parts (face, hands, upper body, lower body) and tokenizes the motion of each part separately. Combined with tokenizers for text and speech, input in any modality can be represented as a sequence of tokens for the language model to consume. Training proceeds in two phases: a pre-training stage aligns the various modalities with compositional body motion and aligns audio with text; downstream tasks are then cast as instructions, and the model is trained on these instructions so it can follow a variety of task directives.
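As a rough sketch of how this could work in practice (the vocabulary sizes, offsets, and function names below are illustrative assumptions, not details from the paper), tokens from each modality can be remapped into one shared index space before being concatenated for the language model:

```python
# Hypothetical sketch: merging per-part motion codes, audio tokens, and text
# tokens into one shared vocabulary. Sizes and names are illustrative only.

TEXT_VOCAB = 32_000      # e.g. a text tokenizer vocabulary (assumed size)
AUDIO_VOCAB = 1_024      # discrete audio codes (assumed size)
PART_VOCAB = 512         # VQ codebook size per body part (assumed size)
PARTS = ["face", "hands", "upper_body", "lower_body"]

# Each modality gets its own contiguous range of token ids.
AUDIO_OFFSET = TEXT_VOCAB
PART_OFFSETS = {
    part: AUDIO_OFFSET + AUDIO_VOCAB + i * PART_VOCAB
    for i, part in enumerate(PARTS)
}

def encode_motion(part_codes: dict) -> list:
    """Interleave per-part VQ codes frame by frame into one token stream."""
    num_frames = len(next(iter(part_codes.values())))
    tokens = []
    for t in range(num_frames):
        for part in PARTS:
            tokens.append(PART_OFFSETS[part] + part_codes[part][t])
    return tokens

def build_input(text_tokens, audio_tokens, part_codes):
    """Concatenate all modalities into a single sequence for the language model."""
    audio = [AUDIO_OFFSET + a for a in audio_tokens]
    return text_tokens + audio + encode_motion(part_codes)
```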
On the BEATv2 co-speech gesture generation benchmark, the model performs strongly, far exceeding existing models. The pre-training strategy also proves effective, showing strong generalization especially when data is scarce. Through post-training on speech-to-motion and text-to-motion tasks, the model can not only follow audio and text prompts but also gains new abilities such as predicting emotion from motion data.
Technically, the model uses modality-specific tokenizers to process each input modality. Specifically, it trains a compositional body-motion VQ-VAE that converts facial, hand, upper-body, and lower-body motion into discrete tokens. These part-specific vocabularies, together with those for audio and text, are then merged into a unified multimodal vocabulary. During training, mixed tokens from different modalities serve as input, and outputs are generated by an encoder-decoder language model.
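A minimal sketch of the vector-quantization lookup at the heart of such a motion VQ-VAE (the codebook size, feature dimension, and class name are assumptions, and the encoder and decoder around it are omitted):

```python
import torch
import torch.nn as nn

class MotionQuantizer(nn.Module):
    """Toy VQ layer: maps continuous motion features to discrete code indices.
    Codebook size and feature dimension are illustrative assumptions."""

    def __init__(self, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (frames, dim) continuous encoder output for one body part.
        # Pick the nearest codebook entry for each frame (L2 distance).
        dists = torch.cdist(features, self.codebook.weight)  # (frames, codebook_size)
        return dists.argmin(dim=-1)                          # one discrete token per frame

# One quantizer per body part yields part-specific token streams that are
# later merged into the shared multimodal vocabulary.
quantizers = {p: MotionQuantizer() for p in ["face", "hands", "upper_body", "lower_body"]}
```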
This unified multimodal vocabulary lets the model treat data from every modality as one token format. In the pre-training phase, the model learns correspondences between modalities through inter-modal translation tasks; for example, it learns to translate upper-body motion into lower-body motion, or audio into text. It also learns the temporal evolution of motion by randomly masking motion frames and predicting them.
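To make these two pre-training objectives concrete, here is a hedged sketch of how such training examples might be assembled (the mask token, masking rate, and task naming are assumptions, not the paper's implementation):

```python
import random

MASK_TOKEN = -1   # placeholder id for a masked motion frame (assumed)

def mask_motion_frames(motion_tokens: list, mask_rate: float = 0.15):
    """Randomly hide motion tokens; the model must reconstruct the originals,
    which teaches it how motion evolves over time."""
    inputs, targets = [], []
    for tok in motion_tokens:
        if random.random() < mask_rate:
            inputs.append(MASK_TOKEN)
            targets.append(tok)          # predict the hidden token
        else:
            inputs.append(tok)
            targets.append(-100)         # common "ignore this position" label convention
    return inputs, targets

def translation_example(src_tokens, tgt_tokens, src_name, tgt_name):
    """An inter-modal translation pair, e.g. upper-body -> lower-body motion
    or audio -> text, framed as source/target sequences."""
    return {"source": src_tokens, "target": tgt_tokens,
            "task": f"{src_name}_to_{tgt_name}"}
```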
In the post-training phase, the model is fine-tuned on paired data for downstream tasks such as co-speech gesture generation and text-to-motion generation. To let the model follow natural human instructions, the researchers built a multi-task instruction-following template that casts tasks such as audio-to-motion, text-to-motion, and emotion-to-motion as instructions. The model can also edit gestures, generating coordinated full-body motion from text and audio prompts.
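A plausible shape for such an instruction template, with wording and field names that are illustrative rather than quoted from the paper:

```python
# Hypothetical instruction templates for turning downstream tasks into
# natural-language prompts; the exact wording in the paper may differ.
TEMPLATES = {
    "audio_to_motion":   "Generate a full-body gesture that matches this speech: {audio}",
    "text_to_motion":    "Generate a motion sequence described by: {text}",
    "emotion_to_motion": "Generate a gesture expressing the emotion: {emotion}",
    "motion_to_emotion": "What emotion does this motion express? {motion}",
}

def build_instruction(task: str, **fields) -> str:
    """Fill a task template so every downstream task becomes an instruction."""
    return TEMPLATES[task].format(**fields)

# Example: build_instruction("text_to_motion", text="wave both hands excitedly")
```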
Finally, the model unlocks a new capability: predicting emotion from motion, which matters in fields such as mental health and psychiatry. Compared with other models, it predicts the emotions expressed through body movements more accurately, demonstrating strong body-language understanding.
This research shows that unifying the verbal and non-verbal language of human motion is essential for practical applications, and that language models provide a powerful framework for doing so.
Paper link: https://arxiv.org/pdf/2412.10523v1