Product Entry: https://top.aibase.com/tool/controlmm
ControlMM is designed to address several challenges in full-body multi-modal motion generation controlled by text, speech, or music: motion distribution drift across different generation scenarios, the complex optimization of mixed conditions at varying granularities, and the inconsistent motion formats used by existing datasets.
To tackle these challenges, the researchers propose a series of methods. First, the ControlMM-Attn module models static and dynamic human topology graphs in parallel, so that motion knowledge can be learned efficiently and transferred across different motion distributions.
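The paper's exact architecture is not reproduced here, but a minimal PyTorch sketch of what parallel static/dynamic topology attention could look like is shown below; the module name, the learned adjacency bias, and the gated fusion are illustrative assumptions, not the official ControlMM-Attn design.

```python
import torch
import torch.nn as nn

class ParallelTopologyAttention(nn.Module):
    """Illustrative sketch (not the official ControlMM-Attn): one attention
    branch over joints within each frame, biased by a learned static skeleton
    topology, and one over frames per joint for dynamics, fused by a gate."""

    def __init__(self, dim: int, num_joints: int, num_heads: int = 8):
        super().__init__()
        self.static_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dynamic_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned additive bias encoding the fixed skeletal topology (assumed design).
        self.static_bias = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim) per-joint motion tokens
        b, t, j, d = x.shape
        # Static branch: attention across joints within each frame.
        xs = x.reshape(b * t, j, d)
        s_out, _ = self.static_attn(xs, xs, xs, attn_mask=self.static_bias)
        s_out = s_out.reshape(b, t, j, d)
        # Dynamic branch: attention across frames for each joint.
        xd = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        d_out, _ = self.dynamic_attn(xd, xd, xd)
        d_out = d_out.reshape(b, j, t, d).permute(0, 2, 1, 3)
        # Gated fusion of the two branches plus a residual connection.
        g = torch.sigmoid(self.gate(x))
        return x + g * s_out + (1 - g) * d_out
```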
Second, ControlMM employs a coarse-to-fine training strategy: a first stage of text-to-motion pre-training establishes semantic generation, and a second stage adapts the model to multi-modal, low-level control conditions at different granularities.
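A minimal sketch of such a two-stage schedule, assuming a hypothetical model interface (`text_to_motion_loss`, `control_adaptation_loss`, a freezable `backbone`), might look like the following; the optimizer settings and the freeze/adapt split are assumptions, not details taken from the paper.

```python
import torch

def train_two_stage_sketch(model, t2m_loader, multimodal_loader,
                           steps=(100_000, 50_000)):
    """Hypothetical coarse-to-fine schedule; interfaces are assumed."""
    # Stage 1: text-to-motion pre-training to learn coarse semantic generation.
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
    for _, (text, motion) in zip(range(steps[0]), t2m_loader):
        loss = model.text_to_motion_loss(text, motion)   # assumed interface
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: adapt to low-level control signals (speech, music, trajectories)
    # while keeping the pretrained semantic backbone frozen (assumed split).
    for p in model.backbone.parameters():                # assumed attribute
        p.requires_grad = False
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _, (condition, motion) in zip(range(steps[1]), multimodal_loader):
        loss = model.control_adaptation_loss(condition, motion)  # assumed
        opt.zero_grad()
        loss.backward()
        opt.step()
```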
Additionally, to address the inconsistent motion formats of existing benchmarks, the authors introduce ControlMM-Bench, the first publicly available multi-modal full-body human motion generation benchmark built on a unified full-body SMPL-X format.
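For reference, the standard SMPL-X parameterization that such a unified format builds on can be summarized as a simple per-frame record; the field dimensions below follow the public SMPL-X model, and the exact fields stored by ControlMM-Bench may differ.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SMPLXFrame:
    """One frame of full-body motion in standard SMPL-X parameters."""
    transl: np.ndarray           # (3,)  global translation
    global_orient: np.ndarray    # (3,)  root orientation, axis-angle
    body_pose: np.ndarray        # (63,) 21 body joints x 3 (axis-angle)
    left_hand_pose: np.ndarray   # (45,) 15 hand joints x 3
    right_hand_pose: np.ndarray  # (45,) 15 hand joints x 3
    jaw_pose: np.ndarray         # (3,)  jaw rotation
    expression: np.ndarray       # (10,) facial expression coefficients
    betas: np.ndarray            # (10,) body shape coefficients
```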
Through extensive experiments, ControlMM demonstrates superior performance on standard motion generation tasks, whether Text-to-Motion, Speech-to-Gesture, or Music-to-Dance. Compared to baseline models, ControlMM shows clear advantages in controllability, motion continuity, and motion plausibility.
Key Features of ControlMM:
1. **Multi-Modal Control**: ControlMM supports full-body motion generation from multiple modalities, including text, speech, and music, enhancing control capabilities and adaptability (see the usage sketch after this list).
2. **Unified Framework**: A single ControlMM framework integrates multiple motion generation tasks, improving generation efficiency.
3. **Stage-Based Training Strategy**: A coarse-to-fine training strategy starts with text-to-motion pre-training and then adapts to low-level control signals, ensuring effectiveness under control conditions of different granularities.
4. **Efficient Motion Knowledge Learning**: The ControlMM-Attn module models dynamic and static human topology graphs in parallel, optimizing the motion sequence representation and improving the accuracy of generated motion.
5. **New Benchmark Introduction**: The introduction of ControlMM-Bench provides the first publicly available multi-modal full-body motion generation benchmark based on a unified SMPL-X format, aiding research and application in the field.
6. **Superior Generation Performance**: ControlMM delivers leading performance across standard motion generation tasks, with strong controllability, motion continuity, and plausibility.
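As a rough illustration of how a single entry point might expose the three control modalities described above, here is a short Python sketch; the function name, argument layout, and `model.sample` call are assumptions for illustration, not the project's actual API.

```python
import numpy as np

def generate_motion(model, text=None, speech=None, music=None, num_frames=120):
    """Hypothetical unified entry point for text, speech, and music control."""
    conditions = {}
    if text is not None:
        conditions["text"] = text      # high-level semantic control
    if speech is not None:
        conditions["speech"] = speech  # audio features for gesture generation
    if music is not None:
        conditions["music"] = music    # beat/style features for dance
    if not conditions:
        raise ValueError("at least one control signal is required")
    # Assumed sampling interface returning a full-body SMPL-X motion sequence.
    return model.sample(conditions, num_frames=num_frames)
```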