The Chinese University of Hong Kong and Tencent have jointly launched ControlMM, a new framework that marks a significant step forward in full-body motion generation. The framework accepts multi-modal inputs such as text, speech, and music, and generates full-body motions that match the input content.


Product Entry: https://top.aibase.com/tool/controlmm

ControlMM is designed to address several open challenges in text-, speech-, and music-controlled full-body motion generation: motion distribution drift across different generation scenarios, the complex optimization of mixed conditions at varying granularities, and the inconsistent motion formats of existing datasets.

To tackle these challenges, the researchers propose a series of methods. First, ControlMM-Attn models static and dynamic human topology graphs in parallel, so that motion knowledge can be learned efficiently and transferred across different motion distributions.
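To make the parallel-branch idea concrete, here is a minimal PyTorch sketch of attention applied to the static topology (joints within a frame) and the dynamic topology (each joint across time) in parallel. The class name, layer choices, and fusion step are illustrative assumptions, not the published ControlMM-Attn design.

```python
import torch
import torch.nn as nn

class ParallelTopologyAttention(nn.Module):
    """Illustrative sketch: attention over the static topology (joints
    within a frame) and the dynamic topology (each joint across time),
    run in parallel and fused. Input shape: (batch, time, joints, dim)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, j, d = x.shape
        # Static branch: attend across joints, one frame at a time.
        xs = x.reshape(b * t, j, d)
        s, _ = self.spatial(xs, xs, xs)
        s = s.reshape(b, t, j, d)
        # Dynamic branch: attend across time, one joint at a time.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        m, _ = self.temporal(xt, xt, xt)
        m = m.reshape(b, j, t, d).permute(0, 2, 1, 3)
        # Fuse the two parallel branches.
        return self.fuse(torch.cat([s, m], dim=-1))

x = torch.randn(2, 60, 22, 64)   # batch, frames, joints, feature dim
out = ParallelTopologyAttention(64)(x)
print(out.shape)                 # torch.Size([2, 60, 22, 64])
```

Running the two branches side by side and fusing them lets a single layer see both the skeleton's spatial structure and each joint's trajectory over time.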

Second, ControlMM employs a coarse-to-fine training strategy: Stage 1 pre-trains on text-to-motion for semantically aligned generation, and Stage 2 adapts the model to multi-modal control under lower-level, fine-grained conditions.
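The two stages can be pictured with the training-loop sketch below, assuming a PyTorch-style model. The loss methods (`text_to_motion_loss`, `control_loss`), the `backbone` attribute, and the freezing choice are hypothetical placeholders, not the paper's exact recipe.

```python
import torch

def train_controlmm(model, text_loader, control_loader, epochs=(100, 50)):
    """Illustrative coarse-to-fine schedule with placeholder loss methods."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Stage 1: text-to-motion pre-training for coarse semantic generation.
    for _ in range(epochs[0]):
        for text, motion in text_loader:
            loss = model.text_to_motion_loss(text, motion)  # hypothetical
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()

    # Stage 2: adapt to fine-grained control signals (speech, music).
    # Freezing the pre-trained backbone (an assumed attribute) preserves
    # the semantic prior from Stage 1 while control-specific layers adapt.
    for p in model.backbone.parameters():
        p.requires_grad = False
    for _ in range(epochs[1]):
        for cond, motion in control_loader:
            loss = model.control_loss(cond, motion)  # hypothetical
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
```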

Additionally, to address the inconsistent motion formats of existing benchmarks, the team introduces ControlMM-Bench, the first publicly available benchmark for multi-modal full-body human motion generation built on a unified full-body SMPL-X format.
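For reference, a single motion sequence in a unified SMPL-X format can be represented roughly as the record below. The field names and shapes follow the standard SMPL-X parameterization; the exact ControlMM-Bench schema is an assumption here.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SMPLXSequence:
    """One full-body motion clip in SMPL-X parameters (eye poses omitted)."""
    betas: np.ndarray            # (10,)       body shape coefficients
    global_orient: np.ndarray    # (T, 3)      root rotation, axis-angle
    transl: np.ndarray           # (T, 3)      root translation
    body_pose: np.ndarray        # (T, 21, 3)  body joint rotations
    left_hand_pose: np.ndarray   # (T, 15, 3)  finger joint rotations
    right_hand_pose: np.ndarray  # (T, 15, 3)
    jaw_pose: np.ndarray         # (T, 3)
    expression: np.ndarray       # (T, 10)     facial expression coefficients
    fps: float = 30.0
```

Because body, hands, and face all live in one parameter set, motions converted from text, speech, and music datasets become directly comparable.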

Extensive experiments show that ControlMM achieves superior performance across standard motion generation tasks, including Text-to-Motion, Speech-to-Gesture, and Music-to-Dance. Compared with baseline models, it offers clear advantages in controllability, temporal continuity, and motion plausibility.

Key Features of ControlMM:

1. **Multi-Modal Control**: ControlMM supports full-body motion generation conditioned on text, speech, or music, improving controllability and adaptability (see the conditioning sketch after this list).

2. **Unified Framework**: A single ControlMM framework integrates multiple motion generation tasks, improving generation efficiency.

3. **Stage-Based Training Strategy**: A coarse-to-fine training strategy first pre-trains on text-to-motion and then adapts to low-level control signals, ensuring effectiveness under conditions of different granularities.

4. **Efficient Motion Knowledge Learning**: The ControlMM-Attn module models dynamic and static human topology graphs in parallel, optimizing the motion sequence representation and improving the accuracy of generated motion.

5. **New Benchmark Introduction**: The introduction of ControlMM-Bench provides the first publicly available multi-modal full-body motion generation benchmark based on a unified SMPL-X format, aiding research and application in the field.

6. **Superior Generation Performance**: ControlMM delivers leading performance on standard motion generation tasks, with strong controllability, temporal continuity, and motion plausibility.
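As a rough illustration of the multi-modal control in feature 1, the sketch below projects features from each modality into one shared conditioning space that a single generator could consume. The encoder placeholders are assumptions for illustration, not ControlMM's confirmed components.

```python
import torch
import torch.nn as nn

class MultiModalCondition(nn.Module):
    """Maps text, speech, or music features into one shared conditioning
    space so a single motion generator can consume all three signals."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for real modality encoders (e.g., a text model or an
        # audio model); LazyLinear infers each input width on first use.
        self.encoders = nn.ModuleDict({
            "text": nn.LazyLinear(dim),
            "speech": nn.LazyLinear(dim),
            "music": nn.LazyLinear(dim),
        })

    def forward(self, modality: str, features: torch.Tensor) -> torch.Tensor:
        # Project any modality's features into the shared conditioning space.
        return self.encoders[modality](features)

cond = MultiModalCondition()
text_feat = torch.randn(1, 512)   # e.g., pooled features from a text encoder
z = cond("text", text_feat)       # (1, 256) shared conditioning vector
```

Keeping one shared conditioning space is what allows the generator itself to stay modality-agnostic.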