UniMuMo

Unified model for text, music, and motion generation.

UniMuMo is a multimodal model that accepts any combination of text, music, and motion data as input conditions and generates outputs in any of the three modalities. The model bridges these modalities by converting music, motion, and text into token-based representations within a unified encoder-decoder architecture. By fine-tuning existing pretrained unimodal models rather than training from scratch, it significantly reduces computational requirements. UniMuMo achieves competitive results on unidirectional generation benchmarks across the music, motion, and text modalities.
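The idea of a shared token space that lets one decoder condition on any mix of modalities can be sketched as follows. This is a minimal illustrative sketch, not UniMuMo's actual API: the names (`TokenSequence`, `encode`, `generate`) and the toy tokenization are assumptions for demonstration only.

```python
# Hypothetical sketch of a unified token interface (illustrative names,
# not UniMuMo's real code). Each modality is encoded into a shared
# discrete token space, so a single decoder can condition on any mix
# of inputs and emit tokens for any target modality.

from dataclasses import dataclass


@dataclass
class TokenSequence:
    modality: str   # "text", "music", or "motion"
    tokens: list    # discrete codes from that modality's encoder


def encode(modality: str, raw) -> TokenSequence:
    """Stand-in encoder: maps raw input items to toy discrete tokens."""
    # A real system would use a trained tokenizer/codec per modality.
    codes = [hash((modality, x)) % 1024 for x in raw]
    return TokenSequence(modality, codes)


def generate(conditions: list, target: str, length: int = 8) -> TokenSequence:
    """Stand-in decoder: conditions on any mix of modalities, emits target tokens."""
    context = [t for seq in conditions for t in seq.tokens]
    # Toy "generation": derive target tokens deterministically from the context.
    out = [(sum(context) + i) % 1024 for i in range(length)]
    return TokenSequence(target, out)


# Example: text + motion conditions -> music tokens
text = encode("text", "upbeat walking tune".split())
motion = encode("motion", [1, 5, 9])
music = generate([text, motion], target="music")
```

The point of the sketch is the interface shape: because every modality lands in the same discrete token vocabulary, swapping which modality is input versus output is a matter of routing token sequences, which is what makes fine-tuning pretrained unimodal models into one joint system tractable.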