New AI Audio Technology MMAudio: Automatically Voicing Videos from Video or Text Input

AIbase基地

Published in AI News · 4 min read · Dec 12, 2024


Recently, a research team from the University of Illinois at Urbana-Champaign, Sony AI, and Sony Group introduced a new technology called MMAudio, which aims to achieve high-quality video-to-audio synthesis through multimodal joint training.

MMAudio's core innovation is its ability to generate synchronized audio from video input, text input, or both, which broadens the application scenarios for audio generation. Given a video, it produces sound effects that align with the on-screen content; given only text, it generates audio that matches the description.

MMAudio is designed to be trained on a mix of audio-visual and audio-text datasets. This multimodal joint training improves the quality of the synthesized audio, and a dedicated synchronization module aligns the generated audio with the video frames, keeping audio and video content consistent.

The MMAudio codebase is still under development. According to the researchers, single-example inference already works, and the training code will be released in a future version. The project has been tested on Ubuntu and comes with an installation guide. Users need Python 3.9 or higher, along with suitable versions of PyTorch and ffmpeg, after which MMAudio can be installed with a single command.
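Before following the installation guide, it can help to confirm that the prerequisites listed above are present. The following is a minimal sketch, not part of MMAudio itself: the function name `check_prerequisites` is hypothetical, and it only checks that Python is at least 3.9, that a `torch` package can be found, and that an `ffmpeg` executable is on the PATH. It does not verify which PyTorch or ffmpeg versions MMAudio actually requires; the project's installation guide remains the reference for that.

```python
import importlib.util
import shutil
import sys

# Minimum Python version stated in the article.
MIN_PYTHON = (3, 9)


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict mapping each prerequisite name to True (found) or False.

    Hypothetical helper for illustration; it checks presence only,
    not whether the installed versions are compatible with MMAudio.
    """
    return {
        # Compare (major, minor) of the running interpreter.
        "python": sys.version_info[:2] >= tuple(min_python),
        # find_spec returns None when the package is not installed.
        "torch": importlib.util.find_spec("torch") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Running the script prints one line per prerequisite, so anything reported as missing can be installed before trying MMAudio.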

MMAudio's audio generation still has limitations: it occasionally produces unintelligible speech or unwanted background music, and it struggles with some unfamiliar concepts. The research team believes that more high-quality training data can help address these issues, and expects to keep improving MMAudio's performance as research continues.

Try it out: https://huggingface.co/spaces/hkchengrex/MMAudio

Code: https://github.com/hkchengrex/MMAudio

Key Points:
🌟 MMAudio achieves high-quality video-to-audio synthesis through multimodal joint training.
📦 MMAudio has been tested on Ubuntu, and an installation guide is provided.
⚠️ The current version has some limitations, which the research team expects to address with more high-quality training data.



This article is from AIbase Daily
