Beijing—ByteDance recently released its latest text-to-speech (TTS) model, MegaTTS3, on Hugging Face, the open-source AI community. The release has quickly drawn attention from AI researchers and developers worldwide thanks to its lightweight design and multilingual support. Based on community feedback and official information, MegaTTS3 is being hailed as a significant advance in speech synthesis.

MegaTTS3's Core Highlights

MegaTTS3, a collaborative effort between ByteDance and Zhejiang University, is an open-source speech synthesis tool. Its core model has only 45 million parameters, far fewer than traditional large-scale TTS models. This lightweight design reduces computational requirements, making the model suitable for deployment on resource-constrained hardware such as mobile phones or edge-computing devices.

The model supports Chinese and English speech generation and uniquely features mixed Chinese-English reading capabilities, smoothly handling bilingual text. Furthermore, MegaTTS3 incorporates accent intensity control, allowing users to adjust parameters to generate speech with varying degrees of accent, opening up possibilities for personalized voice applications. As one technical expert commented, "The accent intensity control is a particularly impressive feature."


Enthusiastic Response from the Open-Source Community

MegaTTS3's code and pre-trained models are freely available on GitHub and Hugging Face, allowing users to download and utilize them for research or development. According to the Hugging Face project page, MegaTTS3 aims to advance and popularize artificial intelligence through open-source and open science. This initiative continues ByteDance's tradition of open-sourcing AI technologies; previous releases like AnimateDiff-Lightning and Hyper-SD have also been well-received by the community.

Developers in the tech community have highly praised MegaTTS3's lightweight nature and practicality. A senior engineer commented, "Achieving this level of performance with only 45 million parameters makes it ideal for small teams and independent developers." Many developers plan to integrate it into educational tools to create bilingual audiobooks.

Technical Details and Future Outlook

MegaTTS3's efficiency stems from its innovative model architecture. While the specifics aren't fully public, the official documentation notes that the model generates high-quality speech and also supports voice cloning, mimicking a specific voice from just a few seconds of sample audio. ByteDance plans to add pronunciation and duration control features to MegaTTS3 in the future, further extending its flexibility and range of applications.

Meanwhile, the model's hardware requirements are relatively modest. While a GPU significantly speeds up generation, the developers state that the model can also run on a CPU, lowering the barrier to entry. However, some users on technical forums have reported installation difficulties caused by network issues or incompatible dependency versions; they are advised to check the project's GitHub issues page for solutions.
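The GPU-with-CPU-fallback behavior described above can be sketched in a few lines. This assumes the project runs on PyTorch (a common but here unconfirmed assumption); the snippet only selects a device and does not invoke MegaTTS3 itself.

```python
# Pick an inference device: prefer a CUDA GPU when one is available,
# otherwise fall back to CPU, which the developers say also works (just slower).
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # PyTorch is not installed in this environment; real inference would need it.
    device = "cpu"

print(f"Selected inference device: {device}")
```

A value of `"cpu"` here simply means generation will take longer, not that the model cannot run.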

Application Prospects and Industry Impact

MegaTTS3 opens up new possibilities in various fields. In academic research, it can be used to test the limits of speech synthesis technology. In content creation, it can provide cost-effective, high-quality narration for videos or podcasts. In education, its bilingual support and voice cloning capabilities can facilitate the development of more interactive learning tools. Developers can also embed it into smart devices for Chinese-English voice interaction.

Industry experts believe that MegaTTS3's open-source nature will accelerate innovation in speech technology for small and medium-sized enterprises and individual developers. As ByteDance's mission statement on Hugging Face says, "We are committed to democratizing artificial intelligence through open-source and open science." This lightweight, high-performance TTS model is another manifestation of this vision.

Conclusion

With the release of MegaTTS3 on Hugging Face, ByteDance once again demonstrates its leading position in AI technology research and open-source sharing. From enthusiastic discussions in the tech community to practical applications by developers, this model is injecting new vitality into the speech synthesis field. With community participation and further feature enhancements, MegaTTS3 is poised to become a significant milestone in TTS technology development.

Developers interested in experiencing MegaTTS3 can visit the project page on Hugging Face (https://huggingface.co/ByteDance/MegaTTS3) or the GitHub repository to access the code and model files. This new tool may bring about a quiet revolution in the way we interact with voice.