OuteTTS-0.1-350M
A text-to-speech synthesis model that operates through a pure language model.
CommonProductProductivityText-to-speechVoice synthesis
OuteTTS-0.1-350M is a text-to-speech synthesis technology based on a pure language model, requiring no external adapters or complex architectures, achieving high-quality voice synthesis through carefully designed prompts and audio tokenization. This model is based on the LLaMa architecture, utilizing 350 million parameters to demonstrate the potential for direct voice synthesis using language models. It processes audio in three steps: using WavTokenizer for audio tokenization, creating precise word-to-audio mappings through CTC forced alignment, and generating structured prompts that follow specific formats. The key advantages of OuteTTS include a pure language modeling approach, voice cloning capabilities, and compatibility with llama.cpp and GGUF formats.