On March 13, Sesame launched its latest speech synthesis model, CSM (Conversational Speech Model), attracting significant industry attention. According to the company's announcement, CSM uses an end-to-end, Transformer-based multimodal architecture, which lets it track conversational context and generate natural, emotionally rich speech with remarkably lifelike quality.

The model supports real-time speech generation and accepts both text and audio as input. Users can also adjust parameters that control tone, intonation, rhythm, and emotion, giving it a high degree of flexibility.

CSM is regarded as a significant breakthrough in AI speech technology; its output is natural enough that some listeners describe it as "impossible to distinguish from a human voice." Users have posted videos showcasing its near-zero-latency responses, calling it "the best model they've ever experienced." Previously, Sesame open-sourced a smaller version, CSM-1B, which supports coherent multi-turn dialogue generation and received widespread acclaim.
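For readers who want to try the open-sourced CSM-1B, the sketch below shows roughly how multi-turn, context-conditioned generation might look. It assumes the interface published in Sesame's public CSM repository (a `load_csm_1b` loader, `Segment` context objects, and a `generate` method); the exact names, arguments, and file paths here are illustrative and may differ from the current release.

```python
# Sketch: multi-turn speech generation with the open-sourced CSM-1B.
# Assumes the generator interface from Sesame's public CSM repository
# (load_csm_1b, Segment, generator.generate); verify against the repo's README.
import torch
import torchaudio
from generator import load_csm_1b, Segment

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

def load_audio(path: str) -> torch.Tensor:
    # Load a prior utterance and resample it to the model's sample rate.
    audio, sample_rate = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

# Earlier turns of the conversation (hypothetical transcripts and audio files),
# each tagged with a speaker id so the model can keep voices consistent.
context = [
    Segment(speaker=0, text="Hey, have you tried the new model?",
            audio=load_audio("turn_0.wav")),
    Segment(speaker=1, text="Yeah, the voices sound surprisingly natural.",
            audio=load_audio("turn_1.wav")),
]

# Generate the next turn, conditioned on the dialogue so far.
audio = generator.generate(
    text="I agree, the prosody carries over between turns.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("turn_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

Conditioning on prior segments is what gives the multi-turn generation its coherence: the model hears both the transcripts and the audio of earlier turns rather than synthesizing each sentence in isolation.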

CSM is trained primarily on English, where it performs exceptionally well, but its multilingual support is still limited. It does not yet support Chinese, though future expansion is anticipated.

Sesame has indicated it will partially open-source its research findings, and community developers are already enthusiastically discussing its potential on GitHub. CSM is not only applicable to conversational AI but also has the potential to revolutionize voice interaction experiences in education, entertainment, and other fields. Industry experts believe that CSM could redefine the standards for AI voice assistants, leading to more natural human-computer interaction.