CSM (Conversational Speech Model) is a speech generation model developed by Sesame. It generates high-quality speech from text and audio input. The model uses a Llama-based backbone together with the Mimi audio codec. It is mainly used for speech synthesis and interactive voice applications, such as voice assistants and educational tools. Its main advantages are natural, fluent speech output and the ability to condition that output on conversational context, such as earlier turns in a dialogue. The model is open source and suitable for research and educational purposes.
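
To make the role of contextual information more concrete, here is a minimal sketch of how prior conversation turns might be bundled with the new text before being handed to a context-aware speech model. It is not the CSM API: the names `Turn`, `GenerationRequest`, `build_request`, and `max_turns`, as well as the 24 kHz sample rate, are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed sample rate for this sketch; check the model's documentation
# for the rate it actually expects.
SAMPLE_RATE_HZ = 24_000


@dataclass
class Turn:
    """One earlier utterance in the conversation (hypothetical structure)."""
    speaker: int
    text: str
    audio: Optional[List[float]] = None  # mono samples, if audio is available


@dataclass
class GenerationRequest:
    """The new text to speak plus the context it should be conditioned on."""
    speaker: int
    text: str
    context: List[Turn] = field(default_factory=list)

    def context_seconds(self) -> float:
        """Total duration of the audio supplied as context."""
        samples = sum(len(t.audio) for t in self.context if t.audio is not None)
        return samples / SAMPLE_RATE_HZ


def build_request(history: List[Turn], speaker: int, text: str,
                  max_turns: int = 4) -> GenerationRequest:
    """Keep only the most recent turns so the context stays bounded."""
    recent = history[-max_turns:] if max_turns > 0 else []
    return GenerationRequest(speaker=speaker, text=text, context=list(recent))
```

A caller would append each finished utterance to `history` and build a fresh request per reply, so the model always sees the latest exchange. Capping the context with `max_turns` is one simple design choice; a limit based on total audio duration would work just as well.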