Recently, the X-LANCE Lab at Shanghai Jiao Tong University and ByteDance jointly introduced a new interactive speech model named LSLM. According to the team, the model can listen and speak at the same time, delivering a conversational experience that closely mimics natural human dialogue.

LSLM, nicknamed "Little L," addresses the limitations of existing speech models in real-time interaction, noise robustness, and handling of unseen speakers. It is an end-to-end design with separate auditory and vocal channels: a decoder-only TTS generates speech, while a streaming self-supervised learning (SSL) encoder processes incoming audio in real time.
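The two-channel idea above can be illustrated with a toy sketch: at each generation step, an embedding from the listening (auditory) channel is fused with the speaking-channel token embedding before the decoder predicts the next token. This is a minimal, hypothetical illustration with random stand-in weights, not the paper's actual architecture; the real model uses trained networks and studies several fusion strategies.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 16, 8  # toy sizes; the real model operates on discrete speech tokens

# Hypothetical stand-in weights for a trained decoder.
speak_emb = rng.normal(size=(VOCAB, DIM))   # speaking channel: token embeddings
out_proj = rng.normal(size=(DIM, VOCAB))    # projection back to token logits

def decode_step(prev_token, listen_vec):
    """One generation step with simple additive fusion: the streaming
    listening embedding is summed into the speaking-token embedding
    before the decoder body (collapsed here to one projection)."""
    fused = speak_emb[prev_token] + listen_vec  # fuse auditory + vocal channels
    logits = fused @ out_proj
    return int(np.argmax(logits))

# Simulate a short stream: the listening channel updates every step.
token = 0
for _ in range(5):
    listen_vec = rng.normal(size=DIM)  # stand-in for streaming SSL features
    token = decode_step(token, listen_vec)
print(0 <= token < VOCAB)
```

The point of the sketch is only that generation is conditioned on live audio at every step, rather than waiting for the user's turn to end.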

"Little L" boasts unique features: full-duplex modeling (FDM), enabling simultaneous listening and speaking, allowing interruptions and alternations in conversations; strong noise resistance, maintaining stability in noisy environments and adapting to various real-world scenarios; and sensitivity to unknown speakers, capable of identifying and responding to new voices and commands, accommodating different users.

Project details: https://ziyang.tech/LSLM/

Paper: https://arxiv.org/abs/2408.02622