The X-LANCE Lab at Shanghai Jiao Tong University, in collaboration with ByteDance, has developed LSLM (the Listening-while-Speaking Language Model), a full-duplex speech language model that lets an AI assistant listen and speak at the same time, enabling true real-time interaction.
When you're conversing with an AI assistant and suddenly think of an important question, you don't have to wait for it to finish; you can interrupt and pose a new query immediately. The AI assistant can understand and respond instantly, as naturally and smoothly as a human conversation. This is no longer a scene from a sci-fi movie but has become a reality.
The core advantage of LSLM is its listening-while-speaking capability: the model monitors incoming audio even while it is generating speech, supports real-time voice interaction, and keeps working in noisy environments. It integrates the listening and speaking channels so that voice input is processed and voice output is generated simultaneously.
Traditional speech language models (SLMs) can only take turns: one side speaks, then the other, so they cannot handle the immediate interruptions that occur in real spoken conversation. LSLM addresses this limitation, making human-AI dialogue more natural. It combines a token-based decoder-only text-to-speech (TTS) model for real-time autoregressive speech generation with a streaming self-supervised learning (SSL) encoder for real-time audio input, detecting turn transitions on the fly.
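The interleaving described above can be sketched as a toy decoding loop. This is an illustrative assumption of how a full-duplex step might be wired, not the paper's actual code: `listen_encode`, `speak_step`, and the `<IRQ>` interruption token are hypothetical stand-ins for the streaming SSL encoder, the decoder-only TTS, and the turn-transition signal.

```python
# Toy sketch of a full-duplex decoding loop (illustrative assumptions only:
# function names and the <IRQ> token convention are not from the paper).

IRQ = "<IRQ>"  # hypothetical token marking a detected turn transition


def listen_encode(audio_chunk):
    # Stand-in for the streaming SSL encoder: turns an incoming audio chunk
    # into a listening state. Here we only flag whether speech is present.
    return {"speech_detected": audio_chunk is not None}


def speak_step(step, listening_state):
    # Stand-in for one autoregressive step of the token-based decoder-only TTS.
    # If the listening channel detects an interruption, emit IRQ instead.
    if listening_state["speech_detected"]:
        return IRQ
    return f"tok_{step}"


def full_duplex_generate(input_stream, max_steps=10):
    """Interleave listening and speaking at every step, as full duplex requires."""
    output = []
    for t in range(max_steps):
        chunk = input_stream[t] if t < len(input_stream) else None
        state = listen_encode(chunk)   # listen...
        tok = speak_step(t, state)     # ...while speaking
        output.append(tok)
        if tok == IRQ:                 # the user barged in: stop generating
            break
    return output


# The model speaks normally until the user interrupts at step 3.
print(full_duplex_generate([None, None, None, "user speech"]))
# → ['tok_0', 'tok_1', 'tok_2', '<IRQ>']
```

The key point the loop illustrates is that listening and generation share every time step, so an interruption is noticed immediately rather than after the utterance finishes.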
The research team explored three fusion strategies: early, middle, and late fusion, with middle fusion achieving the best balance between speech generation quality and real-time interaction. In experiments with command-based and voice-based full duplex modeling (FDM), LSLM demonstrated strong robustness to noise and high sensitivity to diverse instructions.
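The three strategies differ only in where the listening features meet the speaking decoder. The toy sketch below makes that difference concrete; the additive merge and the doubling "layer" are illustrative assumptions, not the model's real operations.

```python
# Toy comparison of the three fusion points (early / middle / late).
# The additive merge and the stand-in layer are illustrative assumptions.

def layer(x):
    # Stand-in for one Transformer block of the speaking decoder.
    return [v * 2 for v in x]


def early_fusion(speak, listen, n_layers=3):
    # Fuse once at the input embeddings, then run the decoder alone.
    x = [s + l for s, l in zip(speak, listen)]
    for _ in range(n_layers):
        x = layer(x)
    return x


def middle_fusion(speak, listen, n_layers=3):
    # Re-inject the listening features inside every block; the paper reports
    # this variant as the best balance of speech quality and interactivity.
    x = speak
    for _ in range(n_layers):
        x = layer([s + l for s, l in zip(x, listen)])
    return x


def late_fusion(speak, listen, n_layers=3):
    # Run the decoder alone and fuse only at the output, near the logits.
    x = speak
    for _ in range(n_layers):
        x = layer(x)
    return [s + l for s, l in zip(x, listen)]
```

Running all three on the same inputs (for example `speak=[1.0]`, `listen=[0.5]`) shows that middle fusion lets the listening signal influence every layer of generation, while early and late fusion touch it only once.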
More strikingly, LSLM achieves duplex communication with minimal impact on existing systems. This means it can be integrated into current AI pipelines and significantly improve the user experience without a complete overhaul of the framework.
The application prospects of LSLM are vast. In the future, whether at home, in the office, or public spaces, dialogue systems will be able to interact more naturally with humans in real-time. This will not only change how we communicate with machines but could also reshape the entire landscape of human-machine interaction.
In the technical demonstration, the research team showcased LSLM's advantages by comparing traditional TTS with LSLM in both clean and noisy environments. They also illustrated the evolution of speech language models from simplex and half-duplex to full-duplex interaction, making the significance of this breakthrough more intuitive.
As LSLM technology continues to mature, we have reason to expect that future AI assistants will provide users with richer, smoother, and more human-like interactive experiences. Conversing naturally and coherently with AI may soon be as easy as chatting with a friend.
This research is not only academically significant but also opens up new possibilities for the commercial application of voice interaction technology. The emergence of LSLM marks the beginning of a new era of AI interaction, where the boundaries of human-machine dialogue will become increasingly blurred, and the fusion of technology and humanity will reach new heights.
Project Link: https://top.aibase.com/tool/lslm