The X-LANCE Lab at Shanghai Jiao Tong University, in collaboration with ByteDance, has developed LSLM (the Listening-while-Speaking Language Model), a full-duplex speech language model that lets an AI assistant listen and speak at the same time, enabling true real-time interaction.
When you're conversing with an AI assistant and suddenly think of an important question, you don't have to wait for it to finish; you can interrupt and pose a new query immediately. The AI assistant can understand and respond instantly, as naturally and smoothly as a human conversation. This is no longer a scene from a sci-fi movie but has become a reality.
The core advantage of LSLM is its listening-while-speaking capability: the model monitors incoming audio even while it is generating speech, supports real-time voice interaction, and keeps working in noisy environments. It integrates the listening and speaking channels so that voice input is processed and voice output is generated simultaneously.
Traditional speech language models (SLMs) can only take turns: one side speaks, then the other, so they cannot handle the immediate interruptions that occur in real spoken conversation. LSLM addresses this limitation, making human-AI dialogue more natural. It combines a token-based decoder-only text-to-speech (TTS) model for real-time autoregressive speech generation with a streaming self-supervised learning (SSL) encoder for real-time audio input, detecting turn transitions on the fly.
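The interleaving described above can be sketched as a toy decoding loop. This is an illustrative assumption of how a full-duplex step might be wired, not the paper's actual code: `listen_encode`, `speak_step`, and the `<IRQ>` interruption token are hypothetical stand-ins for the streaming SSL encoder, the decoder-only TTS, and the turn-transition signal.

```python
# Toy sketch of a full-duplex decoding loop (illustrative assumptions only:
# function names and the <IRQ> token convention are not from the paper).

IRQ = "<IRQ>"  # hypothetical token marking a detected turn transition


def listen_encode(audio_chunk):
    # Stand-in for the streaming SSL encoder: turns an incoming audio chunk
    # into a listening state. Here we only flag whether speech is present.
    return {"speech_detected": audio_chunk is not None}


def speak_step(step, listening_state):
    # Stand-in for one autoregressive step of the token-based decoder-only TTS.
    # If the listening channel detects an interruption, emit IRQ instead.
    if listening_state["speech_detected"]:
        return IRQ
    return f"tok_{step}"


def full_duplex_generate(input_stream, max_steps=10):
    """Interleave listening and speaking at every step, as full duplex requires."""
    output = []
    for t in range(max_steps):
        chunk = input_stream[t] if t < len(input_stream) else None
        state = listen_encode(chunk)   # listen...
        tok = speak_step(t, state)     # ...while speaking
        output.append(tok)
        if tok == IRQ:                 # the user barged in: stop generating
            break
    return output


# The model speaks normally until the user interrupts at step 3.
print(full_duplex_generate([None, None, None, "user speech"]))
# → ['tok_0', 'tok_1', 'tok_2', '<IRQ>']
```

The key point the loop illustrates is that listening and generation share every time step, so an interruption is noticed immediately rather than after the utterance finishes.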
The research team explored three fusion strategies: early, middle, and late fusion, with middle fusion achieving the best balance between speech generation quality and real-time interaction. In experiments with command-based and voice-based full duplex modeling (FDM), LSLM demonstrated strong robustness to noise and high sensitivity to diverse instructions.
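The three strategies differ only in where the listening features meet the speaking decoder. The toy sketch below makes that difference concrete; the additive merge and the doubling "layer" are illustrative assumptions, not the model's real operations.

```python
# Toy comparison of the three fusion points (early / middle / late).
# The additive merge and the stand-in layer are illustrative assumptions.

def layer(x):
    # Stand-in for one Transformer block of the speaking decoder.
    return [v * 2 for v in x]


def early_fusion(speak, listen, n_layers=3):
    # Fuse once at the input embeddings, then run the decoder alone.
    x = [s + l for s, l in zip(speak, listen)]
    for _ in range(n_layers):
        x = layer(x)
    return x


def middle_fusion(speak, listen, n_layers=3):
    # Re-inject the listening features inside every block; the paper reports
    # this variant as the best balance of speech quality and interactivity.
    x = speak
    for _ in range(n_layers):
        x = layer([s + l for s, l in zip(x, listen)])
    return x


def late_fusion(speak, listen, n_layers=3):
    # Run the decoder alone and fuse only at the output, near the logits.
    x = speak
    for _ in range(n_layers):
        x = layer(x)
    return [s + l for s, l in zip(x, listen)]
```

Running all three on the same inputs (for example `speak=[1.0]`, `listen=[0.5]`) shows that middle fusion lets the listening signal influence every layer of generation, while early and late fusion touch it only once.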
More strikingly, LSLM achieves duplex communication with minimal impact on existing systems. This means it can be integrated into current AI pipelines and significantly improve the user experience without a complete overhaul of the framework.
The application prospects of LSLM are vast. In the future, whether at home, in the office, or public spaces, dialogue systems will be able to interact more naturally with humans in real-time. This will not only change how we communicate with machines but could also reshape the entire landscape of human-machine interaction.
In the technical demonstration, the research team showcased LSLM's advantages by comparing traditional TTS with LSLM in both clean and noisy environments. They also illustrated the evolution of speech language models from simplex and half-duplex to full-duplex interaction, making the significance of this breakthrough more intuitive.
As LSLM technology continues to mature, we have reason to expect that future AI assistants will provide users with richer, smoother, and more human-like interactive experiences. Conversing naturally and coherently with AI may soon be as easy as chatting with a friend.
This research is not only academically significant but also opens up new possibilities for the commercial application of voice interaction technology. The emergence of LSLM marks the beginning of a new era of AI interaction, where the boundaries of human-machine dialogue will become increasingly blurred, and the fusion of technology and humanity will reach new heights.
Project Link: https://top.aibase.com/tool/lslm