
Real-time Zero-Shot Voice Conversion with Streamable Context-Aware Language Modeling

StreamVoice is a language model-based zero-shot voice conversion model that converts speech in real time without requiring the complete source utterance. It combines a fully causal, context-aware language model with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which removes the dependency on the complete source speech. To mitigate the performance degradation that incomplete context can cause in streaming, StreamVoice uses two strategies to strengthen the language model's context-awareness: 1) teacher-guided context foresight, in which a teacher model summarizes the current and future semantic context during training and guides the model to predict the missing context; 2) a semantic masking strategy, which encourages acoustic prediction from preceding corrupted semantic and acoustic inputs and so improves the model's context-learning ability. Notably, StreamVoice is the first language model-based streaming zero-shot voice conversion model that does not require any future look-ahead. Experimental results show that StreamVoice achieves streaming conversion while maintaining zero-shot performance comparable to that of non-streaming voice conversion systems.
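
The alternating, look-ahead-free processing described above can be illustrated with a minimal Python sketch. It is not StreamVoice's implementation: the function names, the `MASK` id, the one-acoustic-token-per-frame simplification, and the `predict_acoustic` callback are all assumptions made for illustration. The sketch shows only the data flow: semantic and acoustic tokens are interleaved, semantic tokens can be randomly corrupted during training, and at inference each acoustic token is predicted from past tokens only.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence

MASK = -1  # hypothetical id marking a corrupted (masked) semantic token


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[int]:
    """Build the alternating sequence [s1, a1, s2, a2, ...]."""
    if len(semantic) != len(acoustic):
        raise ValueError("expected one acoustic token per semantic token")
    out: List[int] = []
    for s, a in zip(semantic, acoustic):
        out.extend((s, a))
    return out


def mask_semantic(semantic: Sequence[int], ratio: float,
                  rng: random.Random) -> List[int]:
    """Training-time corruption: replace each semantic token with MASK
    with probability `ratio`."""
    return [MASK if rng.random() < ratio else s for s in semantic]


def stream_convert(semantic_stream: Iterable[int],
                   predict_acoustic: Callable[[List[int]], int]) -> Iterator[int]:
    """Causal loop: for every incoming semantic token, predict one acoustic
    token from the interleaved history so far. No future tokens are read."""
    history: List[int] = []
    for s in semantic_stream:
        history.append(s)              # newest semantic token
        a = predict_acoustic(history)  # stand-in for the language model
        history.append(a)              # prediction becomes part of the context
        yield a


if __name__ == "__main__":
    # Dummy predictor standing in for the real model (assumption).
    for token in stream_convert([1, 2, 3], lambda h: h[-1] * 10):
        print(token)
```

Because `stream_convert` is a generator, output is produced as soon as each input token arrives, which is the property that makes streaming possible in this kind of design.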
