In the realm of artificial intelligence, large language models (LLMs) have become the driving force behind natural language processing (NLP) tasks. However, these models are still far from truly comprehending and generating cross-modal content such as speech and text. In their paper "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities", a research team from Fudan University proposes an innovative solution: SpeechGPT.


SpeechGPT is a novel large language model that not only understands both speech and text but also moves seamlessly between the two. The core of the approach is to discretize continuous speech signals into unit sequences that can be aligned with the text modality, enabling a single model to both perceive and generate speech.
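As a rough illustration, speech discretization is commonly done by clustering acoustic features (e.g. self-supervised representations) and replacing each frame with the index of its nearest cluster, then merging consecutive duplicates. The codebook and feature values below are toy stand-ins, not the paper's actual units:

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    # frames: (T, D) acoustic features; codebook: (K, D) cluster centroids.
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def collapse_repeats(units):
    """Merge consecutive duplicate units into a compact unit sequence."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(int(u))
    return out

# Toy demo: 2-D features and a codebook of 3 "units".
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.1], [1.1, 0.9], [2.1, 0.1]])
units = quantize_frames(frames, codebook)
print(collapse_repeats(units))  # -> [0, 1, 2]
```

The resulting discrete units can then be treated like extra vocabulary tokens, which is what lets a text LLM consume and emit speech.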

What sets SpeechGPT apart is its ability to perceive and express emotion, producing speech responses in diverse styles based on context and human instructions. Whether the requested style is rap, drama, robotic, humorous, or whispering, SpeechGPT can generate an appropriate voice. This versatility comes from its training data: over 100,000 hours of academic and field-collected speech covering a rich array of scenarios and styles.

To train SpeechGPT, the research team adopted a three-stage training strategy:

1. Modality Adaptation Pretraining: the model is trained on a large amount of unlabeled speech data to predict the next discrete unit, adapting it to the speech modality.

2. Cross-Modal Instruction Tuning: using the SpeechInstruct dataset, which contains instructions for a variety of tasks, the model learns to understand and execute cross-modal instructions.

3. Chain-of-Modality Instruction Tuning: the model is further fine-tuned to transition smoothly between modalities, for example decomposing a spoken instruction into intermediate text before producing a spoken response.
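The three stages above mainly differ in how each training example is serialized into one token sequence. The sketch below captures that idea; the special tokens (`<sosp>`/`<eosp>`, unit tokens like `<42>`) and the `[Human]`/`[SpeechGPT]` templates follow the spirit of the paper, but their exact form here is an assumption:

```python
def units_to_tokens(units):
    """Render discrete speech units as pseudo-text tokens, e.g. [8, 42] -> '<sosp><8><42><eosp>'."""
    return "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"

def stage1_example(units):
    """Stage 1 (modality adaptation): plain next-token prediction over unit sequences."""
    return units_to_tokens(units)

def stage2_example(units, transcript):
    """Stage 2 (cross-modal instruction tuning): e.g. a speech-to-text pair."""
    return (f"[Human]: Transcribe this speech. {units_to_tokens(units)} "
            f"[SpeechGPT]: {transcript}")

def stage3_example(units, transcript, reply, reply_units):
    """Stage 3 (chain-of-modality): speech in -> transcript -> text reply -> speech out."""
    return (f"[Human]: {units_to_tokens(units)} "
            f"[SpeechGPT]: transcript: {transcript}; reply: {reply}; "
            f"speech: {units_to_tokens(reply_units)}")

print(stage2_example([8, 42], "hello"))
```

Because every stage reduces to ordinary next-token prediction over one shared vocabulary, the same decoder-only LLM can be reused unchanged across all three.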

To support the training of SpeechGPT, the research team constructed the first large-scale cross-modal speech instruction dataset, SpeechInstruct. This dataset includes cross-modal instruction data and modality chain instruction data, covering multiple task types.
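One entry of such a dataset can be pictured as a task instruction paired with unit-encoded speech and a target output. The field names below are purely illustrative and are not SpeechInstruct's actual schema:

```python
import json

# Hypothetical shape of one cross-modal instruction example; the real
# SpeechInstruct schema may differ.
example = {
    "task": "speech-to-text",
    "instruction": "Transcribe the following speech into text.",
    "input_units": "<sosp><8><42><17><eosp>",  # discretized speech
    "output": "good morning",
}

print(json.dumps(example, indent=2))
```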

Experimental results show that SpeechGPT exhibits strong capabilities in text tasks, cross-modal tasks, and spoken dialogue tasks. It can accurately understand and execute various instructions, whether converting speech to text, text to speech, or engaging in spoken dialogue.

It is worth noting that while SpeechGPT demonstrates strong abilities, some shortcomings remain: its speech understanding is not yet robust to noise, and the quality of its generated speech can be unstable. The authors attribute these limitations mainly to constrained computational and data resources. SpeechGPT is still under development, and the team plans to open-source the technical report, code, and model weights so that the broader research community can participate in refining the technology.

Project page: https://0nutation.github.io/SpeechGPT.github.io/