Meta AI's latest release, SPIRIT-LM, is a multimodal foundation language model that freely mixes text and speech and can both understand and express emotion, much as humans do.

SPIRIT-LM is built on a pretrained text language model and extended to the speech modality by continually training it on text and speech units. The model maps speech and text sequences into a single token stream and is trained on a small, automatically curated speech-text parallel corpus using a word-level interleaving method.
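To make the interleaving idea concrete, here is a minimal Python sketch, assuming word-aligned speech-text pairs; the token names and the random span-switching logic are illustrative stand-ins, not Meta's exact implementation.

```python
import random

def interleave(words, speech_units_per_word, p_switch=0.3):
    """Build one training sequence that alternates between text spans
    (words standing in for BPE pieces) and speech spans (discrete speech
    units), switching modality at word boundaries with probability p_switch."""
    tokens = []
    in_speech = random.random() < 0.5            # pick a starting modality
    tokens.append("[SPEECH]" if in_speech else "[TEXT]")
    for word, units in zip(words, speech_units_per_word):
        if random.random() < p_switch:           # flip modality at a word boundary
            in_speech = not in_speech
            tokens.append("[SPEECH]" if in_speech else "[TEXT]")
        if in_speech:
            tokens.extend(units)                 # e.g. ["[Hu12]", "[Hu7]"]
        else:
            tokens.append(word)                  # stands in for BPE subwords
    return tokens

# Toy example: 3 words, each aligned to a few HuBERT-style unit tokens.
words = ["the", "cat", "sat"]
units = [["[Hu5]", "[Hu9]"], ["[Hu12]"], ["[Hu3]", "[Hu3]", "[Hu44]"]]
print(" ".join(interleave(words, units)))
```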


SPIRIT-LM comes in two versions:

The Base version (SPIRIT-LM-BASE) uses semantic speech units.

The Expressive version (SPIRIT-LM-EXPRESSIVE) adds pitch and style units on top of the semantic units to model emotional expression.

Both versions encode text with subword BPE tokens; the sketch below illustrates the resulting token streams.
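The following illustrative snippet contrasts the two variants' token streams; the unit prefixes ([Hu] for semantic, [Pi] for pitch, [St] for style) follow the paper's notation, but the specific values and positions are made up.

```python
# BASE: speech is a stream of semantic (HuBERT-style) units only.
base_stream = "[SPEECH][Hu99][Hu38][Hu49][Hu71][Hu12]"

# EXPRESSIVE: pitch [Pi] and style [St] tokens are interleaved with the
# semantic units (pitch/style occur at a lower rate than semantics),
# letting the model condition on and generate prosody and emotion.
expressive_stream = "[SPEECH][St3][Pi5][Hu99][Hu38][Pi7][Hu49][Hu71][Hu12]"

# Text in both variants is ordinary subword BPE, prefixed with [TEXT].
text_stream = "[TEXT]the cat sat"
```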

SPIRIT-LM combines the semantic capabilities of text models with the expressive power of speech models. This lets it perform cross-modal tasks such as speech recognition, text-to-speech conversion, and speech classification from only a few examples, as the prompting sketch below illustrates.
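Here is a hedged sketch of how a few-shot speech-recognition prompt might be assembled: a handful of (speech, transcript) pairs followed by a new speech clip, with the model expected to continue in text. The unit strings and the helper function are hypothetical, not an API from the spiritlm repository.

```python
def build_asr_prompt(examples, query_units):
    """examples: list of (speech_unit_string, transcript) pairs.
    Returns a single prompt string ending in [TEXT], so the model's
    continuation is the transcript of the query clip."""
    parts = []
    for units, transcript in examples:
        parts.append(f"[SPEECH]{units}[TEXT]{transcript}")
    parts.append(f"[SPEECH]{query_units}[TEXT]")  # model continues in text
    return "".join(parts)

few_shot = [
    ("[Hu5][Hu9][Hu12]", "hello world"),
    ("[Hu3][Hu44][Hu7]", "good morning"),
]
prompt = build_asr_prompt(few_shot, "[Hu8][Hu21][Hu2]")
print(prompt)
```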

To evaluate the model's expressive capabilities, the researchers introduced the Speech-Text Sentiment Preservation benchmark (STSP), which measures how well a generative model preserves the sentiment of a spoken or written prompt in both intra-modal and cross-modal settings.

On this benchmark, the Expressive version of SPIRIT-LM is the first language model shown to preserve emotion from both text and speech prompts across intra-modal and cross-modal settings, using its pitch and style tokens to capture the emotional and stylistic aspects of speech. A sketch of the preservation metric follows.
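A minimal sketch of the STSP idea, under the assumption that preservation is scored by checking whether a sentiment classifier assigns the generated continuation the same label as the prompt; generate() and classify() here are dummy stand-ins, not the paper's actual tooling.

```python
def sentiment_preservation_rate(prompts, generate, classify):
    """prompts: list of (prompt, gold_sentiment) pairs. Prompt and
    continuation may each be speech or text, covering the four
    intra-modal and cross-modal directions."""
    preserved = 0
    for prompt, gold in prompts:
        continuation = generate(prompt)
        if classify(continuation) == gold:
            preserved += 1
    return preserved / len(prompts)

# Toy run with dummy callables just to show the call shape.
dummy = [("I'm thrilled about this!", "positive")]
rate = sentiment_preservation_rate(
    dummy,
    generate=lambda p: "That's wonderful news!",
    classify=lambda t: "positive",
)
print(f"preservation rate: {rate:.2f}")
```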


Research findings indicate:

SPIRIT-LM is on par with existing models at lexical, grammatical, and semantic understanding in the speech modality, while retaining strong text generation capabilities.

Interleaved training is key to SPIRIT-LM's performance: it lets the model learn the correspondence between speech and text tokens, yielding better text-to-speech conversion.

Pre-training knowledge is crucial for SPIRIT-LM's few-shot learning ability.

SPIRIT-LM-EXPRESSIVE can capture and generate more expressive speech, outperforming the Base version in emotional expression.

SPIRIT-LM marks a significant step for AI language models, opening new possibilities for multimodal language understanding and generation and laying groundwork for smarter, more human-like AI applications.

Paper link: https://arxiv.org/pdf/2402.05755

Project link: https://github.com/facebookresearch/spiritlm