Meta AI has recently open-sourced SPIRIT LM, a foundational multimodal language model that can freely mix text and speech, opening up new possibilities for tasks that span audio and text.

SPIRIT LM is built on a pre-trained 7-billion-parameter text language model and is extended into the speech modality through continued training on text and speech units. It can understand and generate text like a large text model, and it can also understand and generate speech, or freely mix the two. For example, it can perform speech recognition (converting speech to text), speech synthesis (converting text to speech), and speech classification (for instance, identifying the emotion expressed in an utterance).
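To make the "one model, many tasks" idea concrete, here is a minimal, purely hypothetical sketch of how such tasks could all be framed as prompting a single speech/text model. The `SpeechTextLM` class and its `generate` method are illustrative placeholders, not the official SPIRIT LM API.

```python
# Hypothetical sketch: one speech/text model, several tasks via prompting.
# `SpeechTextLM` and `generate` are placeholders, not the real SPIRIT LM API.

class SpeechTextLM:
    """Stand-in for a model that accepts interleaved speech/text prompts."""

    def generate(self, prompt, output_modality):
        # A real model would encode any audio into discrete speech units,
        # run the language model, then decode the requested modality.
        return f"<{output_modality} continuation of {len(prompt)}-part prompt>"


model = SpeechTextLM()

# Speech recognition: prompt with audio, ask for text.
transcript = model.generate(["utterance.wav"], output_modality="text")

# Speech synthesis: prompt with text, ask for speech.
waveform = model.generate(["Hello, world."], output_modality="speech")

# Emotion classification: prompt with audio plus an instruction, ask for text.
label = model.generate(
    ["angry_clip.wav", "The emotion in this clip is"], output_modality="text"
)

print(transcript, waveform, label, sep="\n")
```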

What's even more impressive is that SPIRIT LM is particularly good at expressive speech. It can recognize and generate a variety of tones and styles, making its voice sound more natural and emotive: instead of a cold, robotic sound, it comes across like a real person speaking, full of emotion.

To enhance the AI's ability to "express emotions", Meta's researchers have developed two versions of SPIRIT LM:

"Base Version" (BASE): This version mainly focuses on the phonetic information of speech, which is the "basic composition" of speech.

"Expressive Version" (EXPRESSIVE): This version includes not only phonetic information but also tone and style information, allowing the AI's voice to be more vivid and expressive.

So, how does SPIRIT LM achieve all of this?

In simple terms, SPIRIT LM is trained on top of Meta's previously released text model, LLAMA2. Researchers continued training it on large amounts of text and speech data using an "interleaving" method: training sequences alternate between spans of text tokens and spans of speech units, switching at word boundaries in aligned speech-text data, so the model learns the patterns of both modalities at once.
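A minimal sketch of what building such an interleaved training sequence could look like, assuming word-aligned text and speech units are already available. The modality markers, switching probability, and unit names here are simplified placeholders, not the exact scheme used in the paper.

```python
# Minimal sketch of interleaved sequence construction from aligned data.
# Markers, switching rule, and unit names are illustrative assumptions.

import random


def interleave(words, speech_units_per_word, switch_prob=0.3):
    """Build one training sequence that alternates text and speech spans."""
    tokens, in_speech = ["[TEXT]"], False
    for word, units in zip(words, speech_units_per_word):
        # Randomly switch modality at word boundaries.
        if random.random() < switch_prob:
            in_speech = not in_speech
            tokens.append("[SPEECH]" if in_speech else "[TEXT]")
        tokens.extend(units if in_speech else [word])
    return tokens


words = ["the", "cat", "sat", "down"]
units = [["u12", "u7"], ["u88", "u3"], ["u41"], ["u9", "u9", "u5"]]
print(interleave(words, units))
```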

To test SPIRIT LM's ability to preserve emotion, Meta's researchers designed a new benchmark called the Speech-Text Sentiment Preservation benchmark (STSP). It contains speech and text prompts expressing different sentiments and checks whether the model's continuation, in either modality, carries the same sentiment as the prompt. The results show that the Expressive version of SPIRIT LM performs well on sentiment preservation, making it the first language model able to preserve sentiment across text and speech.
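As a hedged sketch of how such an evaluation could be scored: generate a continuation from an emotionally marked prompt, classify the continuation's sentiment in the output modality, and count a hit when the label matches the prompt's label. The generator and classifier below are toy stand-ins, not the benchmark's actual components.

```python
# Toy scoring sketch for sentiment preservation; stand-in generate/classify.

def sentiment_preservation_accuracy(prompts, generate, classify):
    """prompts: list of (prompt, gold_label, target_modality) tuples."""
    hits = 0
    for prompt, gold_label, target_modality in prompts:
        continuation = generate(prompt, target_modality)
        if classify(continuation, target_modality) == gold_label:
            hits += 1
    return hits / len(prompts)


# Dummy inputs and stand-in model/classifier so the sketch runs end to end.
demo_prompts = [
    ("happy speech clip", "positive", "text"),
    ("sad written prompt", "negative", "speech"),
]
accuracy = sentiment_preservation_accuracy(
    demo_prompts,
    generate=lambda p, m: f"{m} continuation echoing '{p}'",
    classify=lambda c, m: "positive" if "happy" in c else "negative",
)
print(f"sentiment preservation: {accuracy:.0%}")
```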

Of course, Meta's researchers acknowledge that SPIRIT LM still has room to improve. For instance, it currently supports only English and will need to be extended to other languages, and the model is still relatively small, so scaling it up should further improve performance.

SPIRIT LM is a significant breakthrough for Meta in the field of AI, opening the door to a world of "emotionally expressive" AI. We believe that in the near future, we will see more interesting applications developed based on SPIRIT LM, allowing AI not only to speak but also to express emotions like a real person, facilitating more natural and friendly interactions with us!

Project Address: https://speechbot.github.io/spiritlm/

Paper Address: https://arxiv.org/pdf/2402.05755