Spirit LM

Multimodal language model that integrates text and speech

Common Product, Productivity, Multimodal, Language Model

Spirit LM is a foundation multimodal language model that freely mixes text and speech. It builds on a 7B pretrained text language model and extends it to the speech modality by continuing training on both text and speech units. Speech and text sequences are concatenated into a single token stream, and the model is trained with a word-level interleaving method on a small, automatically curated speech-text parallel corpus. Spirit LM comes in two versions: the Base version uses phonetic speech units (HuBERT), while the Expressive version adds pitch and style units to model expressivity. In both versions, text is encoded with subword BPE tokens. The resulting model combines the semantic abilities of text models with the expressive abilities of speech models. Furthermore, we demonstrate that Spirit LM can learn new tasks across modalities in a few-shot fashion (e.g., ASR, TTS, speech classification).
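To make the word-level interleaving idea concrete, here is a minimal Python sketch. It is not the model's actual data pipeline: the marker strings `[TEXT]` and `[SPEECH]`, the `[Hu<n>]` unit format, the `switch_prob` parameter, and the `interleave` helper are all assumptions made for illustration, and whole words stand in for subword BPE tokens.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical modality markers; the description above does not name the real ones.
TEXT_TAG = "[TEXT]"
SPEECH_TAG = "[SPEECH]"


def interleave(
    aligned: List[Tuple[str, List[int]]],
    switch_prob: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Build one token stream from word-aligned (word, speech_units) pairs.

    The modality may change only at word boundaries. Each word is emitted
    either as text or as its speech units, never both.
    """
    rng = random.Random(seed)
    stream: List[str] = []
    modality: Optional[str] = None
    for word, units in aligned:
        if modality is None:
            # Pick the starting modality at random.
            modality = rng.choice(["text", "speech"])
            changed = True
        elif rng.random() < switch_prob:
            # Switch modality at this word boundary.
            modality = "speech" if modality == "text" else "text"
            changed = True
        else:
            changed = False
        if changed:
            stream.append(TEXT_TAG if modality == "text" else SPEECH_TAG)
        if modality == "text":
            stream.append(word)  # stand-in for subword BPE tokens
        else:
            stream.extend(f"[Hu{u}]" for u in units)  # assumed unit format
    return stream


if __name__ == "__main__":
    # Toy alignment: unit IDs are made up for the example.
    aligned = [("the", [12, 12, 7]), ("cat", [44, 3]), ("sat", [9, 9, 21])]
    print(interleave(aligned, seed=1))
```

Because both modalities end up in one flat sequence, a single next-token objective can in principle cover text-only, speech-only, and mixed spans; the Expressive version would additionally interleave pitch and style units, which this sketch leaves out.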

Spirit LM Alternatives

ultravox-v0_4_1-llama-3_1-8b — Multimodal speech large language model

Spirit LM — Multimodal language model that integrates text and speech

ultravox-v0_4_1-llama-3_1-70b — Multimodal speech large language model

ultravox-v0_4_1-mistral-nemo — Multimodal Speech Large Language Model

Tencent Cloud Speech Recognition ASR — Convert speech to text with support for real-time speech recognition, recording file recognition, and more.

SpeechGPT — Multimodal Language Model

EMOVA — Emotionally Rich Multimodal Language Model

Phi-4-multimodal-instruct — Phi-4-multimodal-instruct is a lightweight, multimodal foundational model developed by Microsoft, supporting text, image, and audio inputs.

speech-to-speech — Open-source speech-to-speech conversion module

Whisper — General-purpose Speech Recognition Model

SenseVoiceSmall — Multi-language high-precision speech recognition model

MiniCPM-o-2_6 — MiniCPM-o 2.6 is a powerful multimodal large language model designed for visual, speech, and multimodal live applications.

imp-v1-3b — A powerful multimodal small language model.

MouSi — Multimodal Visual Language Model

Llama3-s v0.2 — Latest multimodal checkpoint to enhance speech comprehension capabilities.

Seed-ASR — Speech recognition technology based on large language models.

InternVL2_5-2B-MPO — Advanced multimodal large language model

WhisperKit — Automatic Speech Recognition Model Compression & Optimization Tool

NVLM-D-72B — State-of-the-art multimodal large language model

Whisper large-v3-turbo — Efficient automatic speech recognition model

InternVL2_5-1B — A large multimodal language model that supports image and text understanding.

whisper-ner-v1 — An advanced model for joint speech transcription and entity recognition.

InternVL2_5-38B — Advanced Multimodal Large Language Model Series

mPLUG-DocOwl — A modular multimodal large language model for document understanding

MNN Large Model Android App — A fully functional Android app supporting multimodal capabilities with a large language model.

TinyGPT-V — Efficient multimodal large language model

NVLM 1.0 — Cutting-edge multimodal large language model

MiniGemini — A multimodal large language model capable of understanding and generating images

Pixtral-Large-Instruct-2411 — A 124B-parameter multimodal large language model.

OmniAudio-2.6B — The fastest edge-deployed audio language model in the world.
