ultravox-v0_4_1-mistral-nemo is a multimodal speech large language model (LLM) built on the pre-trained Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. The model accepts speech and text in the same input, for example a text system prompt followed by a spoken user message. The input is a text prompt containing a special <|audio|> pseudo-token; the model replaces this token with embeddings derived from the input audio and then generates output text. A future revision is planned to expand the token vocabulary so the model can generate semantic and acoustic audio tokens, which can then be fed into a vocoder to produce speech output. The model is developed by Fixie.ai and is licensed under MIT.
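
The role of the <|audio|> pseudo-token can be illustrated with a small toy sketch. This is not the Ultravox implementation: `embed_text` and `splice_audio` are hypothetical helpers, the vectors are made-up numbers, and a real model would use its tokenizer, its embedding table, and audio encoder outputs instead. The sketch only shows the idea of substituting audio-derived vectors at the placeholder position in the prompt.

```python
AUDIO_TOKEN = "<|audio|>"


def embed_text(text, dim=4):
    """Hypothetical stand-in for a text embedding lookup.

    Produces one fake vector per whitespace-separated word; a real model
    would tokenize the text and look up learned embeddings.
    """
    return [[float(len(word))] * dim for word in text.split()]


def splice_audio(prompt, audio_embeddings, dim=4):
    """Replace the single <|audio|> placeholder with audio-derived vectors."""
    if prompt.count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one <|audio|> placeholder")
    before, after = prompt.split(AUDIO_TOKEN)
    # Text before the placeholder, then the audio vectors, then the remaining text.
    return embed_text(before, dim) + list(audio_embeddings) + embed_text(after, dim)


# Three fake audio frames standing in for audio encoder output (assumption).
audio = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
sequence = splice_audio("Transcribe this: <|audio|> Thanks.", audio)
print(len(sequence))  # 2 text vectors + 3 audio vectors + 1 text vector = 6
```

The resulting sequence is what a language model backbone would consume in this simplified picture: text and audio share one embedding sequence, so a text system prompt and a spoken user message can sit in the same input.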