fixie-ai/ultravox-v0_4_1-llama-3_1-8b is a multimodal speech large language model built on the pre-trained Llama3.1-8B-Instruct and whisper-large-v3-turbo models. It accepts speech and text as input and generates text as output. The input is a text prompt containing a special <|audio|> pseudo-token, which the model's processor replaces with embeddings derived from the input audio; the model then generates output text from the combined sequence. Future versions plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce speech output. The model is reported to perform well on speech translation evaluations. No preference tuning has been applied. Suitable scenarios include voice agents, speech-to-speech translation, and analysis of spoken audio.
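
The role of the <|audio|> pseudo-token can be illustrated with a toy sketch. This is not the model's actual processor code: the function names `splice_audio_embeddings` and `toy_text_embed`, the two-dimensional embeddings, and the whitespace-style tokens are all assumptions made for illustration. It only shows the general idea of replacing a placeholder token with a run of audio-derived embedding vectors before the sequence reaches the language model.

```python
from typing import Callable, List, Sequence

AUDIO_TOKEN = "<|audio|>"  # placeholder token named in the model description

Vector = List[float]


def splice_audio_embeddings(
    tokens: Sequence[str],
    text_embed: Callable[[str], Vector],
    audio_embeddings: Sequence[Vector],
) -> List[Vector]:
    """Build one embedding sequence, swapping the placeholder for audio frames.

    Hypothetical helper: the real processor works on token IDs and tensors.
    """
    if list(tokens).count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one audio placeholder in the prompt")

    merged: List[Vector] = []
    for tok in tokens:
        if tok == AUDIO_TOKEN:
            # One placeholder expands into many audio embedding vectors.
            merged.extend(list(frame) for frame in audio_embeddings)
        else:
            merged.append(text_embed(tok))
    return merged


def toy_text_embed(token: str) -> Vector:
    # Stand-in for a learned embedding table (assumption, for illustration only).
    return [float(len(token)), float(ord(token[0]))]


tokens = ["Translate", "this", ":", AUDIO_TOKEN]
audio_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # pretend encoder output
merged = splice_audio_embeddings(tokens, toy_text_embed, audio_frames)
print(len(merged))  # 3 text vectors + 3 audio vectors
```

In this sketch the language model would see a single sequence of vectors and would not need to know which positions came from text and which from audio, which is why one placeholder in the prompt is enough to mark where the audio belongs.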