ElevenLabs, a prominent AI voice cloning and generation startup, recently launched its latest speech-to-text model, Scribe v1. The company claims the model achieves the highest accuracy across multiple languages, and users can try it on the company's website.

According to ElevenLabs' benchmarks, Scribe transcribes speech more accurately than Google's Gemini 2.0 Flash, OpenAI's Whisper v3, and Deepgram Nova-3, with an error rate the company describes as unprecedentedly low. The company states that Scribe supports high-precision transcription in 99 languages, including previously underserved languages such as Serbian, Cantonese, and Malayalam.

ElevenLabs' Chief Research Scientist, Flavio Schneider, announced on X (formerly Twitter) that Scribe is the company's "most intelligent audio understanding model" to date. He emphasized that Scribe is more than a transcription tool: it understands audio content, detects non-speech events (such as laughter, sound effects, music, and background noise), and accurately distinguishes speakers in long recordings made in complex environments. Notably, Scribe can identify and separate up to 32 different speakers within a single audio file.

ElevenLabs advises that Scribe is "best suited for scenarios requiring high-accuracy transcription, rather than real-time transcription." The company also plans to release a low-latency version to expand its use in real-time applications.

On the FLEURS and Common Voice benchmarks, Scribe handles real-world audio challenges well and achieves the lowest word error rates among the models compared, with reported accuracy of 98.7% in Italian and 96.7% in English.
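
For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below is a generic illustration of that definition, not ElevenLabs' evaluation code; the function name and sample sentences are made up for the example. If the accuracy figures above are simply 100% minus WER (an assumption, since the article does not say how they were derived), 98.7% accuracy would correspond to a WER of about 1.3%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real benchmarks usually normalize casing and
    punctuation first, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: one of six reference words is dropped.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}, accuracy under the 1 - WER assumption: {1 - wer:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when a transcript is much longer than the reference, which is one reason a reported "accuracy" should be read alongside the method used to compute it.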

Scribe is now available via the ElevenLabs website and API, priced at $0.40 per hour of input audio, with a 50% discount for the next six weeks.
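
To put the pricing in concrete terms, the following sketch estimates transcription spend from the figures quoted above. The function and constant names are hypothetical, and the calculation assumes cost scales linearly with audio duration; the article does not describe billing granularity, minimum charges, or plan tiers, so treat the result as a rough estimate only.

```python
from decimal import Decimal

# Figures quoted in the article; names are illustrative, not an official API.
LIST_PRICE_PER_HOUR = Decimal("0.40")  # USD per hour of input audio
PROMO_DISCOUNT = Decimal("0.50")       # 50% launch discount


def estimate_cost(audio_hours, discounted=False):
    """Estimate cost in USD, assuming simple linear per-hour billing."""
    rate = LIST_PRICE_PER_HOUR
    if discounted:
        rate = rate * (1 - PROMO_DISCOUNT)
    # Convert via str so float inputs such as 2.5 do not carry binary noise.
    total = Decimal(str(audio_hours)) * rate
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    hours = 100  # hypothetical monthly archive of meeting recordings
    print("list price:", estimate_cost(hours))
    print("with launch discount:", estimate_cost(hours, discounted=True))
```

Under these assumptions, 100 hours of audio would cost $40.00 at the list price and $20.00 during the discount period.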

For enterprise decision-makers, Scribe provides a scalable tool for high-accuracy transcription, suitable for industries needing automated documentation, meeting transcription, and content accessibility. Its high-precision handling of multiple languages will also benefit multinational corporations, media companies, and customer support applications.

Scribe's release coincided with the launch of Hume's text-to-speech model, Octave. Octave, a text-to-speech tool built on a large language model, lets users customize AI-generated voices based on emotional needs and is intended for content creation such as audiobooks, podcasts, and video game voiceovers. Although Scribe and Octave serve different functions, their simultaneous release reflects the increasingly fierce competition in AI-driven audio models.

Product Link: https://elevenlabs.io/blog/meet-scribe

Key Highlights:

🌟 Scribe v1 is ElevenLabs' latest speech-to-text model, achieving record-high accuracy across multiple languages.

🗣️ Supports 99 languages, can distinguish up to 32 different speakers, and adapts to complex audio environments.

💰 Currently priced at $0.40 per hour, with a 50% discount for the next six weeks; a low-latency version is under development.