French startup Gladia has raised $16 million in a Series A funding round for its speech recognition application programming interface (API). In essence, Gladia's API can convert any audio file into text with high accuracy and low latency.
Although Amazon, Microsoft, and Google offer speech-to-text APIs as part of their cloud hosting product suites, their performance does not match that of some of the newer models offered by specialized startups. The field has made significant strides in recent years, especially since OpenAI released its Whisper model. Gladia competes with well-funded companies such as AssemblyAI, Deepgram, and Speechmatics.
Gladia initially offered a fine-tuned version of the Whisper speech-to-text model, with some much-needed improvements. For example, the startup supports speaker separation (diarization) out of the box: it can detect when there are multiple speakers in a conversation and split the recording and the transcript according to who is speaking.
Gladia supports 100 languages and a variety of accents. In our experience the tool works well: we have been using Gladia to transcribe some interviews, and accents have not been an issue.
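To make the idea of speaker-separated output concrete, here is a minimal Python sketch of how an application might merge labelled transcript segments into speaker turns. The `Segment` type and the `group_by_speaker` helper are hypothetical names used for illustration, not Gladia's schema or API, and the sketch assumes the transcription step has already tagged each segment with a speaker label.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span of audio (hypothetical shape, not Gladia's schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_by_speaker(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into single turns."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Same speaker is still talking: extend the current turn.
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            # Speaker changed (or first segment): start a new turn.
            turns.append(seg)
    return turns


segments = [
    Segment("A", 0.0, 1.5, "Hi, thanks for joining."),
    Segment("A", 1.5, 3.0, "Shall we start?"),
    Segment("B", 3.2, 4.0, "Sure."),
]
for turn in group_by_speaker(segments):
    print(f"[{turn.speaker} {turn.start:.1f}-{turn.end:.1f}s] {turn.text}")
```

A meeting recorder or note-taking assistant could render turns like these as a readable, speaker-attributed transcript.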
The startup offers its speech-to-text model as a hosted API, which users can integrate into their own applications and services. Over 600 companies use Gladia, including several meeting recorders and note-taking assistants like Attention, Circleback, Method Financial, Recall, Sana, and Veed.io.
This particular use case is interesting because many companies must chain API calls. They first convert speech to text, then feed the text into a large language model (LLM) such as GPT-4o or Claude 3.5 Sonnet to extract knowledge from large amounts of text.
With the new funds, Gladia hopes to streamline this process by combining audio intelligence and LLM-based tasks in a single API call. For example, a customer could get a conversation summary generated as a few bullet points without relying on a third-party LLM API.
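The chained workflow described above can be sketched in a few lines of Python. This is an illustration only: `transcribe_then_extract`, `fake_transcribe`, and `fake_extract` are hypothetical names, and the two stubs stand in for real network calls to a speech-to-text service and an LLM so that the example runs offline.

```python
from typing import Callable


def transcribe_then_extract(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract: Callable[[str], str],
) -> str:
    """Chain two independent services: speech-to-text, then an LLM step."""
    transcript = transcribe(audio)  # first API call (speech-to-text)
    return extract(transcript)      # second API call (LLM)


# Stand-ins for real network calls so the sketch runs offline.
def fake_transcribe(audio: bytes) -> str:
    return "We agreed to ship on Friday. Dana will write the release notes."


def fake_extract(transcript: str) -> str:
    # A real LLM would summarize; this stub just lists the sentences as bullets.
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return "\n".join(f"- {s}" for s in sentences)


print(transcribe_then_extract(b"\x00\x01", fake_transcribe, fake_extract))
```

Each stage is a separate round trip with its own credentials, billing, and failure modes, which is the overhead a combined call would remove.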
Another issue Gladia aims to address is latency. You may have seen demonstrations of real-time audio conversations with AI-based call agents (11x has a good demo on its website). These agents must transcribe speech in real time for the conversation to sound as human as possible.
Gladia has chosen to tackle this problem and can currently transcribe real-time conversations with a latency of less than 300 milliseconds. The company claims that its real-time transcription is now as good as its default asynchronous batch transcription API, though that is hard to judge without proper testing. As co-founder and CEO Jean-Louis Quéguiner told TechCrunch, the startup's goal is "batch quality with real-time capability."
In addition to AI call agents, it is easy to imagine call centers using these real-time features to help human agents find relevant information during a call. "Our single API is compatible with all existing technology stacks and protocols, including SIP, VoIP, FreeSwitch, and Asterisk," said co-founder and CTO Jonathan Soto in a statement.
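For readers who want to check a latency figure like this against their own setup, here is a minimal Python sketch. It assumes latency is measured as the time between sending an audio chunk and receiving its transcript, which is one possible definition and not necessarily the one Gladia uses. The helper names and sample timestamps are hypothetical.

```python
LATENCY_BUDGET_MS = 300  # the figure quoted in the article


def latencies_ms(events: list[tuple[float, float]]) -> list[float]:
    """Convert (sent_at, received_at) pairs, in seconds, to latencies in ms."""
    return [(received - sent) * 1000 for sent, received in events]


def within_budget(
    events: list[tuple[float, float]],
    budget_ms: float = LATENCY_BUDGET_MS,
) -> bool:
    """True if every chunk's transcript arrived in under the budget."""
    return max(latencies_ms(events)) < budget_ms


# Hypothetical timestamps: (chunk sent, transcript received), in seconds.
events = [(0.0, 0.25), (1.0, 1.125), (2.0, 2.25)]
print(latencies_ms(events))
print(within_budget(events))
```

Checking the worst case rather than the average matters for voice agents, since a single slow response is enough to make a conversation feel unnatural.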
XAnge led the Series A round. Illuminate Financial, XTX Ventures, Athletico Ventures, Gaingels, Mana Ventures, Motier Ventures, Roosh Ventures, and Soma Capital also participated.
Gladia believes we are on the cusp of a "ChatGPT moment" for audio applications. GPT technology has been around for years, but it was ChatGPT that popularized LLMs through its consumer-friendly chat interface.
As Apple and Google begin to include transcription models in iOS and Android, consumers will start to understand the value of automatic transcription in the applications they use. Developers may then integrate audio features into their own products, which is where API providers like Gladia come in.