ElevenLabs is launching its own speech-to-text model

ElevenLabs, an AI startup that just raised a $180 million mega funding round, has been primarily known for its audio generation prowess. The company took a step in another technological direction by launching its first standalone speech-to-text model, called Scribe.

The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to get into speech detection and compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.

ElevenLabs’ Scribe model supports over 99 languages at launch. The company places over 25 languages in its excellent accuracy category, where the word error rate is less than 5%. This list includes English (claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages are ranked in different categories: high (5 to 10% word error rate), good (10 to 20% word error rate), and moderate (25 to 50% word error rate).
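
For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below illustrates that standard definition and maps a score onto the bands quoted above. It is not ElevenLabs' evaluation code: the function names are hypothetical, there is no text normalization (casing, punctuation), and the handling of band edges is an assumption, since the quoted ranges share endpoints and leave 20 to 25% unassigned.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_category(wer: float) -> str:
    """Map a WER (as a fraction) onto the bands quoted in the article.

    Assumption: shared endpoints go to the better band, and scores the
    article does not cover are reported as "unclassified".
    """
    if wer < 0.05:
        return "excellent"
    if wer <= 0.10:
        return "high"
    if wer <= 0.20:
        return "good"
    if 0.25 <= wer <= 0.50:
        return "moderate"
    return "unclassified"


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER = {wer:.1%}, category = {accuracy_category(wer)}")
```

Under these assumptions, a transcript that drops one word from a six-word reference scores a word error rate of about 16.7%, which would fall in the "good" band.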

The company said that the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in FLEURS and Common Voice benchmark tests.

ElevenLabs had developed the speech-to-text component for its AI conversational agent platform, which was launched last year. However, this is the first time the company is releasing a standalone speech detection model. In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving speech detection models.

“We want to understand what’s being said by you in a conversation better. We’re working on ways to move away from only generating content and understanding and transcribing speech,” Staniszewski said at the time. “Many people say that speech-to-text is a solved problem. But for many languages, it’s pretty bad. We think we can build better speech detection models because we have in-house teams to annotate data and give us quick feedback.”

The model also offers speaker diarization to tell you who’s speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events like audience laughter. The startup is providing a way for customers to directly transcribe video content to add subtitles or captions in its studio.

Scribe currently only works with pre-recorded audio formats, which means it isn’t yet effective for meeting transcriptions or voice note-taking. The company said it will launch a low-latency real-time version of the model soon.

ElevenLabs is pricing Scribe at $0.40 per hour of transcribed audio. While the rate is competitive, some of its rivals currently offer a lower price for audio transcription, with some feature differentiation.
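
To put the quoted rate in context, the short sketch below estimates what a given amount of audio would cost at $0.40 per hour. It assumes billing scales linearly with duration, with no minimum charge, plan discounts, or per-request rounding; the article only gives the hourly figure, so those details and the function name are assumptions.

```python
from decimal import Decimal

# Hourly rate quoted in the article.
SCRIBE_USD_PER_HOUR = Decimal("0.40")


def transcription_cost_usd(
    audio_seconds: int, usd_per_hour: Decimal = SCRIBE_USD_PER_HOUR
) -> Decimal:
    """Estimate cost in USD, assuming strictly pro-rata billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(audio_seconds) / Decimal(3600)
    # Round to whole cents for display.
    return (hours * usd_per_hour).quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(transcription_cost_usd(90 * 60))     # a 90-minute recording
    print(transcription_cost_usd(250 * 3600))  # a 250-hour archive
```

On these assumptions, a 90-minute recording would cost about $0.60 and a 250-hour archive about $100.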
