Speech to Text | ElevenLabs Documentation

Overview

The ElevenLabs Speech to Text (STT) API turns spoken audio into text with state-of-the-art accuracy. Our Scribe v1 model adapts to textual cues across 99 languages and multiple voice styles and can be used to:

Transcribe podcasts, interviews, and other audio or video content
Generate transcripts for meetings and other audio or video recordings

Developer tutorial

Learn how to integrate speech to text into your application.

Product guide

Step-by-step guide for using speech to text in ElevenLabs.

Companies requiring HIPAA compliance must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA). Please ensure this step is completed before proceeding with any HIPAA-related integrations or deployments.

State-of-the-art accuracy

The Scribe v1 model can transcribe audio from up to 32 speakers with high accuracy. Optionally, it can also transcribe audio events like laughter, applause, and other non-speech sounds.

The transcribed output supports exact timestamps for each word and audio event, plus diarization to identify the speaker for each word.
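As an illustration of how per-word timestamps and speaker labels might be consumed, the sketch below merges consecutive words from the same speaker into timed turns. The input shape is an assumption made for this example, not a confirmed response schema: each entry is taken to be a dict with text, start, end, type, and speaker_id keys, and the names Turn, group_by_speaker, and SAMPLE_WORDS are hypothetical. Check the API reference for the actual field names before relying on it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    """One uninterrupted stretch of output attributed to a single speaker."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_by_speaker(words: List[Dict]) -> List[Turn]:
    """Merge consecutive entries that share a speaker into turns.

    Assumes (hypothetically) that each entry carries "text", "start",
    "end", and "speaker_id" keys, and that spacing is delivered as its
    own entries, so texts can be concatenated directly.
    """
    turns: List[Turn] = []
    for word in words:
        speaker = word.get("speaker_id", "unknown")
        if turns and turns[-1].speaker == speaker:
            # Same speaker as the previous entry: extend the current turn.
            turns[-1].text += word["text"]
            turns[-1].end = word["end"]
        else:
            # Speaker changed (or first entry): open a new turn.
            turns.append(Turn(speaker, word["start"], word["end"], word["text"]))
    for turn in turns:
        turn.text = turn.text.strip()
    return turns


# Hand-written sample in the assumed shape, including one audio event.
SAMPLE_WORDS = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.4, "end": 0.5, "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.9, "type": "word", "speaker_id": "speaker_0"},
    {"text": "(laughter)", "start": 1.0, "end": 1.6, "type": "audio_event", "speaker_id": "speaker_1"},
    {"text": " ", "start": 1.6, "end": 1.7, "type": "spacing", "speaker_id": "speaker_1"},
    {"text": "Hi", "start": 1.7, "end": 1.9, "type": "word", "speaker_id": "speaker_1"},
]

if __name__ == "__main__":
    for turn in group_by_speaker(SAMPLE_WORDS):
        print(f"[{turn.start:.1f}s to {turn.end:.1f}s] {turn.speaker}: {turn.text}")
```

Grouping by speaker change rather than by pause length keeps the sketch simple; a real transcript viewer might also split long turns at silences using the gaps between end and start values.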

The Scribe v1 model is best suited to use cases where high-accuracy transcription matters more than real-time transcription. A low-latency, real-time version will be released soon.

Pricing

Developer API

Product interface pricing

Tier	Price/month	Hours included	Price per included hour	Price per additional hour
Free	$0	Unavailable	Unavailable	Unavailable
Starter	$5	12 hours 30 minutes	$0.40	Unavailable