AI Voice & Audio API Guide 2026
Speech-to-text, text-to-speech, and realtime voice APIs compared. OpenAI, Google, ElevenLabs, Deepgram pricing, quality, and integration patterns.
Voice is the fastest-growing interface for AI applications. From real-time voice assistants to podcast transcription, the audio API landscape in 2026 has matured dramatically. OpenAI's Realtime-2 API enables true conversational voice AI, ElevenLabs produces near-human speech synthesis, and speech-to-text accuracy approaches human parity on clean audio in widely spoken languages. This guide covers the major voice and audio APIs, with pricing, quality benchmarks, and implementation patterns.
Speech-to-Text (STT / ASR)
Automatic Speech Recognition converts audio to text. The key metrics: Word Error Rate (WER), language support, and latency.
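Since WER anchors the comparisons later in this guide, it helps to see how it is computed. The sketch below is a minimal, self-contained version: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the lowercase/whitespace normalization are illustrative assumptions; published benchmarks usually apply heavier text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Running this over a few hundred of your own recordings, with transcripts you trust, is usually more informative than any vendor-reported figure.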
OpenAI Whisper / Realtime-Whisper
OpenAI offers two STT products in 2026:
| Product | Latency | Price | Best For |
|---|---|---|---|
| Whisper (batch) | ~30s for 10min audio | $0.006/min | File transcription, podcasts |
| Realtime-Whisper | Streaming (real-time) | $0.017/min | Live transcription, voice apps |
from openai import OpenAI
client = OpenAI()
# Batch transcription (file upload)
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )
print(transcript.text)
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
Whisper supports 99 languages and produces excellent results for most. It struggles with heavy accents, overlapping speech, and domain-specific jargon.
Google Chirp / Speech-to-Text V2
Google's latest speech model handles the most challenging audio conditions:
- Best-in-class for noisy environments and phone-call quality audio
- Automatic language detection (no need to specify language)
- Speaker diarization built-in (identify who said what)
- Streaming and batch modes
from google.cloud import speech_v2
client = speech_v2.SpeechClient()
project = "your-gcp-project-id"  # placeholder: replace with your own project ID
# Streaming recognition
config = speech_v2.RecognitionConfig(
    model="chirp_2",
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    features=speech_v2.RecognitionFeatures(
        enable_word_time_offsets=True,
        enable_automatic_punctuation=True,
    ),
)
request = speech_v2.StreamingRecognizeRequest(
    recognizer=f"projects/{project}/locations/global/recognizers/_",
    streaming_config=speech_v2.StreamingRecognitionConfig(
        config=config,
        streaming_features=speech_v2.StreamingRecognitionFeatures(
            enable_interim_results=True,
        ),
    ),
)
# The first request carries only the configuration. Audio chunks (for example
# from a microphone) must follow as further StreamingRecognizeRequest messages.
stream = client.streaming_recognize(requests=[request])
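The snippet above sends only the configuration request. In streaming APIs of this shape, the configuration goes first and the audio follows in small chunks. The sketch below shows that ordering without depending on any SDK. `request_stream`, `make_audio_request`, and the default chunk size are illustrative assumptions, so check the provider's documented per-message limit before relying on the number.

```python
from typing import Callable, Iterator, TypeVar

Request = TypeVar("Request")


def request_stream(
    config_request: Request,
    audio: bytes,
    make_audio_request: Callable[[bytes], Request],
    chunk_size: int = 25_600,  # assumed per-message limit, not taken from this guide
) -> Iterator[Request]:
    """Yield the config request first, then one request per audio chunk."""
    yield config_request
    for start in range(0, len(audio), chunk_size):
        yield make_audio_request(audio[start:start + chunk_size])
```

With the Google client, `make_audio_request` would plausibly be `lambda chunk: speech_v2.StreamingRecognizeRequest(audio=chunk)`, and the generator would be passed as the `requests` argument. For live microphone input, the same idea applies with the audio pulled from a queue instead of a byte string.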
Deepgram
Deepgram specializes in ultra-low-latency speech recognition, making it the go-to for real-time applications:
- Latency: ~300ms end-to-end (fastest in class)
- Price: $0.0043/min (pay-as-you-go), volume discounts available
- Key feature: On-prem deployment available for healthcare/finance
from deepgram import DeepgramClient, PrerecordedOptions
deepgram = DeepgramClient("YOUR_API_KEY")
options = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    punctuate=True,
    diarize=True,
    utterances=True,
)
source = {"url": "https://example.com/audio.mp3"}
# Synchronous client, so no await is needed outside an async function
result = deepgram.listen.prerecorded.v("1").transcribe_url(source, options)
for utterance in result.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")
STT Comparison
| Provider | WER (English) | Latency | Languages | Price/min |
|---|---|---|---|---|
| OpenAI Whisper | ~5% | Batch only | 99 | $0.006 |
| OpenAI Realtime-Whisper | ~6% | Streaming | 99 | $0.017 |
| Google Chirp 2 | ~4% | Streaming | 125+ | $0.016 |
| Deepgram Nova-3 | ~5% | ~300ms | 40+ | $0.0043 |
| Azure Speech | ~5% | Streaming | 100+ | $0.01 |
Text-to-Speech (TTS)
TTS has improved dramatically — modern systems produce speech that's often indistinguishable from human recordings.
ElevenLabs
The quality leader in TTS. ElevenLabs voices are nearly indistinguishable from human speech:
| Plan | Characters/mo | Price | Key Features |
|---|---|---|---|
| Free | 10,000 | $0 | Basic voices |
| Starter | 30,000 | $5/mo | Voice cloning (1 min sample) |
| Creator | 100,000 | $22/mo | Professional voice cloning |
| Pro | 500,000 | $99/mo | Projects, high concurrency |
from elevenlabs import ElevenLabs
client = ElevenLabs(api_key="YOUR_API_KEY")
# Generate speech
audio = client.text_to_speech.convert(
    text="Welcome to the future of AI voice technology.",
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel voice
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
)
# Save to file
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
# Clone a voice from a sample
voice = client.voices.clone(
    name="My Voice",
    files=["sample.mp3"],  # 1+ minute of clean audio
)
ElevenLabs voice cloning requires just 1 minute of sample audio. The cloned voice captures accent, pacing, and emotion remarkably well. This has significant implications for accessibility and personalization, but also for misuse. Always verify that the person whose voice is being cloned has consented.
OpenAI TTS
OpenAI's built-in TTS is simpler but cost-effective:
from openai import OpenAI
from pathlib import Path
client = OpenAI()
speech_file = Path("output.mp3")
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",  # alloy, echo, fable, onyx, nova, shimmer
    input="The quick brown fox jumps over the lazy dog.",
)
response.stream_to_file(speech_file)
OpenAI TTS pricing: $0.015/1K characters (standard), $0.030/1K characters (HD). Six built-in voices. Simple and reliable but no voice cloning.
Google Cloud TTS
Google offers WaveNet and the newer Studio voices:
from google.cloud import texttospeech_v1
client = texttospeech_v1.TextToSpeechClient()
synthesis_input = texttospeech_v1.SynthesisInput(text="Hello world")
voice = texttospeech_v1.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Studio-O",  # Studio quality voice
)
audio_config = texttospeech_v1.AudioConfig(
    audio_encoding=texttospeech_v1.AudioEncoding.MP3,
    speaking_rate=1.0,
    pitch=0.0,
)
response = client.synthesize_speech(
    input=synthesis_input,
    voice=voice,
    audio_config=audio_config,
)
with open("output.mp3", "wb") as f:
    f.write(response.audio_content)
TTS Comparison
| Provider | Quality | Voice Cloning | Languages | Price/1K chars |
|---|---|---|---|---|
| ElevenLabs | Best | Yes (1 min) | 29 | ~$0.18 |
| OpenAI TTS HD | Good | No | 50+ | $0.030 |
| OpenAI TTS | Good | No | 50+ | $0.015 |
| Google Studio | Very Good | No | 50+ | $0.016 |
| Google WaveNet | Good | No | 50+ | $0.016 |
| Azure Neural | Good | Custom Neural Voice | 140+ | $0.016 |
Realtime Voice AI
The most exciting development in 2026 is true realtime voice conversation with LLMs. OpenAI's Realtime-2 API leads the way:
OpenAI Realtime-2
Realtime-2 enables natural, low-latency voice conversations where the user can interrupt the model:
| Token Type | Price per 1M tokens |
|---|---|
| Audio input | $32.00 |
| Audio output | $64.00 |
| Text input | $4.00 |
| Text output | $24.00 |
| Cached audio input | $0.40 |
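To turn the per-token prices above into a per-session figure, multiply each token count by its rate. The sketch below does only that arithmetic. The prices are copied from the table, while the function name and the token counts in the usage note are hypothetical, since the guide does not state how many tokens a minute of audio consumes.

```python
# USD per 1M tokens, copied from the table above
PRICE_PER_MILLION = {
    "audio_input": 32.00,
    "audio_output": 64.00,
    "text_input": 4.00,
    "text_output": 24.00,
    "cached_audio_input": 0.40,
}


def realtime_cost(token_counts: dict[str, int]) -> float:
    """Return the USD cost for a mapping of token type to token count."""
    total = 0.0
    for token_type, count in token_counts.items():
        total += count / 1_000_000 * PRICE_PER_MILLION[token_type]
    return total
```

For example, `realtime_cost({"audio_input": 500_000, "audio_output": 250_000})` combines half a million input tokens with a quarter million output tokens. Measure real token usage from your own sessions before budgeting with numbers like these.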
import asyncio
import base64
import json
import os
import websockets
API_KEY = os.environ["OPENAI_API_KEY"]
def play_audio(pcm_bytes):
    """Placeholder: write PCM16 bytes to your audio output device."""
async def realtime_voice():
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "OpenAI-Beta": "realtime=v2",
    }
    # Newer websockets releases (14+) rename extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful assistant.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 600,
                },
            },
        }))
        # Send audio from microphone
        # ... capture audio and send chunks
        # Receive responses
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                # Play audio chunk through speakers
                audio_data = base64.b64decode(event["delta"])
                play_audio(audio_data)
            elif event["type"] == "response.done":
                print("Turn complete")
asyncio.run(realtime_voice())
Key Realtime-2 features:
- Server-side Voice Activity Detection (VAD) — model knows when you stop talking
- User can interrupt the model mid-sentence
- Function calling works in voice mode
- ~300ms latency from end of speech to start of response
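Interruption handling is the feature most likely to need client-side work: when the user starts talking, any audio already queued for playback has to be dropped. The sketch below models that as a small state machine. The `response.audio.delta` and `response.done` event names come from the example above. `input_audio_buffer.speech_started` and `response.cancel` are assumed to carry over from the earlier Realtime API and should be checked against the Realtime-2 reference. `BargeInState` and `handle_server_event` are illustrative names.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class BargeInState:
    playback_queue: list[bytes] = field(default_factory=list)
    model_speaking: bool = False


def handle_server_event(event: dict, state: BargeInState) -> list[dict]:
    """Update playback state and return any client events to send back."""
    event_type = event.get("type")
    if event_type == "response.audio.delta":
        state.model_speaking = True
        state.playback_queue.append(base64.b64decode(event["delta"]))
    elif event_type == "response.done":
        state.model_speaking = False
    elif event_type == "input_audio_buffer.speech_started" and state.model_speaking:
        # User is talking over the model: drop queued audio and cancel the turn.
        state.playback_queue.clear()
        state.model_speaking = False
        return [{"type": "response.cancel"}]  # assumed client event name
    return []
```

Depending on how server-side VAD is configured, the server may cancel the response on its own, in which case only the local queue needs clearing.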
OpenAI Realtime-Translate
A dedicated real-time translation model: it takes speech in one language and outputs speech in another in real time:
# Realtime-Translate pricing: $0.034 per minute
# Translates speech in real-time, keeping pace with the speaker
This is purpose-built for live translation scenarios (conferences, customer support, etc.) and outperforms STT → translate → TTS pipelines in both quality and latency.
Implementation Patterns
Pattern 1: Voice Chatbot
The most common pattern — a voice interface to your LLM backend:
# Architecture: Microphone → STT → LLM → TTS → Speaker
# capture_audio, transcribe, stream_llm, synthesize, and play_audio are
# placeholders for your own implementations.
async def voice_chat_loop():
    while True:
        # 1. Capture audio from microphone
        audio = await capture_audio()
        # 2. Transcribe
        text = await transcribe(audio)
        if not text.strip():
            continue
        # 3. Get LLM response (streaming)
        response_text = ""
        async for chunk in stream_llm(text):
            response_text += chunk
        # 4. Synthesize speech
        audio_response = await synthesize(response_text)
        # 5. Play to user
        await play_audio(audio_response)
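The loop above collects the whole LLM response before synthesizing it, so the user hears nothing until generation finishes. One common refinement is to cut the token stream into sentences and hand each one to TTS as soon as it is complete. The sketch below shows only the sentence-splitting step. The `sentences` helper and its punctuation-plus-whitespace regex are illustrative assumptions and will mis-split some abbreviations.

```python
import asyncio
import re
from typing import AsyncIterator

# A sentence ends at ., ! or ? followed by whitespace
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Regroup streamed text chunks into complete sentences."""
    buffer = ""
    async for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

In the loop, step 3 would then become `async for sentence in sentences(stream_llm(text)):` with synthesis and playback moved inside that loop, so the first sentence plays while later ones are still being generated.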
Pattern 2: Meeting Transcription + Summary
async def meeting_transcription(audio_stream):
    full_transcript = []
    # Real-time transcription
    async for segment in deepgram_stream(audio_stream):
        full_transcript.append({
            "speaker": segment.speaker,
            "text": segment.text,
            "timestamp": segment.start,
        })
        # Show live transcript in UI
        update_live_view(segment)
    # Generate summary with LLM
    transcript_text = "\n".join(
        f"[{s['timestamp']}] {s['speaker']}: {s['text']}"
        for s in full_transcript
    )
    summary = await llm_summarize(transcript_text)
    return {"transcript": full_transcript, "summary": summary}
Pattern 3: Multilingual Voice Agent
async def multilingual_agent(audio_input, target_language="en"):
    # 1. Detect language and transcribe
    detected_lang, text = await detect_and_transcribe(audio_input)
    # 2. Process with LLM (understands all languages)
    response = await llm_chat(text)
    # 3. Synthesize in target language
    if detected_lang != target_language:
        response = await translate(response, target_language)
    audio_response = await synthesize(response, language=target_language)
    return audio_response
Cost Estimates
For a voice-enabled chatbot handling 10,000 conversations per day, 30 seconds average each:
| Component | Provider | Daily Cost | Monthly Cost |
|---|---|---|---|
| STT (input) | Deepgram Nova-3 | $21.50 | $645 |
| LLM (processing) | GPT-5.4 mini | $3.75 | $112 |
| TTS (output) | OpenAI TTS HD | $56.25 | $1,687 |
| Total | | $81.50 | $2,444 |
Or using OpenAI Realtime-2 end-to-end (no separate STT/TTS): approximately $2,400/month for the same volume.
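The component figures in the table can be reproduced with a few lines of arithmetic, which makes it easy to substitute your own volumes. The STT and TTS prices come from the comparison tables earlier in this guide. The 187.5 characters per conversation is an assumption back-derived from the $56.25 TTS row and is not stated in the text, and the LLM figure is taken directly from the table.

```python
CONVERSATIONS_PER_DAY = 10_000
SECONDS_PER_CONVERSATION = 30
STT_PRICE_PER_MIN = 0.0043       # Deepgram Nova-3
TTS_PRICE_PER_1K_CHARS = 0.030   # OpenAI TTS HD
CHARS_PER_CONVERSATION = 187.5   # assumption, back-derived from the TTS row
DAILY_LLM_COST = 3.75            # taken directly from the table

audio_minutes = CONVERSATIONS_PER_DAY * SECONDS_PER_CONVERSATION / 60
daily_stt = audio_minutes * STT_PRICE_PER_MIN
daily_tts = (
    CONVERSATIONS_PER_DAY * CHARS_PER_CONVERSATION / 1000 * TTS_PRICE_PER_1K_CHARS
)
daily_total = daily_stt + DAILY_LLM_COST + daily_tts
monthly_total = daily_total * 30

print(f"STT ${daily_stt:.2f}/day, TTS ${daily_tts:.2f}/day")
print(f"Total ${daily_total:.2f}/day, ${monthly_total:.2f}/month")
```

The monthly total computed this way differs from the table's $2,444 by about a dollar because the table sums monthly figures that were already rounded down.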
Common Pitfalls
- Ignoring audio format requirements — Each API expects specific sample rates (16kHz, 44.1kHz) and formats (PCM16, MP3, WAV). Mismatched formats produce garbled output.
- Not handling silence — VAD (Voice Activity Detection) is crucial. Without it, you send silence to the STT API, wasting money and getting empty results.
- Sequential STT → LLM → TTS instead of streaming — Waiting for complete transcription before sending to LLM adds latency. Stream each component.
- Underestimating TTS costs — TTS is often the most expensive component. A 100-character response costs more in TTS than in LLM processing.
- Not testing with real accents and noise — Clean studio audio works great with every STT. Real-world noisy phone calls are a different story.
- Voice cloning without consent — Ethical and legal requirements vary by jurisdiction. Always get explicit consent before cloning someone's voice.
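For the silence pitfall, even a crude energy gate in front of the STT call avoids paying for dead air. The sketch below filters PCM16 mono frames by root-mean-square amplitude. It is a stand-in for a real VAD, not an equivalent. The function names and the threshold of 500 are illustrative assumptions that need tuning per microphone and environment.

```python
import array
import math
from typing import Iterable, Iterator


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a PCM16 mono frame.

    Samples are read in the machine's native byte order, which matches
    little-endian PCM16 on most current hardware.
    """
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_frames(frames: Iterable[bytes], threshold: float = 500.0) -> Iterator[bytes]:
    """Yield only the frames whose energy reaches the threshold."""
    for frame in frames:
        if rms(frame) >= threshold:
            yield frame
```

A production gate would also keep a short padding of frames before and after detected speech so that word onsets and trailing consonants are not clipped.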
Conclusion
Voice AI has crossed the quality threshold in 2026. For speech-to-text, Deepgram offers the best latency and value; Google Chirp 2 handles the hardest audio conditions. For text-to-speech, ElevenLabs is the quality king, while OpenAI TTS is the value choice. And for real-time voice conversations, OpenAI's Realtime-2 API is the only production-ready option today.
The key decision: do you need STT → LLM → TTS (more control, lower cost) or Realtime-2 (simpler, lower latency)? For most production voice chatbots, Realtime-2 is worth the premium. For transcription and other non-conversational use cases, the component approach gives you more flexibility and lower cost.