Model Pricing
Understand voice and audio through a suite of leading audio-native AI models, with options for latency-sensitive real-time applications, large-scale batch uploads, and anything in between.
Velma is our flagship audio-native model for conversation understanding. It analyzes the call itself, not just a transcript.
Its core feature is behaviors: describe what to watch for in plain language, and Velma flags each instance with a confidence score, its reasoning, and the clips it fired on. 146 editable templates ship across safety, compliance, and customer experience.
Every call also returns a diarized transcript, conversation type, speaker roles, topics with sentiment, and a summary.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | Full conversation understanding for recordings — one REST call returns a diarized transcript, conversation type, speaker roles, topics, sentiment, a summary, and every behavior you configure | Real-time conversation understanding over WebSocket — clips, behavior detections, and aggregate results are emitted progressively as audio streams in |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats |
| Concurrency Quota | 5 concurrent requests | 5 concurrent requests |
| Monthly Usage Quota | 1,000 hours | 1,000 hours |
| Pricing | $1.25 / hour | $1.25 / hour |
| Dynamic Behaviors | ||
| Fully Customizable | Define exactly what Velma should detect in plain language — a name and a description — and scope each behavior to specific conversation types or speaker roles | |
| Behavior Templates | 146 ready-made behaviors to get you started, spanning safety, compliance, and customer experience — use any as-is or as a starting point for your own | |
| Confidence Scoring | ✓ | ✓ |
| Reasoning per Detection | ✓ | ✓ |
| Evidence Clips | Every detection links the clips it fired on, with a single definitive clip | |
| Returned on Every Call | ||
| Diarized Transcription | ✓ | ✓ |
| Word-Level Timestamps | ✓ | ✓ |
| Language | Detected per clip; optionally set a language hint for the conversation | |
| Conversation Type | Inferred for the session with a confidence score and reasoning — or set your own default | |
| Participant Roles | Per-speaker role inference (e.g. agent vs. caller) with confidence and reasoning | |
| Topics | ✓ | ✓ |
| Topic Sentiment | Per-speaker sentiment for each topic, scored from −1 to 1 | |
| Conversation Summary | ✓ | ✓ |
| Optional Signals | ||
| Emotion Detection | ✓ | ✓ |
| Accent Identification | ✓ | ✓ |
| Deepfake Score | Per-clip synthetic-voice score from 0 to 1 | |
| PII/PHI Tagging | ✓ | ✓ |
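
A behavior is described above as a name plus a plain-language description, optionally scoped to conversation types or speaker roles, and each detection carries a confidence score. The sketch below shows one way a client might model that locally. The class, the field names, and the `confidence` key are assumptions made for illustration; they are not taken from the actual request or response schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Behavior:
    """Hypothetical client-side shape for one behavior definition."""
    name: str
    description: str
    # Empty lists mean "no scoping": apply to every conversation type / role.
    conversation_types: list[str] = field(default_factory=list)
    speaker_roles: list[str] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Return a plain dict, rejecting definitions with a blank name or description."""
        if not self.name.strip() or not self.description.strip():
            raise ValueError("a behavior needs both a name and a description")
        return asdict(self)


def confident_detections(detections: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep detections whose confidence meets the threshold (key name is assumed)."""
    return [d for d in detections if d["confidence"] >= threshold]
```

Filtering on confidence client-side keeps the behavior definition itself simple; the 0.8 threshold is arbitrary and would need tuning per behavior.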
English-only transcription optimized for speed. The batch API trades enrichment features for the lowest possible turnaround, and the streaming WebSocket emits rolling partial transcripts every ~1.5 seconds — built for live captions, voice assistants, and high-throughput pipelines.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | High-throughput English batch processing with >200x real-time speed | Low-latency English streaming transcription via WebSocket with real-time partial transcripts |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats |
| Concurrency Quota | 25 concurrent requests | 5 concurrent requests |
| Monthly Usage Quota | 10,000 hours | 10,000 hours |
| Pricing | $0.025 / hour | $0.05 / hour |
| Built-in features | ||
| Transcription | ✓ | ✓ |
| Auto Capitalization | ✓ | ✓ |
| Auto Punctuation | ✓ | ✓ |
| Language | English | English |
| Real-Time | | ✓ |
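
The concurrency quotas in the table (25 concurrent batch requests, 5 streaming) are easiest to respect by capping in-flight requests on the client. The following is a minimal sketch using `asyncio.Semaphore`; `run_with_quota` and the `worker` coroutine are hypothetical names, and the actual HTTP call is left to the caller.

```python
import asyncio


async def run_with_quota(jobs, worker, limit=25):
    """Run worker(job) for every job, never exceeding `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each task waits here until one of the `limit` slots is free.
        async with gate:
            return await worker(job)

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A caller would pass its own upload coroutine as `worker`, for example `asyncio.run(run_with_quota(paths, transcribe_file, limit=25))`, where `transcribe_file` is whatever function wraps the REST request.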
Transcription in 70+ languages with per-utterance timing, speaker diarization, and optional emotion, accent, deepfake, and PII/PHI signals — available as a batch REST API and a real-time streaming WebSocket.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | Multilingual batch transcription in 70+ languages with full feature set | Real-time streaming transcription in 70+ languages via WebSocket |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM |
| Concurrency Quota | 5 concurrent requests | 5 concurrent requests |
| Monthly Usage Quota | 1,000 hours | 1,000 hours |
| Pricing | $0.03 / hour | $0.06 / hour |
| Built-in features | ||
| Transcription | ✓ | ✓ |
| Auto Capitalization | ✓ | ✓ |
| Auto Punctuation | ✓ | ✓ |
| Language | 70+ languages | 70+ languages |
| Real-Time | | ✓ |
| Optional features | ||
| Speaker Diarization | ✓ | ✓ |
| Emotion Detection | ✓ | ✓ |
| Accent Identification | ✓ | ✓ |
| PII/PHI Tagging | $0.02 / hour | $0.02 / hour |
| Deepfake Detection | $0.25 / hour | $0.25 / hour |
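
To budget for the multilingual model with paid add-ons, the rates in the table can be combined per audio hour. The sketch below assumes that add-on rates are simply added to the base rate for every hour processed, which the table suggests but does not state outright. The function and dictionary names are illustrative.

```python
from decimal import Decimal

# Rates copied from the table above, in USD per audio hour.
BASE_RATE = {"batch": Decimal("0.03"), "streaming": Decimal("0.06")}
ADD_ON_RATE = {"pii_phi": Decimal("0.02"), "deepfake": Decimal("0.25")}


def estimate_cost(hours, mode, add_ons=()):
    """Estimate cost in USD, assuming add-on rates stack on the base rate."""
    hourly = BASE_RATE[mode] + sum((ADD_ON_RATE[a] for a in add_ons), Decimal("0"))
    # Decimal avoids binary floating-point drift in currency arithmetic.
    return (Decimal(str(hours)) * hourly).quantize(Decimal("0.01"))
```

Under that assumption, 500 hours of batch audio with both add-ons comes to `estimate_cost(500, "batch", ["pii_phi", "deepfake"])`, which is 500 × (0.03 + 0.02 + 0.25) = 150.00 USD.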
The Deepfake Detection models achieve state-of-the-art performance detecting speech deepfakes on single-speaker audio.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | Speech deepfake prediction for audio files with a single speaker | Real-time speech deepfake detection for streaming audio with a single speaker |
| Accuracy | 98.9% average — #1 on Speech DF Arena | 98.9% average — #1 on Speech DF Arena |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | WAV, AIFF, FLAC, MP3, AAC, OGG, WebM, and raw PCM (various bit depths and encodings) |
| Concurrency Quota | 3 concurrent requests | 3 concurrent requests |
| Monthly Usage Quota | 1,000 hours | 1,000 hours |
| Pricing | $0.25 / hour | $0.25 / hour |
| Built-in features | ||
| Per-Window Prediction | ✓ | ✓ |
| Confidence Scoring | ✓ | ✓ |
| Silence Detection | ✓ | ✓ |
| Flexible Chunk Size | ✓ | |
| Real-Time | | ✓ |
| First Prediction | After full file upload | At 500ms of audio |
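
The streaming endpoint returns its first prediction after 500 ms of audio. For raw PCM, that duration corresponds to a byte count that depends on the audio format. The sketch below does the arithmetic and splits a buffer into chunks; the defaults (16 kHz, 16-bit, mono) are assumptions chosen for illustration, since the table only says raw PCM is accepted in various bit depths and encodings.

```python
def pcm_bytes(duration_ms, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Number of raw PCM bytes covering `duration_ms` of audio."""
    frames = sample_rate_hz * duration_ms // 1000
    return frames * bytes_per_sample * channels


def iter_chunks(pcm, chunk_bytes):
    """Yield consecutive slices of `pcm`; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With the assumed format, `pcm_bytes(500)` is 16,000 bytes, so sending at least that much audio is what it takes to reach the first prediction.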
PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio, then detects and redacts sensitive entities in both the transcript text and the audio itself. It returns a redacted transcript and a clean MP3, and supports 100+ entity types. It is available as a batch REST API and as a real-time streaming WebSocket.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | Transcribe an audio file and receive a redacted transcript and redacted MP3 with PII/PHI ranges silenced | Real-time PII/PHI redaction via WebSocket — streams audio in, receives redacted transcript text and redacted MP3 clips as each utterance completes |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats |
| Pricing | $0.05 / hour | $0.08 / hour |
| Outputs | ||
| Redacted Transcript | ✓ | ✓ |
| Redacted Audio (MP3) | ✓ | ✓ |
| Redaction Ranges | Full list of silenced [start_ms, end_ms] ranges | Per-utterance timing with each audio clip |
| PII/PHI Categories Detected | ||
| Coverage | 100+ PII/PHI entity types across all categories | |
| Names & Identifiers | Full names, usernames, dates of birth | |
| Government IDs | SSN, passport, driver's license | |
| Addresses & Locations | Street addresses, ZIP codes, cities | |
| Contact Details | Phone numbers, email addresses | |
| Financial Info | Credit card numbers, bank accounts | |
| Medical / Health Info (PHI) | Diagnoses, medications, health record IDs | |
| Built-in features | ||
| Speaker Diarization | ✓ | ✓ |
| Configurable Redaction Padding | Adjustable silence buffer before and after each redacted range | |
| Real-Time | | ✓ |
| First Result | After full file upload | At utterance completion |
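
Redaction padding widens each silenced `[start_ms, end_ms]` range by a buffer on both sides, which can cause neighboring ranges to overlap. The sketch below illustrates that effect on a list of ranges. It is a client-side illustration under assumed semantics (symmetric padding, clamped to the audio bounds), not a description of how the service computes its ranges.

```python
def pad_and_merge(ranges, padding_ms, duration_ms):
    """Pad each [start_ms, end_ms] range, clamp to the audio, merge overlaps."""
    padded = sorted(
        (max(0, start - padding_ms), min(duration_ms, end + padding_ms))
        for start, end in ranges
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

For example, with 100 ms of padding on a 5,250 ms file, the ranges `[1000, 1500]` and `[1600, 2000]` grow into each other and become a single `[900, 2100]` range.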
The Music & Speech Detection models classify audio at the frame level (~192ms resolution), returning independent music and speech probabilities for every frame. Useful for content moderation, ad-break detection, podcast segmentation, and audio analytics — available as both a low-latency batch API and a real-time streaming WebSocket.
| Model Type | Batch | Streaming |
|---|---|---|
| Description | Frame-level music and speech classification for audio files, with a primary label and percentage breakdown across the clip | Real-time frame-level music and speech classification via WebSocket — frames are emitted progressively as audio streams in |
| API | REST API | WebSocket |
| Accepted Files | AAC, FLAC, M4A, MP3, MP4, OGG, Opus, WAV | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM, and raw PCM (various bit depths and encodings) |
| Concurrency Quota | 3 concurrent requests | 3 concurrent requests |
| Monthly Usage Quota | 1,000 hours | 1,000 hours |
| Pricing | $0.02 / hour | $0.02 / hour |
| Built-in features | ||
| Music Detection | ✓ | ✓ |
| Speech Detection | ✓ | ✓ |
| Frame-Level Output | Per-frame probabilities at ~192ms resolution | |
| Non-Exclusive Labels | Music and speech probabilities are independent — both can be high simultaneously (e.g. music with vocals) | |
| Primary Label | Overall clip classification: music, speech, or neither | |
| Percentage Breakdown | Percentage of clip where music / speech probability ≥ 0.5 | |
| Flexible Chunk Size | Send audio in any chunk size that suits your pipeline | |
| Real-Time | | ✓ |
| First Result | After full file upload and processing | After first ~192ms of audio |
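
The percentage breakdown in the table is defined as the share of the clip where the music or speech probability is at least 0.5. The sketch below reproduces that calculation from per-frame output, which is useful when aggregating streaming frames yourself. The frame shape (a dict with `music` and `speech` keys) is an assumption, not the documented response format.

```python
def percentage_breakdown(frames, threshold=0.5):
    """Percent of frames whose music / speech probability is >= threshold."""
    if not frames:
        return {"music": 0.0, "speech": 0.0}
    total = len(frames)
    music = sum(1 for f in frames if f["music"] >= threshold)
    speech = sum(1 for f in frames if f["speech"] >= threshold)
    return {"music": 100.0 * music / total, "speech": 100.0 * speech / total}
```

Because the two labels are independent, the percentages can add up to more than 100, for instance on a song with vocals.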
Language Detection identifies the spoken language of an audio file from 100 supported languages, returning a human-readable language name, an ISO 639-1 code, and a confidence score in a single synchronous request. Only the first 30 seconds of audio are analyzed.
| Description | Identify the spoken language of an audio file in a single synchronous request — returns the language name, its ISO code, and a confidence score |
| API | REST API |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM |
| Maximum File Size | 100 MB |
| Analysis Window | First 30 seconds of audio (longer files are accepted, but the additional audio is ignored) |
| Language Coverage | Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba |
| Concurrency Quota | 25 concurrent requests |
| Monthly Usage Quota | 10,000 hours |
| Pricing | $0.01 / hour |
| Outputs | |
| Detected Language | Human-readable name (e.g. English, French, Mandarin) |
| Language Code | Lowercase ISO 639-1 code (e.g. en, fr, zh) |
| Confidence Score | Probability from 0.0 to 1.0 for the predicted language |
| Audio Duration | Total decoded audio duration in milliseconds |
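
The four outputs above map naturally onto a small typed record. The sketch below validates a response and exposes how much audio was actually analyzed, given the 30-second window. The JSON key names are hypothetical. The code check accepts two or three lowercase letters as a precaution, because some listed languages (Cantonese and Hawaiian, for example) have no two-letter ISO 639-1 code.

```python
import re
from dataclasses import dataclass

ANALYSIS_WINDOW_MS = 30_000  # only the first 30 seconds are analyzed


@dataclass(frozen=True)
class LanguageResult:
    language: str
    code: str
    confidence: float
    duration_ms: int

    @property
    def analyzed_ms(self) -> int:
        """Audio that contributed to the prediction, capped at the window."""
        return min(self.duration_ms, ANALYSIS_WINDOW_MS)


def parse_result(payload: dict) -> LanguageResult:
    """Validate a response dict (key names are assumed) and build a record."""
    code = payload["language_code"]
    if not re.fullmatch(r"[a-z]{2,3}", code):
        raise ValueError(f"unexpected language code: {code!r}")
    confidence = float(payload["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return LanguageResult(payload["language"], code, confidence, int(payload["duration_ms"]))
```

For a 45-second file, `analyzed_ms` is 30,000 while `duration_ms` stays 45,000, which makes it explicit that the last 15 seconds did not influence the result.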