Model Pricing

Understand voice and audio through a suite of leading audio-native AI models, with options for latency-sensitive real-time applications, large-scale batch uploads, and anything in between.

Velma Triage

Our flagship conversation-understanding audio-native model. It analyzes the call itself, not just a transcript.

Its core feature is behaviors: describe what to watch for in plain language, and Velma flags each instance with confidence, reasoning, and the clips it fired on — 146 editable templates ship across safety, compliance, and customer experience.

Every call also returns a diarized transcript, conversation type, speaker roles, topics with sentiment, and a summary.

Model Type	Batch	Streaming
	Batch	Streaming
Description	Full conversation understanding for recordings — one REST call returns a diarized transcript, conversation type, speaker roles, topics, sentiment, a summary, and every behavior you configure	Real-time conversation understanding over WebSocket — clips, behavior detections, and aggregate results are emitted progressively as audio streams in
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Concurrency Quota	5 concurrent requests	5 concurrent requests
Monthly Usage Quota	1,000 hours	1,000 hours
Pricing	$1.25 / hour	$1.25 / hour
Dynamic Behaviors
	Batch	Streaming
Fully Customizable	Define exactly what Velma should detect in plain language — a name and a description — and scope each behavior to specific conversation types or speaker roles
Behavior Templates	146 ready-made behaviors to get you started, spanning safety, compliance, and customer experience — use any as-is or as a starting point for your own
Confidence Scoring	✓	✓
Reasoning per Detection	✓	✓
Evidence Clips	Every detection links the clips it fired on, with a single definitive clip
Returned on Every Call
	Batch	Streaming
Diarized Transcription	✓	✓
Word-Level Timestamps	✓	✓
Language	Detected per clip; optionally set a language hint for the conversation
Conversation Type	Inferred for the session with a confidence score and reasoning — or set your own default
Participant Roles	Per-speaker role inference (e.g. agent vs. caller) with confidence and reasoning
Topics	✓	✓
Topic Sentiment	Per-speaker sentiment for each topic, scored from −1 to 1
Conversation Summary	✓	✓
Optional Signals
	Batch	Streaming
Emotion Detection	✓	✓
Accent Identification	✓	✓
Deepfake Score	Per-clip synthetic-voice score from 0 to 1
PII/PHI Tagging	✓	✓
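
The concurrency quota above can be respected on the client side by capping in-flight requests. The following is a minimal sketch, not an official client: `analyze` is a hypothetical placeholder that only sleeps, and the file names are made up. Only the limit of 5 comes from the table.

```python
import asyncio

MAX_CONCURRENT = 5  # Velma Triage concurrency quota from the table above


async def analyze(path: str) -> str:
    """Hypothetical stand-in for a real API call; it only sleeps."""
    await asyncio.sleep(0.01)
    return f"done:{path}"


async def run_all(paths: list[str]) -> tuple[list[str], int]:
    """Process every path with at most MAX_CONCURRENT calls in flight.

    Returns the results in input order and the peak number of
    simultaneous calls that was observed.
    """
    gate = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0

    async def guarded(path: str) -> str:
        nonlocal in_flight, peak
        async with gate:  # waits here once 5 calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await analyze(path)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(p) for p in paths))
    return list(results), peak
```

Calling `asyncio.run(run_all(paths))` with more than five paths queues the extra ones until a slot frees up, so the client never exceeds the quota on its own.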

English Fast Transcription

English-only transcription optimized for speed. The batch API trades enrichment features for the lowest possible turnaround, and the streaming WebSocket emits rolling partial transcripts every ~1.5 seconds — built for live captions, voice assistants, and high-throughput pipelines.

Model Type	Batch	Streaming
	Batch	Streaming
Description	High-throughput English batch processing with >200x real-time speed	Low-latency English streaming transcription via WebSocket with real-time partial transcripts
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Concurrency Quota	25 concurrent requests	5 concurrent requests
Monthly Usage Quota	10,000 hours	10,000 hours
Pricing	$0.025 / hour	$0.05 / hour
Built-in features
	Batch	Streaming
Transcription	✓	✓
Auto Capitalization	✓	✓
Auto Punctuation	✓	✓
Language	English	English
Real-Time		✓
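
To make the batch versus streaming trade-off concrete, here is a small sketch that estimates cost and batch turnaround from the figures in the table. It assumes usage is billed linearly per audio hour (the page does not state rounding or minimums), and it reads ">200x real-time" as a ceiling on processing time that ignores upload and queueing.

```python
from decimal import Decimal

# USD per audio hour, copied from the table above.
RATE_PER_HOUR = {"batch": Decimal("0.025"), "streaming": Decimal("0.05")}
BATCH_SPEEDUP = 200  # ">200x real-time" from the batch description


def cost_usd(mode: str, audio_hours: Decimal) -> Decimal:
    """Estimated cost, assuming linear per-hour billing (an assumption)."""
    return RATE_PER_HOUR[mode] * audio_hours


def batch_turnaround_seconds(audio_hours: float) -> float:
    """Processing-time ceiling implied by the 200x figure.

    Excludes upload and queueing time, which the page does not quantify.
    """
    return audio_hours * 3600 / BATCH_SPEEDUP
```

Under these assumptions, 400 hours of audio costs about $10 in batch and $20 in streaming, and a 10-hour batch would finish processing in at most roughly 3 minutes.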

Multilingual Transcription

Model Type	Batch	Streaming
	Batch	Streaming
Description	Multilingual batch transcription in 70+ languages with full feature set	Real-time streaming transcription in 70+ languages via WebSocket
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Concurrency Quota	5 concurrent requests	5 concurrent requests
Monthly Usage Quota	1,000 hours	1,000 hours
Pricing	$0.03 / hour	$0.06 / hour
Built-in features
	Batch	Streaming
Transcription	✓	✓
Auto Capitalization	✓	✓
Auto Punctuation	✓	✓
Language	70+ languages	70+ languages
Real-Time		✓
Optional features
	Batch	Streaming
Speaker Diarization	✓	✓
Emotion Detection	✓	✓
Accent Identification	✓	✓
PII/PHI Tagging	$0.02 / hour	$0.02 / hour
Deepfake Detection	$0.25 / hour	$0.25 / hour
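
The priced optional features can be folded into an effective hourly rate. This sketch assumes the add-on prices are charged on top of the base transcription rate, which the table suggests but does not state outright. The dictionary keys are illustrative names, not API parameters.

```python
from decimal import Decimal

# USD per audio hour, copied from the table above.
BASE_RATE = {"batch": Decimal("0.03"), "streaming": Decimal("0.06")}
ADD_ON_RATE = {
    "pii_phi_tagging": Decimal("0.02"),     # illustrative key name
    "deepfake_detection": Decimal("0.25"),  # illustrative key name
}


def hourly_rate(mode: str, add_ons: tuple[str, ...] = ()) -> Decimal:
    """Base rate plus each selected add-on (additive pricing is an assumption)."""
    return BASE_RATE[mode] + sum((ADD_ON_RATE[name] for name in add_ons), Decimal("0"))


def monthly_cost(mode: str, audio_hours: Decimal, add_ons: tuple[str, ...] = ()) -> Decimal:
    """Estimated spend for a given number of audio hours."""
    return hourly_rate(mode, add_ons) * audio_hours
```

For example, streaming with both priced add-ons works out to $0.33 per hour under this reading, so 100 hours would cost about $33.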

Deepfake Detection

Model Type	Batch	Streaming
	Batch	Streaming
Description	Speech deepfake prediction for audio files with a single speaker	Real-time speech deepfake detection for streaming audio with a single speaker
Accuracy	98.9% average — #1 on Speech DF Arena	98.9% average — #1 on Speech DF Arena
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	WAV, AIFF, FLAC, MP3, AAC, OGG, WebM, and raw PCM (various bit depths and encodings)
Concurrency Quota	3 concurrent requests	3 concurrent requests
Monthly Usage Quota	1,000 hours	1,000 hours
Pricing	$0.25 / hour	$0.25 / hour
Built-in features
	Batch	Streaming
Per-Window Prediction	✓	✓
Confidence Scoring	✓	✓
Silence Detection	✓	✓
Flexible Chunk Size		✓
Real-Time		✓
First Prediction	After full file upload	At 500ms of audio
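
Because streaming accepts flexible chunk sizes and emits its first prediction at 500 ms of audio, it can help to work out how much raw PCM that is. The sketch below assumes 16 kHz, 16-bit, mono audio purely for illustration. The page only says that various bit depths and encodings are accepted.

```python
import math
from typing import Iterator

FIRST_PREDICTION_MS = 500  # from the table above


def pcm_bytes(duration_ms: int, sample_rate: int = 16_000,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of `duration_ms` of uncompressed PCM audio.

    The defaults (16 kHz, 16-bit, mono) are illustrative assumptions.
    """
    samples = sample_rate * duration_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(audio: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Split a PCM buffer into fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def chunks_until_first_prediction(chunk_ms: int) -> int:
    """How many chunks of `chunk_ms` are needed to cover the first 500 ms."""
    return math.ceil(FIRST_PREDICTION_MS / chunk_ms)
```

With these defaults, 500 ms corresponds to 16,000 bytes, so a client sending 100 ms chunks would have sent five chunks by the time the first prediction can arrive.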

PII/PHI Redaction

PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio in real time, then detects and redacts sensitive entities in both the transcript text and the audio. It returns a clean MP3 and a redacted transcript, with 100+ entity types supported.

Model Type	Batch	Streaming
	Batch	Streaming
Description	Transcribe an audio file and receive a redacted transcript and redacted MP3 with PII/PHI ranges silenced	Real-time PII/PHI redaction via WebSocket: stream audio in and receive redacted transcript text and redacted MP3 clips as each utterance completes
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Pricing	$0.05 / hour	$0.08 / hour
Outputs
	Batch	Streaming
Redacted Transcript	✓	✓
Redacted Audio (MP3)	✓	✓
Redaction Ranges	Full list of silenced `[start_ms, end_ms]` ranges	Per-utterance timing with each audio clip
PII/PHI Categories Detected
	Batch	Streaming
Coverage	100+ PII/PHI entity types across all categories
Names & Identifiers	Full names, usernames, dates of birth
Government IDs	SSN, passport, driver's license
Addresses & Locations	Street addresses, ZIP codes, cities
Contact Details	Phone numbers, email addresses
Financial Info	Credit card numbers, bank accounts
Medical / Health Info (PHI)	Diagnoses, medications, health record IDs
Built-in features
	Batch	Streaming
Speaker Diarization	✓	✓
Configurable Redaction Padding	Adjustable silence buffer before and after each redacted range
Real-Time		✓
First Result	After full file upload	At utterance completion
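
The service returns already-redacted audio, but the `[start_ms, end_ms]` ranges are also useful on their own, for instance to mute the same spans in another track on the same timeline. The sketch below shows how padding and range merging might work and how ranges map to sample indices. The function names and the merge rule are assumptions for illustration, not the service's implementation.

```python
Range = tuple[int, int]  # (start_ms, end_ms)


def pad_and_merge(ranges: list[Range], padding_ms: int, duration_ms: int) -> list[Range]:
    """Widen each range by `padding_ms` on both sides, clamp it to the audio
    duration, then merge ranges that overlap or touch."""
    padded = sorted(
        (max(0, start - padding_ms), min(duration_ms, end + padding_ms))
        for start, end in ranges
    )
    merged: list[Range] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def silence(samples: list[int], ranges: list[Range], sample_rate: int) -> list[int]:
    """Return a copy of mono PCM samples with every range zeroed out."""
    out = list(samples)
    for start_ms, end_ms in ranges:
        lo = min(len(out), start_ms * sample_rate // 1000)
        hi = min(len(out), end_ms * sample_rate // 1000)
        out[lo:hi] = [0] * (hi - lo)
    return out
```

A larger padding lowers the risk of clipped syllables leaking through at the edges of a redaction, at the cost of silencing slightly more of the surrounding speech.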

Music & Speech Detection

The Music & Speech Detection models classify audio at the frame level (~192ms resolution), returning independent music and speech probabilities for every frame. Useful for content moderation, ad-break detection, podcast segmentation, and audio analytics — available as both a low-latency batch API and a real-time streaming WebSocket.

Model Type	Batch	Streaming
	Batch	Streaming
Description	Frame-level music and speech classification for audio files, with a primary label and percentage breakdown across the clip	Real-time frame-level music and speech classification via WebSocket — frames are emitted progressively as audio streams in
API	REST API	WebSocket
Accepted Files	AAC, FLAC, M4A, MP3, MP4, OGG, Opus, WAV	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM, and raw PCM (various bit depths and encodings)
Concurrency Quota	3 concurrent requests	3 concurrent requests
Monthly Usage Quota	1,000 hours	1,000 hours
Pricing	$0.02 / hour	$0.02 / hour
Built-in features
	Batch	Streaming
Music Detection	✓	✓
Speech Detection	✓	✓
Frame-Level Output	Per-frame probabilities at ~192ms resolution
Non-Exclusive Labels	Music and speech probabilities are independent — both can be high simultaneously (e.g. music with vocals)
Primary Label	Overall clip classification: `music`, `speech`, or `neither`
Percentage Breakdown	Percentage of clip where music / speech probability ≥ 0.5
Flexible Chunk Size		Send audio in any chunk size that suits your pipeline
Real-Time		✓
First Result	After full file upload and processing	After first ~192ms of audio
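
The percentage breakdown described above can be reproduced from per-frame probabilities. In this sketch the 0.5 threshold comes from the table, while the input shape (a list of `(music, speech)` pairs) and the rule for choosing the primary label are assumptions, since the page does not define them.

```python
THRESHOLD = 0.5  # from the "Percentage Breakdown" row above


def summarize(frames: list[tuple[float, float]]) -> dict:
    """Summarize (music_probability, speech_probability) pairs, one per frame.

    The primary-label rule below is an assumption: the label with the larger
    share wins, and `neither` is used when both shares fall under 50%.
    """
    if not frames:
        raise ValueError("no frames to summarize")
    total = len(frames)
    music_pct = 100 * sum(m >= THRESHOLD for m, _ in frames) / total
    speech_pct = 100 * sum(s >= THRESHOLD for _, s in frames) / total
    if max(music_pct, speech_pct) < 50:
        label = "neither"
    elif music_pct >= speech_pct:
        label = "music"
    else:
        label = "speech"
    return {"music_pct": music_pct, "speech_pct": speech_pct, "primary_label": label}
```

Because the two probabilities are independent, the percentages need not sum to 100: a song with vocals can score high on both.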

Language Detection

Language Detection identifies the spoken language of an audio file from 100 supported languages, returning a human-readable language name, an ISO 639-1 code, and a confidence score in a single synchronous request. Only the first 30 seconds of audio are analyzed.

Description	Identify the spoken language of an audio file in a single synchronous request — returns the language name, its ISO code, and a confidence score
API	REST API
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Maximum File Size	100 MB
Analysis Window	First 30 seconds of audio (longer files are accepted, but the additional audio is ignored)
Language Coverage	Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
Concurrency Quota	25 concurrent requests
Monthly Usage Quota	10,000 hours
Pricing	$0.01 / hour
Outputs
Detected Language	Human-readable name (e.g. English, French, Mandarin)
Language Code	Lowercase ISO 639-1 code (e.g. `en`, `fr`, `zh`)
Confidence Score	Probability from 0.0 to 1.0 for the predicted language
Audio Duration	Total decoded audio duration in milliseconds
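
A client can check the documented limits before uploading. The sketch below is illustrative only: the mapping from format names to file extensions is an assumption, and "100 MB" is read conservatively as 100,000,000 bytes because the page does not say whether it means MB or MiB.

```python
from pathlib import Path

# Extensions guessed from the accepted formats listed above (an assumption).
ACCEPTED_EXTENSIONS = {
    ".aac", ".aiff", ".flac", ".mov", ".mp3",
    ".mp4", ".ogg", ".opus", ".wav", ".webm",
}
MAX_FILE_BYTES = 100 * 1000 * 1000  # conservative reading of "100 MB"
ANALYSIS_WINDOW_MS = 30_000         # only the first 30 seconds are analyzed


def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS:
        problems.append("unsupported file extension")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds the maximum size")
    return problems


def analyzed_ms(audio_duration_ms: int) -> int:
    """Portion of the audio that contributes to the prediction."""
    return min(audio_duration_ms, ANALYSIS_WINDOW_MS)
```

Since audio past the 30-second window is ignored, trimming long recordings before upload is one way to stay under the size limit without changing the result.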
