Velma Models

Velma handles transcription, emotion, accent, and engagement detection across 70+ languages. Choose the model that fits your use case.

Transcription

	Batch English Fast	Batch Multilingual	Streaming Multilingual
Description	High-throughput English batch processing with >200x real-time speed	Multilingual batch transcription in 70+ languages with full feature set	Real-time streaming transcription in 70+ languages via WebSocket
API	REST API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Concurrency Quota	25 concurrent requests	5 concurrent requests	5 concurrent requests
Monthly Usage Quota	10,000 hours	1,000 hours	1,000 hours
Pricing	$0.025 / hour	$0.03 / hour	$0.06 / hour
Built-in features
Transcription	✓	✓	✓
Auto Capitalization	✓	✓	✓
Auto Punctuation	✓	✓	✓
Language	English	70+ languages	70+ languages
Real-Time			✓
Optional features
Speaker Diarization		✓	✓
Emotion Detection		✓	✓
Accent Identification		✓	✓
PII/PHI Tagging		✓	✓
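
The pricing and quota rows above lend themselves to a quick budget check. The following is a minimal sketch rather than part of any Velma SDK: the dictionary keys, the `ModelPlan` class, and the `estimate_monthly_cost` helper are hypothetical names chosen for illustration, and only the numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPlan:
    name: str
    usd_per_hour: float
    monthly_quota_hours: int
    concurrency: int


# Figures copied from the transcription table; the keys are illustrative only.
PLANS = {
    "batch_english_fast": ModelPlan("Batch English Fast", 0.025, 10_000, 25),
    "batch_multilingual": ModelPlan("Batch Multilingual", 0.03, 1_000, 5),
    "streaming_multilingual": ModelPlan("Streaming Multilingual", 0.06, 1_000, 5),
}


def estimate_monthly_cost(plan_key: str, audio_hours: float) -> float:
    """Return the estimated cost in USD for a month's audio on one model.

    Raises ValueError if the volume is negative or exceeds the listed
    monthly usage quota for that model.
    """
    plan = PLANS[plan_key]
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    if audio_hours > plan.monthly_quota_hours:
        raise ValueError(
            f"{audio_hours} h exceeds the {plan.monthly_quota_hours} h "
            f"monthly quota listed for {plan.name}"
        )
    return round(audio_hours * plan.usd_per_hour, 2)
```

For example, `estimate_monthly_cost("batch_multilingual", 500)` multiplies 500 hours by the listed $0.03 per hour, while a request for 2,000 hours on the same model is rejected because the table lists a 1,000-hour quota.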

Deepfake Detection

Velma-2's synthetic voice detection models achieve state-of-the-art performance detecting speech deepfakes on single-speaker audio.

	Batch	Streaming
Description	Speech deepfake prediction for audio files with a single speaker	Real-time speech deepfake detection for streaming audio with a single speaker
Accuracy	98.9% average — #1 on Speech DF Arena	98.9% average — #1 on Speech DF Arena
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	Raw 16 kHz mono PCM (int16 LE)
Concurrency Quota	3 concurrent requests	3 concurrent requests
Monthly Usage Quota	1,000 hours	1,000 hours
Pricing	$0.25 / hour	$0.25 / hour
Built-in features
Per-Window Prediction	✓	✓
Confidence Scoring	✓	✓
Silence Detection	✓	✓
Flexible Chunk Size		✓
Real-Time		✓
First Prediction	After full file upload	After 500 ms of audio
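
The streaming column expects raw 16 kHz mono PCM in int16 little-endian form, with the first prediction arriving once 500 ms of audio has been received. The sketch below shows how audio could be prepared in that format and split into chunks. It deliberately stops short of the WebSocket call, since the endpoint and message framing are not described here. The function names and the 500 ms default chunk size are assumptions made for illustration.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000   # from the table: 16 kHz mono
BYTES_PER_SAMPLE = 2      # int16


def bytes_for_duration(ms: int) -> int:
    """Number of PCM bytes that hold `ms` milliseconds of 16 kHz mono int16 audio."""
    return SAMPLE_RATE_HZ * ms // 1000 * BYTES_PER_SAMPLE


def floats_to_pcm16le(samples: Sequence[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to int16 little-endian bytes."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_chunks(pcm: bytes, chunk_ms: int = 500) -> Iterator[bytes]:
    """Yield consecutive chunks of `chunk_ms` milliseconds; the last may be shorter."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    size = bytes_for_duration(chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

At these settings 500 ms of audio is 8,000 samples, or 16,000 bytes, so that is the least a client would need to send before a first prediction could be expected. Because the table lists flexible chunk sizes for streaming, `chunk_ms` is left as a parameter.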

PII/PHI Redaction

Velma-2 PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio, in batch or in real time, detects sensitive entities, and redacts them from both the transcript text and the audio. It returns a redacted transcript and a redacted MP3, and supports 100+ entity types.

	Batch	Streaming
Description	Transcribe an audio file and receive a redacted transcript and a redacted MP3 with PII/PHI ranges silenced	Real-time PII/PHI redaction via WebSocket: stream audio in and receive redacted transcript text and redacted MP3 clips as each utterance completes
API	REST API	WebSocket
Accepted Files	AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM	AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Pricing	$0.05 / hour	$0.08 / hour
Outputs
Redacted Transcript	✓	✓
Redacted Audio (MP3)	✓	✓
Redaction Ranges	Full list of silenced `[start_ms, end_ms]` ranges	Per-utterance timing with each audio clip
PII/PHI Categories Detected
Coverage	100+ PII/PHI entity types across all categories	100+ PII/PHI entity types across all categories
Names & Identifiers	Full names, usernames, dates of birth	Full names, usernames, dates of birth
Government IDs	SSN, passport, driver’s license	SSN, passport, driver’s license
Addresses & Locations	Street addresses, ZIP codes, cities	Street addresses, ZIP codes, cities
Contact Details	Phone numbers, email addresses	Phone numbers, email addresses
Financial Info	Credit card numbers, bank accounts	Credit card numbers, bank accounts
Medical / Health Info (PHI)	Diagnoses, medications, health record IDs	Diagnoses, medications, health record IDs
Built-In Features
Speaker Diarization	✓	✓
Configurable Redaction Padding	Adjustable silence buffer before and after each redacted range	Adjustable silence buffer before and after each redacted range
Real-Time		✓
First Result	After full file upload	At utterance completion
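
To make the interaction between redaction ranges and redaction padding concrete, the sketch below widens a list of `[start_ms, end_ms]` ranges by a silence buffer and merges any that end up overlapping. This illustrates the idea only and is not Velma's implementation: whether the ranges the API returns already include padding is not stated here, and `pad_and_merge`, `total_redacted_ms`, and their parameters are hypothetical names.

```python
from typing import List, Sequence


def pad_and_merge(
    ranges: Sequence[Sequence[int]],
    padding_ms: int,
    audio_len_ms: int,
) -> List[List[int]]:
    """Widen each [start_ms, end_ms] range by padding_ms on both sides,
    clamp to the audio bounds, and merge ranges that overlap or touch."""
    padded = sorted(
        (max(0, start - padding_ms), min(audio_len_ms, end + padding_ms))
        for start, end in ranges
    )
    merged: List[List[int]] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_redacted_ms(ranges: Sequence[Sequence[int]]) -> int:
    """Total silenced duration across non-overlapping ranges."""
    return sum(end - start for start, end in ranges)
```

With a 100 ms buffer, two detections at `[1000, 1500]` and `[1600, 2000]` grow to `[900, 1600]` and `[1500, 2100]`, which overlap and collapse into a single silenced span of `[900, 2100]`. A larger buffer lowers the chance of leaking the edges of a sensitive word, at the cost of silencing more of the surrounding speech.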
