Velma Models
Velma handles transcription, emotion, accent, and engagement detection across 70+ languages. Choose the model that fits your use case.
| | Batch English Fast | Batch Multilingual | Streaming Multilingual |
|---|---|---|---|
| Description | High-throughput English batch processing with >200x real-time speed | Multilingual batch transcription in 70+ languages with full feature set | Real-time streaming transcription in 70+ languages via WebSocket |
| API | REST API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | ||
| Concurrency Quota | 25 concurrent requests | 5 concurrent requests | 5 concurrent requests |
| Monthly Usage Quota | 10,000 hours | 1,000 hours | 1,000 hours |
| Pricing | $0.025 / hour | $0.03 / hour | $0.06 / hour |
| Built-in features | | | |
| Transcription | ✓ | ✓ | ✓ |
| Auto Capitalization | ✓ | ✓ | ✓ |
| Auto Punctuation | ✓ | ✓ | ✓ |
| Language | English | 70+ languages | 70+ languages |
| Real-Time | | | ✓ |
| Optional features | | | |
| Speaker Diarization | ✓ | ✓ | |
| Emotion Detection | ✓ | ✓ | |
| Accent Identification | ✓ | ✓ | |
| PII/PHI Tagging | ✓ | ✓ | |
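
The pricing and quota rows above lend themselves to a quick budget check. The following is a minimal sketch that uses only the per-hour prices, monthly quotas, and concurrency limits listed in the table; the `Plan` class, the dictionary keys, and the `estimate` function are hypothetical names introduced for illustration and are not part of any Velma SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One column of the comparison table (hypothetical helper type)."""
    name: str
    usd_per_hour: float
    monthly_quota_hours: int
    max_concurrency: int


# Values copied from the table; the keys are illustrative, not API identifiers.
PLANS = {
    "batch_english_fast": Plan("Batch English Fast", 0.025, 10_000, 25),
    "batch_multilingual": Plan("Batch Multilingual", 0.03, 1_000, 5),
    "streaming_multilingual": Plan("Streaming Multilingual", 0.06, 1_000, 5),
}


def estimate(plan_key: str, audio_hours: float) -> tuple[float, bool]:
    """Return (cost in USD, whether the volume fits the monthly quota)."""
    plan = PLANS[plan_key]
    cost = round(audio_hours * plan.usd_per_hour, 2)
    return cost, audio_hours <= plan.monthly_quota_hours


if __name__ == "__main__":
    for key in PLANS:
        cost, fits = estimate(key, 400)
        print(f"{PLANS[key].name}: ${cost:.2f} for 400 h, within quota: {fits}")
```

This treats the quota as a simple monthly ceiling on audio hours; how usage is actually metered or rounded is not stated on this page.
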
Velma-2's synthetic voice detection models achieve state-of-the-art performance detecting speech deepfakes on single-speaker audio.
| | Batch | Streaming |
|---|---|---|
| Description | Speech deepfake prediction for audio files with a single speaker | Real-time speech deepfake detection for streaming audio with a single speaker |
| Accuracy | 98.9% average — #1 on Speech DF Arena | 98.9% average — #1 on Speech DF Arena |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw 16kHz mono PCM (int16 LE) |
| Concurrency Quota | 3 concurrent requests | 3 concurrent requests |
| Monthly Usage Quota | 1,000 hours | 1,000 hours |
| Pricing | $0.25 / hour | $0.25 / hour |
| Built-in features | | |
| Per-Window Prediction | ✓ | ✓ |
| Confidence Scoring | ✓ | ✓ |
| Silence Detection | ✓ | ✓ |
| Flexible Chunk Size | ✓ | |
| Real-Time | | ✓ |
| First Prediction | After full file upload | At 500ms of audio |
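
The streaming column expects raw 16 kHz mono PCM as little-endian int16, and reports a first prediction at 500 ms of audio. As a rough sketch of what that format implies, the code below converts float samples in the range [-1.0, 1.0] to int16 LE bytes and slices them into 500 ms chunks (8,000 samples, or 16,000 bytes). The function names and the choice of 500 ms chunks are assumptions for illustration; the WebSocket message framing itself is not described here and is not shown.

```python
import struct
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000   # from the table: 16 kHz mono
BYTES_PER_SAMPLE = 2      # int16


def floats_to_pcm16le(samples: Sequence[float]) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack them as little-endian int16."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 500) -> Iterator[bytes]:
    """Yield consecutive chunks of chunk_ms milliseconds (the last may be shorter)."""
    chunk_bytes = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    one_second_of_silence = floats_to_pcm16le([0.0] * SAMPLE_RATE_HZ)
    for i, chunk in enumerate(chunk_pcm(one_second_of_silence)):
        print(f"chunk {i}: {len(chunk)} bytes")
```
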
Velma-2 PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio, in batch or in real time, then detects and redacts sensitive entities from both the transcript text and the audio. It returns a redacted MP3 and a redacted transcript, with 100+ entity types supported.
| | Batch | Streaming |
|---|---|---|
| Description | Transcribe an audio file and receive a redacted transcript and redacted MP3 with PII/PHI ranges silenced | Real-time PII/PHI redaction via WebSocket — streams audio in, receives redacted transcript text and redacted MP3 clips as each utterance completes |
| API | REST API | WebSocket |
| Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats |
| Pricing | $0.05 / hour | $0.08 / hour |
| Outputs | | |
| Redacted Transcript | ✓ | ✓ |
| Redacted Audio (MP3) | ✓ | ✓ |
| Redaction Ranges | Full list of silenced [start_ms, end_ms] ranges | Per-utterance timing with each audio clip |
| PII/PHI Categories Detected | | |
| Coverage | 100+ PII/PHI entity types across all categories | 100+ PII/PHI entity types across all categories |
| Names & Identifiers | Full names, usernames, dates of birth | Full names, usernames, dates of birth |
| Government IDs | SSN, passport, driver’s license | SSN, passport, driver’s license |
| Address & Locations | Street addresses, ZIP codes, cities | Street addresses, ZIP codes, cities |
| Contact Details | Phone numbers, email addresses | Phone numbers, email addresses |
| Financial Info | Credit card numbers, bank accounts | Credit card numbers, bank accounts |
| Medical / Health Info (PHI) | Diagnoses, medications, health record IDs | Diagnoses, medications, health record IDs |
| Built-in features | | |
| Speaker Diarization | ✓ | ✓ |
| Configurable Redaction Padding | Adjustable silence buffer before and after each redacted range | Adjustable silence buffer before and after each redacted range |
| Real-Time | | ✓ |
| First Result | After full file upload | At utterance completion |
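
To make the interaction between redaction ranges and configurable padding concrete, here is a small sketch of one plausible way padded `[start_ms, end_ms]` ranges could be computed on the client side, for example to preview which parts of a recording would be silenced. The function `pad_and_merge` and its parameters are hypothetical, and the service's own padding logic may differ; this only illustrates the idea of widening each range by a buffer, clamping to the audio length, and merging overlaps.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # (start_ms, end_ms)


def pad_and_merge(
    ranges: Iterable[Range],
    pad_before_ms: int,
    pad_after_ms: int,
    duration_ms: int,
) -> List[Range]:
    """Widen each range by the padding, clamp to [0, duration_ms], merge overlaps."""
    padded = sorted(
        (max(0, start - pad_before_ms), min(duration_ms, end + pad_after_ms))
        for start, end in ranges
    )
    merged: List[Range] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    detected = [(1000, 1500), (1600, 2000), (9800, 9950)]
    print(pad_and_merge(detected, 100, 100, duration_ms=10_000))
```

With 100 ms of padding on each side, the first two detected ranges in the example overlap after widening and collapse into a single silenced span, while the last one is clamped at the end of the recording.
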