Modulate

Velma Models

Velma handles transcription, emotion, accent, and engagement detection across 70+ languages. Choose the model that fits your use case.

Transcription

Model | Batch English Fast | Batch Multilingual | Streaming Multilingual
Description | High-throughput English batch processing at more than 200x real-time speed | Multilingual batch transcription in 70+ languages with the full feature set | Real-time streaming transcription in 70+ languages via WebSocket
API | REST API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Concurrency Quota | 25 concurrent requests | 5 concurrent requests | 5 concurrent requests
Monthly Usage Quota | 10,000 hours | 1,000 hours | 1,000 hours
Pricing | $0.025 / hour | $0.03 / hour | $0.06 / hour
Built-in features
Transcription
Auto Capitalization
Auto Punctuation
Language | English | 70+ languages | 70+ languages
Real-Time
Optional features
Speaker Diarization
Emotion Detection
Accent Identification
PII/PHI Tagging
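
The pricing and quota rows above lend themselves to a quick budget check. The sketch below is only an illustration: the dictionary keys and the `estimate` helper are hypothetical names, not part of any Velma API, and the calculation assumes flat per-hour pricing with no extra charge for optional features, which the table does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    price_per_hour: float      # USD, from the Pricing row
    monthly_quota_hours: int   # from the Monthly Usage Quota row
    max_concurrency: int       # from the Concurrency Quota row


# Values copied from the table above; the keys are illustrative labels only.
TIERS = {
    "batch-english-fast": Tier(0.025, 10_000, 25),
    "batch-multilingual": Tier(0.03, 1_000, 5),
    "streaming-multilingual": Tier(0.06, 1_000, 5),
}


def estimate(tier_name: str, audio_hours: float) -> tuple[float, bool]:
    """Return (estimated cost in USD, whether the volume fits the monthly quota)."""
    tier = TIERS[tier_name]
    cost = round(audio_hours * tier.price_per_hour, 2)
    return cost, audio_hours <= tier.monthly_quota_hours
```

For example, `estimate("batch-english-fast", 400)` gives a cost of 10.0 USD within quota, while 1,200 hours on the streaming model would exceed its 1,000-hour monthly quota.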

Deepfake Detection

Velma-2's synthetic voice detection models achieve state-of-the-art performance detecting speech deepfakes on single-speaker audio.

Mode | Batch | Streaming
Description | Speech deepfake prediction for audio files with a single speaker | Real-time speech deepfake detection for streaming audio with a single speaker
Accuracy | 98.9% average (#1 on Speech DF Arena) | 98.9% average (#1 on Speech DF Arena)
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | Raw 16kHz mono PCM (int16 LE)
Concurrency Quota | 3 concurrent requests | 3 concurrent requests
Monthly Usage Quota | 1,000 hours | 1,000 hours
Pricing | $0.25 / hour | $0.25 / hour
Built-in features
Per-Window Prediction
Confidence Scoring
Silence Detection
Flexible Chunk Size
Real-Time
First Prediction | After full file upload | At 500ms of audio
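
The streaming input format (raw 16kHz mono PCM, int16 little-endian) and the 500ms first-prediction point translate into simple byte arithmetic. The following is a minimal sketch of preparing such audio on the client side. The helper names are hypothetical, and whether the service expects chunks of exactly this size is an assumption; the table only states that chunk size is flexible.

```python
import struct

SAMPLE_RATE_HZ = 16_000      # from the streaming Accepted Files row
BYTES_PER_SAMPLE = 2         # int16
FIRST_PREDICTION_MS = 500    # from the First Prediction row


def bytes_for_ms(duration_ms: int) -> int:
    """Number of PCM bytes that hold duration_ms of 16kHz mono int16 audio."""
    samples = (SAMPLE_RATE_HZ * duration_ms) // 1000
    return samples * BYTES_PER_SAMPLE


def floats_to_pcm16le(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian int16 PCM bytes."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = FIRST_PREDICTION_MS):
    """Yield consecutive chunks of chunk_ms each (the last one may be shorter)."""
    size = bytes_for_ms(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Under these assumptions, 500ms of audio is 8,000 samples, or 16,000 bytes, so one second of audio splits into two chunks.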

PII/PHI Redaction

Velma-2 PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio in real time, then detects sensitive entities and redacts them from both the transcript text and the audio. It returns a redacted MP3 and a redacted transcript, and supports 100+ entity types.

Mode | Batch | Streaming
Description | Transcribe an audio file and receive a redacted transcript and a redacted MP3 with PII/PHI ranges silenced | Real-time PII/PHI redaction via WebSocket: stream audio in and receive redacted transcript text and redacted MP3 clips as each utterance completes
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Pricing | $0.05 / hour | $0.08 / hour
Outputs
Redacted Transcript
Redacted Audio (MP3)
Redaction Ranges | Full list of silenced [start_ms, end_ms] ranges | Per-utterance timing with each audio clip
PII/PHI Categories Detected
Coverage | 100+ PII/PHI entity types across all categories | 100+ PII/PHI entity types across all categories
Names & Identifiers | Full names, usernames, dates of birth | Full names, usernames, dates of birth
Government IDs | SSN, passport, driver’s license | SSN, passport, driver’s license
Address & Locations | Street addresses, ZIP codes, cities | Street addresses, ZIP codes, cities
Contact Details | Phone numbers, email addresses | Phone numbers, email addresses
Financial Info | Credit card numbers, bank accounts | Credit card numbers, bank accounts
Medical / Health Info (PHI) | Diagnoses, medications, health record IDs | Diagnoses, medications, health record IDs
Built-in features
Speaker Diarization
Configurable Redaction Padding | Adjustable silence buffer before and after each redacted range | Adjustable silence buffer before and after each redacted range
Real-Time
First Result | After full file upload | At utterance completion
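
To make the idea of redaction padding concrete, here is a client-side sketch of how a silence buffer could widen [start_ms, end_ms] ranges and how overlapping ranges could then be merged. This is an illustration only: `pad_and_merge` and its parameters are hypothetical, and how the service itself applies padding is not documented here.

```python
def pad_and_merge(ranges, padding_ms, audio_length_ms):
    """Widen each [start_ms, end_ms] range by padding_ms on both sides,
    clamp to the audio bounds, and merge ranges that overlap or touch."""
    padded = sorted(
        (max(0, start - padding_ms), min(audio_length_ms, end + padding_ms))
        for start, end in ranges
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

With 100ms of padding on a 5,250ms clip, the ranges [1000, 1500], [1600, 2000], and [5000, 5200] become [900, 2100] and [4900, 5250]: the first two merge once padded, and the last is clamped to the end of the audio.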