Modulate

Model Pricing

Understand voice and audio through a suite of leading audio-native AI models, with options for latency-sensitive real-time applications, large-scale batch uploads, and anything in between.

Velma Triage

Our flagship conversation-understanding audio-native model. It analyzes the call itself, not just a transcript.

Its core feature is behaviors: describe what to watch for in plain language, and Velma flags each instance with a confidence score, its reasoning, and the clips it fired on. It ships with 146 editable templates spanning safety, compliance, and customer experience.

Every call also returns a diarized transcript, conversation type, speaker roles, topics with sentiment, and a summary.

Model Type | Batch | Streaming
Description | Full conversation understanding for recordings: one REST call returns a diarized transcript, conversation type, speaker roles, topics, sentiment, a summary, and every behavior you configure | Real-time conversation understanding over WebSocket: clips, behavior detections, and aggregate results are emitted progressively as audio streams in
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Concurrency Quota | 5 concurrent requests | 5 concurrent requests
Monthly Usage Quota | 1,000 hours | 1,000 hours
Pricing | $1.25 / hour | $1.25 / hour
Dynamic Behaviors | Batch | Streaming
Fully Customizable | Define exactly what Velma should detect in plain language (a name and a description) and scope each behavior to specific conversation types or speaker roles
Behavior Templates | 146 ready-made behaviors to get you started, spanning safety, compliance, and customer experience; use any as-is or as a starting point for your own
Confidence Scoring
Reasoning per Detection
Evidence Clips | Every detection links the clips it fired on, with a single definitive clip
Returned on Every Call | Batch | Streaming
Diarized Transcription
Word-Level Timestamps
Language | Detected per clip; optionally set a language hint for the conversation
Conversation Type | Inferred for the session with a confidence score and reasoning, or set your own default
Participant Roles | Per-speaker role inference (e.g. agent vs. caller) with confidence and reasoning
Topics
Topic Sentiment | Per-speaker sentiment for each topic, scored from −1 to 1
Conversation Summary
Optional Signals | Batch | Streaming
Emotion Detection
Accent Identification
Deepfake Score | Per-clip synthetic-voice score from 0 to 1
PII/PHI Tagging

English Fast Transcription

English-only transcription optimized for speed. The batch API trades enrichment features for the lowest possible turnaround, and the streaming WebSocket emits rolling partial transcripts every ~1.5 seconds — built for live captions, voice assistants, and high-throughput pipelines.

Model Type | Batch | Streaming
Description | High-throughput English batch processing with >200x real-time speed | Low-latency English streaming transcription via WebSocket with real-time partial transcripts
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Concurrency Quota | 25 concurrent requests | 5 concurrent requests
Monthly Usage Quota | 10,000 hours | 10,000 hours
Pricing | $0.025 / hour | $0.05 / hour
Built-in features | Batch | Streaming
Transcription
Auto Capitalization
Auto Punctuation
Language | English | English
Real-Time
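
The concurrency quotas above cap how many requests may be in flight at once (25 for batch and 5 for streaming in this table). One way a client might stay under such a cap is an asyncio semaphore. The following is a minimal sketch under stated assumptions: `fake_transcribe` is a hypothetical stand-in for a real upload call and is not part of any Modulate SDK, and the helper names are illustrative only.

```python
import asyncio

BATCH_CONCURRENCY_QUOTA = 25  # batch quota from the table above; adjust per model


async def fake_transcribe(path: str) -> str:
    """Hypothetical stand-in for a real batch request (assumption)."""
    await asyncio.sleep(0.01)  # simulate network and processing time
    return f"transcript for {path}"


async def run_bounded(paths, limit=BATCH_CONCURRENCY_QUOTA, worker=fake_transcribe):
    """Run worker(path) for every path with at most `limit` calls in flight.

    Returns (results, peak), where peak is the highest number of
    simultaneous calls observed, so the cap can be checked.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(path):
        nonlocal in_flight, peak
        async with semaphore:  # waits while `limit` calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await worker(path)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(p) for p in paths))
    return results, peak


if __name__ == "__main__":
    files = [f"call_{i}.wav" for i in range(100)]
    transcripts, peak = asyncio.run(run_bounded(files))
    print(len(transcripts), "files processed; peak concurrency:", peak)
```

Passing `limit=5` would match the streaming quota instead. Results come back in the same order as the input paths because `asyncio.gather` preserves ordering.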

Multilingual Transcription

Transcription in 70+ languages with per-utterance timing, speaker diarization, and optional emotion, accent, deepfake, and PII/PHI signals — available as a batch REST API and a real-time streaming WebSocket.

Model Type | Batch | Streaming
Description | Multilingual batch transcription in 70+ languages with full feature set | Real-time streaming transcription in 70+ languages via WebSocket
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Concurrency Quota | 5 concurrent requests | 5 concurrent requests
Monthly Usage Quota | 1,000 hours | 1,000 hours
Pricing | $0.03 / hour | $0.06 / hour
Built-in features | Batch | Streaming
Transcription
Auto Capitalization
Auto Punctuation
Language | 70+ languages | 70+ languages
Real-Time
Optional features | Batch | Streaming
Speaker Diarization
Emotion Detection
Accent Identification
PII/PHI Tagging | $0.02 / hour | $0.02 / hour
Deepfake Detection | $0.25 / hour | $0.25 / hour
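
To see how the base and optional-feature prices combine, the arithmetic can be sketched as below. The rates are copied from the tables above; the assumptions that add-on rates simply add to the base rate and that billing is linear in audio duration (no minimums or rounding steps) are mine and are not stated on this page. The function and dictionary names are illustrative.

```python
from decimal import Decimal

# Per-hour rates in USD, copied from the Multilingual Transcription tables.
BASE_RATE = {"batch": Decimal("0.03"), "streaming": Decimal("0.06")}
ADDON_RATE = {
    "pii_phi_tagging": Decimal("0.02"),
    "deepfake_detection": Decimal("0.25"),
}


def estimate_cost(hours, mode="batch", addons=()):
    """Estimate the USD cost for `hours` of audio.

    Assumes add-on rates are added to the base rate and that billing is
    linear in duration; both are assumptions, not documented behavior.
    """
    rate = BASE_RATE[mode] + sum((ADDON_RATE[a] for a in addons), Decimal("0"))
    return (Decimal(str(hours)) * rate).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # 500 hours of plain batch transcription
    print(estimate_cost(500, "batch"))
    # 500 hours of streaming with both paid add-ons
    print(estimate_cost(500, "streaming", ["pii_phi_tagging", "deepfake_detection"]))
```

Under these assumptions, 500 hours of streaming with both paid add-ons works out to 500 × (0.06 + 0.02 + 0.25) = $165.00, so the deepfake add-on dominates the bill.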

Deepfake Detection

The Deepfake Detection models detect speech deepfakes in single-speaker audio, with 98.9% average accuracy (ranked #1 on Speech DF Arena).

Model Type | Batch | Streaming
Description | Speech deepfake prediction for audio files with a single speaker | Real-time speech deepfake detection for streaming audio with a single speaker
Accuracy | 98.9% average (#1 on Speech DF Arena) | 98.9% average (#1 on Speech DF Arena)
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | WAV, AIFF, FLAC, MP3, AAC, OGG, WebM, and raw PCM (various bit depths and encodings)
Concurrency Quota | 3 concurrent requests | 3 concurrent requests
Monthly Usage Quota | 1,000 hours | 1,000 hours
Pricing | $0.25 / hour | $0.25 / hour
Built-in features | Batch | Streaming
Per-Window Prediction
Confidence Scoring
Silence Detection
Flexible Chunk Size
Real-Time
First Prediction | After full file upload | At 500ms of audio
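
Because the streaming model accepts raw PCM in flexible chunk sizes and produces its first prediction at 500ms of audio, a client might slice its PCM buffer into 500ms chunks before sending. The sketch below shows only the byte arithmetic; the 16 kHz, 16-bit mono defaults are hypothetical choices for illustration, and the page does not say which PCM parameters are expected or how chunks are framed on the WebSocket.

```python
def pcm_chunks(pcm: bytes, sample_rate=16000, sample_width=2, channels=1, chunk_ms=500):
    """Yield successive chunks of raw PCM, each covering chunk_ms of audio.

    The final chunk may be shorter. Defaults (16 kHz, 16-bit, mono) are
    illustrative assumptions, not documented requirements.
    """
    frame_bytes = sample_width * channels  # bytes per sample frame
    chunk_bytes = (sample_rate * chunk_ms // 1000) * frame_bytes
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    one_point_two_seconds = bytes(16000 * 2 * 12 // 10)  # 1.2 s of silence
    for i, chunk in enumerate(pcm_chunks(one_point_two_seconds)):
        print(f"chunk {i}: {len(chunk)} bytes")
```

Slicing on whole-frame boundaries keeps each chunk aligned to complete samples, which matters for multi-byte or multi-channel PCM.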

PII/PHI Redaction

PII/PHI Redaction (Personally Identifiable Information / Protected Health Information) transcribes audio in batch or in real time, detects sensitive entities, and redacts them from both the transcript text and the audio. It returns a redacted transcript and a clean MP3, with 100+ entity types supported.

Model Type | Batch | Streaming
Description | Transcribe an audio file and receive a redacted transcript and redacted MP3 with PII/PHI ranges silenced | Real-time PII/PHI redaction via WebSocket: streams audio in, receives redacted transcript text and redacted MP3 clips as each utterance completes
API | REST API | WebSocket
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM + raw PCM formats
Pricing | $0.05 / hour | $0.08 / hour
Outputs | Batch | Streaming
Redacted Transcript
Redacted Audio (MP3)
Redaction Ranges | Full list of silenced [start_ms, end_ms] ranges | Per-utterance timing with each audio clip
PII/PHI Categories Detected | Batch | Streaming
Coverage | 100+ PII/PHI entity types across all categories
Names & Identifiers | Full names, usernames, dates of birth
Government IDs | SSN, passport, driver's license
Addresses & Locations | Street addresses, ZIP codes, cities
Contact Details | Phone numbers, email addresses
Financial Info | Credit card numbers, bank accounts
Medical / Health Info (PHI) | Diagnoses, medications, health record IDs
Built-in features | Batch | Streaming
Speaker Diarization
Configurable Redaction Padding | Adjustable silence buffer before and after each redacted range
Real-Time
First Result | After full file upload | At utterance completion
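
To illustrate what a silence buffer around each redacted range implies, the sketch below widens a list of [start_ms, end_ms] ranges by a padding value, clamps them to the audio bounds, and merges any that end up overlapping. This is only a client-side illustration of the idea: how the service itself applies padding is not described on this page, and `pad_and_merge` is a hypothetical helper name.

```python
def pad_and_merge(ranges, padding_ms, duration_ms):
    """Widen each [start_ms, end_ms] range by padding_ms on both sides,
    clamp to [0, duration_ms], and merge ranges that overlap or touch.

    Illustrative only; the service's own padding behavior is an assumption.
    """
    padded = sorted(
        (max(0, start - padding_ms), min(duration_ms, end + padding_ms))
        for start, end in ranges
    )
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


if __name__ == "__main__":
    detected = [[1200, 1800], [1900, 2500], [9000, 9800]]
    print(pad_and_merge(detected, padding_ms=100, duration_ms=10000))
```

With 100ms of padding, the first two ranges in the example grow into each other and collapse into one silenced span, which is why a larger buffer can reduce the number of ranges while increasing the total silenced time.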

Music & Speech Detection

The Music & Speech Detection models classify audio at the frame level (~192ms resolution), returning independent music and speech probabilities for every frame. Useful for content moderation, ad-break detection, podcast segmentation, and audio analytics — available as both a low-latency batch API and a real-time streaming WebSocket.

Model Type | Batch | Streaming
Description | Frame-level music and speech classification for audio files, with a primary label and percentage breakdown across the clip | Real-time frame-level music and speech classification via WebSocket: frames are emitted progressively as audio streams in
API | REST API | WebSocket
Accepted Files | AAC, FLAC, M4A, MP3, MP4, OGG, Opus, WAV | AAC, AIFF, FLAC, MP3, OGG, WAV, WebM, and raw PCM (various bit depths and encodings)
Concurrency Quota | 3 concurrent requests | 3 concurrent requests
Monthly Usage Quota | 1,000 hours | 1,000 hours
Pricing | $0.02 / hour | $0.02 / hour
Built-in features | Batch | Streaming
Music Detection
Speech Detection
Frame-Level Output | Per-frame probabilities at ~192ms resolution
Non-Exclusive Labels | Music and speech probabilities are independent; both can be high simultaneously (e.g. music with vocals)
Primary Label | Overall clip classification: music, speech, or neither
Percentage Breakdown | Percentage of clip where music / speech probability ≥ 0.5
Flexible Chunk Size | Send audio in any chunk size that suits your pipeline
Real-Time
First Result | After full file upload and processing | After first ~192ms of audio
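
The percentage breakdown described above can be reproduced from per-frame probabilities, which may be useful when consuming the streaming output frame by frame. The sketch below counts frames whose music or speech probability is at least 0.5. The rule used here to pick the primary label (the larger share wins, and "neither" applies when both shares are under 50%) is an assumption for illustration; the page does not document how the service chooses it. The input shape, a list of (music, speech) pairs, is also hypothetical.

```python
THRESHOLD = 0.5  # probability cutoff quoted in the Percentage Breakdown row


def summarize(frames):
    """Summarize a list of (music_prob, speech_prob) pairs, one per frame.

    Returns the percentage of frames at or above THRESHOLD for each class
    and a primary label chosen by an assumed (undocumented) rule.
    """
    if not frames:
        return {"music_pct": 0.0, "speech_pct": 0.0, "primary_label": "neither"}
    total = len(frames)
    music_pct = 100.0 * sum(m >= THRESHOLD for m, _ in frames) / total
    speech_pct = 100.0 * sum(s >= THRESHOLD for _, s in frames) / total
    # Assumed rule: larger share wins; "neither" if both are under 50%.
    if max(music_pct, speech_pct) < 50.0:
        label = "neither"
    elif music_pct >= speech_pct:
        label = "music"
    else:
        label = "speech"
    return {"music_pct": music_pct, "speech_pct": speech_pct, "primary_label": label}


if __name__ == "__main__":
    # Four frames; the second has music with vocals (both probabilities high).
    print(summarize([(0.9, 0.1), (0.8, 0.7), (0.7, 0.2), (0.1, 0.1)]))
```

Because the two probabilities are independent, the percentages need not sum to 100: a song with vocals can count toward both.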

Language Detection

Language Detection identifies the spoken language of an audio file from 100 supported languages, returning a human-readable language name, an ISO 639-1 code, and a confidence score in a single synchronous request. Only the first 30 seconds of audio are analyzed.

Description | Identify the spoken language of an audio file in a single synchronous request; returns the language name, its ISO code, and a confidence score
API | REST API
Accepted Files | AAC, AIFF, FLAC, MOV, MP3, MP4, OGG, Opus, WAV, WebM
Maximum File Size | 100 MB
Analysis Window | First 30 seconds of audio (longer files are accepted, but the additional audio is ignored)
Language Coverage | Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
Concurrency Quota | 25 concurrent requests
Monthly Usage Quota | 10,000 hours
Pricing | $0.01 / hour
Outputs
Detected Language | Human-readable name (e.g. English, French, Mandarin)
Language Code | Lowercase ISO 639-1 code (e.g. en, fr, zh)
Confidence Score | Probability from 0.0 to 1.0 for the predicted language
Audio Duration | Total decoded audio duration in milliseconds
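
Since only the first 30 seconds are analyzed and files are capped at 100 MB, a client could trim long recordings locally before uploading. The sketch below does this for uncompressed PCM WAV files using Python's standard `wave` module. It is an optional client-side step under my own assumptions: the page does not say whether trimming affects billing, and `trim_wav` is a hypothetical helper name.

```python
import wave

ANALYSIS_WINDOW_S = 30  # only the first 30 seconds are analyzed


def trim_wav(src_path, dst_path, seconds=ANALYSIS_WINDOW_S):
    """Copy at most the first `seconds` of a PCM WAV file to dst_path.

    Returns the number of sample frames written. Works only for WAV files
    that the standard-library `wave` module can read.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_to_keep = min(src.getnframes(), int(seconds * src.getframerate()))
        data = src.readframes(frames_to_keep)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(data)
    return frames_to_keep


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        kept = trim_wav(sys.argv[1], sys.argv[2])
        print(f"wrote {kept} frames to {sys.argv[2]}")
```

Other accepted formats (MP3, FLAC, and so on) would need a different tool for trimming; this sketch covers only the WAV case.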