Mistral AI launched Voxtral Transcribe 2 on February 4, 2026. This speech-to-text model family runs directly on your device. You get accurate transcription without sending audio to remote servers.
The system includes two models. Voxtral Mini Transcribe V2 handles batch processing. Voxtral Realtime processes live audio with under 200 milliseconds of delay. Both models support 13 languages and work in healthcare, finance, and defense industries where data privacy matters.
This technology solves a key problem. Most speech recognition systems send your audio to cloud servers. Voxtral keeps everything local. Your voice data stays on your laptop, phone, or smartwatch. Companies in regulated industries can now use AI transcription without privacy risks.
The Problem With Cloud-Based Transcription
Traditional speech-to-text services require internet connections. Your audio files travel to remote servers. This creates three problems.
First, you lose control of sensitive data. Healthcare conversations, financial discussions, and legal meetings contain private information. Sending this data to third-party servers creates compliance risks.
Second, you depend on network availability. Poor internet connections cause delays or failures. Cloud services can experience outages that stop your work.
Third, costs add up quickly. Cloud transcription services charge per minute. Large volumes of audio become expensive. Meeting recordings, customer calls, and interviews create substantial bills.
How Voxtral Transcribe 2 Works
Voxtral uses a streaming architecture. The model processes audio as it arrives instead of waiting for complete files. This design enables real-time performance.
The Realtime model contains 4 billion parameters. This includes a 3.4 billion parameter language model and a 0.6 billion parameter audio encoder. Both components use sliding window attention, which keeps memory use bounded so streams can run indefinitely.
The audio encoder processes sound using causal attention. It can only look at past audio, not future sounds. This enables true real-time operation. The model produces transcriptions 80 milliseconds after hearing each sound.
You can adjust transcription delay from 80 milliseconds to 2.4 seconds. Lower delays give faster responses. Higher delays improve accuracy. At 480 milliseconds, the model matches offline transcription quality.
Two Models For Different Needs
| Feature | Mini Transcribe V2 | Realtime |
|---|---|---|
| Purpose | Batch processing | Live transcription |
| Latency | Standard | 80ms to 2.4s (configurable) |
| Price | $0.003/minute | $0.006/minute |
| License | Proprietary API | Apache 2.0 open-weights |
| Speaker ID | Yes | No |
| Context Biasing | Up to 100 phrases | Not available |
| Word Timestamps | Yes | Yes |
| Deployment | API only | On-device or cloud |
Voxtral Mini Transcribe V2 excels at batch transcription. You upload pre-recorded files through the Mistral API. The model returns text with speaker labels and timestamps. It identifies who said what and when.
This model includes context biasing. You provide up to 100 specialized words or phrases. The system learns correct spellings for names, technical terms, and industry jargon. This feature works best in English but supports other languages experimentally.
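As a minimal sketch, here is a helper that deduplicates a bias list and enforces the 100-phrase cap before joining it into the comma-separated string used by the API example later in this article. The function name and validation behavior are illustrative, not part of any Mistral SDK:

```python
def prepare_context_bias(phrases, limit=100):
    """Deduplicate, trim, and cap a context-bias list.

    The 100-phrase limit comes from the Voxtral documentation; the
    comma-joined output mirrors the curl example in this article.
    """
    seen = []
    for phrase in phrases:
        cleaned = phrase.strip()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    if len(seen) > limit:
        raise ValueError(f"context bias supports at most {limit} phrases, got {len(seen)}")
    return ",".join(seen)
```

Curating the list up front keeps you under the cap and avoids sending duplicate or empty phrases that waste slots.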
Voxtral Realtime focuses on speed. It transcribes audio as people speak. The model runs on your hardware under an Apache 2.0 license. You can download the weights from Hugging Face and deploy them anywhere.
Accuracy Benchmarks
Mistral tested both models on FLEURS, a multilingual speech benchmark. Voxtral Mini Transcribe V2 achieved approximately 4% word error rate across the top 10 languages.
| Competitor | Word Error Rate (FLEURS) |
|---|---|
| Voxtral Mini Transcribe V2 | ~4% |
| GPT-4o mini Transcribe | Higher than Voxtral |
| Gemini 2.5 Flash | Higher than Voxtral |
| Assembly Universal | Higher than Voxtral |
| Deepgram Nova | Higher than Voxtral |
The Realtime model performs differently based on delay settings. At 480 milliseconds, it matches leading offline models and real-time APIs. At 2.4 seconds, it reaches the same accuracy as Mini Transcribe V2.
Voxtral processes audio three times faster than ElevenLabs Scribe v2, at one-fifth the cost of major competitors. This combination of speed, accuracy, and price makes Voxtral competitive.
Supported Languages
Both models work with 13 languages:
- English
- Chinese (Mandarin)
- Hindi
- Spanish
- Arabic
- French
- Portuguese
- Russian
- German
- Japanese
- Korean
- Italian
- Dutch
Mistral claims non-English performance exceeds that of competitors. The models handle accents and regional variations. Testing shows strong results across European and Asian languages.
Speaker Diarization Explained
Diarization identifies different speakers in audio recordings. Voxtral Mini Transcribe V2 labels each speaker and marks when they start and stop talking.
The output looks like this:
[00:00:15 - 00:00:23] Speaker A: "We need to review the quarterly results."
[00:00:24 - 00:00:31] Speaker B: "I have the numbers ready for you."
This feature works for meetings, interviews, and customer calls. You get clean attribution without manual editing. The model handles most scenarios but transcribes overlapping speech as a single speaker.
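If you post-process transcripts, the bracketed format shown above parses easily. A sketch assuming that exact layout (the API itself returns structured JSON, so treat this as illustrative):

```python
import re

# Matches lines like: [00:00:15 - 00:00:23] Speaker A: "Some text."
DIARIZED_LINE = re.compile(
    r'\[(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2})\]\s*'
    r'(Speaker \w+):\s*"(.*)"'
)

def parse_diarized_line(line):
    """Parse one diarized line into (start_s, end_s, speaker, text), or None."""
    m = DIARIZED_LINE.match(line.strip())
    if not m:
        return None
    h1, m1, s1, h2, m2, s2, speaker, text = m.groups()
    start = int(h1) * 3600 + int(m1) * 60 + int(s1)
    end = int(h2) * 3600 + int(m2) * 60 + int(s2)
    return start, end, speaker, text
```

Structured tuples like these feed directly into search indexes, per-speaker word counts, or compliance reports.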
Mistral also benchmarked diarization on FLEURS recordings. Voxtral achieved the lowest diarization error rate among the models tested, meaning fewer mistakes in identifying who spoke.
Privacy-First Architecture
Voxtral Realtime runs entirely on your device. Audio never leaves your hardware. This design supports GDPR and HIPAA compliance.
Healthcare providers can transcribe patient conversations locally. Financial institutions process sensitive calls without cloud transmission. Defense contractors maintain security clearances while using speech recognition.
The model requires minimal resources. A single GPU with 16GB memory runs Voxtral in real time. Laptops and workstations handle the processing. Some implementations work on smartphones.
This approach contrasts with cloud services. Companies avoid data sovereignty concerns. You control where transcriptions happen and who accesses them.
Enterprise Use Cases
Customer Service Centers
Voxtral Realtime transcribes calls as they happen. Support agents see text appear on screens while customers speak. The system can pull up relevant information before customers finish explaining problems.
This reduces interaction time. Agents solve issues in two exchanges instead of multiple back-and-forth conversations. Customers get faster resolutions.
Meeting Intelligence
Upload recordings to Voxtral Mini Transcribe V2. Receive transcripts with speaker labels and timestamps. The system identifies action items and key discussions.
Context biasing ensures accurate transcription of employee names, product terms, and company-specific vocabulary. Teams get searchable meeting archives without manual note-taking.
Live Subtitling
Broadcast and media companies use Voxtral Realtime for subtitles. The sub-200 millisecond latency keeps text synchronized with video. Viewers see captions appear immediately.
The system handles technical jargon through context biasing. Sports terminology, scientific language, and specialized fields receive accurate transcription.
Compliance and Audits
Financial and healthcare organizations need transcription records. Voxtral provides word-level timestamps and speaker identification. Compliance teams can verify conversations and create audit trails.
On-device processing keeps sensitive information secure. No third party accesses recorded conversations. Organizations meet regulatory requirements while using AI technology.
Getting Started With Voxtral
Using the API
Voxtral Mini Transcribe V2 works through the Mistral API. Send audio files with a simple curl command:
curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
-H "Authorization: Bearer $MISTRAL_API_KEY" \
-F model="voxtral-mini-latest" \
-F file=@"your-audio.m4a" \
-F diarize=true \
-F context_bias="Datasette,WebAssembly" \
-F timestamp_granularities="segment"
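The same request can be sketched in Python. The field names below are copied from the curl command above; the helper names are mine, and you should confirm parameters against the Mistral API reference:

```python
import os

API_URL = "https://api.mistral.ai/v1/audio/transcriptions"

def transcription_fields(bias_phrases=(), granularity="segment", diarize=True):
    """Build the multipart form fields for the endpoint above.

    Field names mirror the curl example in this article; consult the
    Mistral API reference for the authoritative parameter list.
    """
    fields = {
        "model": "voxtral-mini-latest",
        "timestamp_granularities": granularity,
    }
    if diarize:
        fields["diarize"] = "true"
    if bias_phrases:
        fields["context_bias"] = ",".join(bias_phrases)
    return fields

def transcribe(path, api_key=None):
    """POST an audio file and return the parsed JSON response."""
    import requests  # third-party: pip install requests
    api_key = api_key or os.environ["MISTRAL_API_KEY"]
    with open(path, "rb") as audio:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            data=transcription_fields(bias_phrases=["Datasette", "WebAssembly"]),
            files={"file": audio},
        )
    resp.raise_for_status()
    return resp.json()
```

Separating field construction from the network call makes the request easy to unit test without an API key.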
The API accepts MP3, WAV, M4A, FLAC, and OGG files up to 1GB. You can upload files up to 3 hours long.
Testing in Mistral Studio
Mistral provides an audio playground in Mistral Studio. Upload up to 10 files at once. Toggle diarization on or off. Select timestamp detail levels. Add context bias terms for specialized vocabulary.
The playground shows results immediately. You can test accuracy with your actual audio before committing to production deployment.
Deploying Realtime On-Device
Download Voxtral Realtime from Hugging Face. The model uses vLLM for serving. Install and configure with these commands:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
--compilation_config '{"cudagraph_mode": "PIECEWISE"}'
Adjust max-num-batched-tokens to balance throughput and latency. Higher values increase throughput but add latency. Reduce max-model-len if you transcribe shorter audio to save memory.
The model uses a default max-model-len of 131,072 tokens, roughly 2.9 hours of audio. One text token represents 80 milliseconds of audio.
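Because one token maps to 80 milliseconds, converting between token budgets and audio duration is simple arithmetic. A small illustrative helper:

```python
TOKEN_MS = 80  # each text token covers 80 ms of audio, per the model's design

def tokens_to_seconds(n_tokens):
    """Audio duration (seconds) covered by n_tokens at 80 ms per token."""
    return n_tokens * TOKEN_MS / 1000

def seconds_to_tokens(seconds):
    """Tokens needed to cover a given audio duration, rounded up."""
    return -(-int(seconds * 1000) // TOKEN_MS)
```

This kind of calculation helps you size max-model-len for your typical clip length instead of keeping the full default context.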
Configuring Transcription Delay
Voxtral Realtime lets you set delay between 80 milliseconds and 2.4 seconds. Edit the tekken.json file and change the transcription_delay_ms parameter.
Use multiples of 80 milliseconds only. Common settings:
| Delay | Use Case |
|---|---|
| 80-200ms | Voice assistants, interactive agents |
| 480ms | Recommended balance of speed and accuracy |
| 2.4s | Subtitle generation, highest accuracy |
Lower delays prioritize responsiveness. Higher delays improve transcription quality. Test different settings with your audio to find the optimal balance.
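A tiny helper, assuming the 80-millisecond granularity described above, that clamps and snaps a requested delay to a value the model accepts. The function is illustrative; the actual setting lives in tekken.json as transcription_delay_ms:

```python
DELAY_STEP_MS = 80    # delays must be multiples of 80 ms
DELAY_MIN_MS = 80
DELAY_MAX_MS = 2400

def snap_transcription_delay(requested_ms):
    """Clamp a requested delay to [80, 2400] ms and round it to the
    nearest multiple of 80 ms before writing it into tekken.json."""
    clamped = max(DELAY_MIN_MS, min(DELAY_MAX_MS, requested_ms))
    return round(clamped / DELAY_STEP_MS) * DELAY_STEP_MS
```

Validating the value in code avoids silently misconfiguring the model with an unsupported delay.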
Common Mistakes To Avoid
Forcing Context Bias in Non-English
Context biasing works best in English. Other languages receive experimental support. Don't rely on context bias for non-English transcription accuracy. Test thoroughly before production use.
Expecting Diarization From Realtime
Only Mini Transcribe V2 provides speaker diarization. Voxtral Realtime focuses on fast, accurate transcription without speaker identification. Choose the right model for your needs.
Ignoring Network Requirements for API
The Mistral API requires internet connectivity. Voxtral Mini Transcribe V2 runs in the cloud. Only Voxtral Realtime supports truly offline, on-device operation.
Overlooking Delay Configuration
Default settings may not suit your application. Interactive voice agents need minimal delay. Subtitle generation benefits from higher delays. Configure appropriately for your use case.
Processing Extremely Long Audio
The 3-hour maximum applies to single requests. Break longer recordings into segments. Process them separately and combine results.
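The segmentation step can be sketched as simple window arithmetic. The 5-second overlap below is an illustrative choice to avoid losing words at cut points, not a Voxtral requirement:

```python
def plan_segments(total_seconds, max_segment_s=3 * 3600, overlap_s=5):
    """Split a long recording into (start, end) windows that fit under
    the per-request duration cap, overlapping slightly at each cut."""
    if total_seconds <= max_segment_s:
        return [(0, total_seconds)]
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment_s, total_seconds)
        segments.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_s  # back up so no words fall on a boundary
    return segments
```

When merging results, drop duplicated text inside the overlap windows before concatenating the transcripts.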
Cost Comparison
| Service | Price per Hour |
|---|---|
| Voxtral Mini Transcribe V2 | $0.18 |
| Voxtral Realtime | $0.36 |
| OpenAI Whisper API | ~$0.36 |
| Google Speech-to-Text | ~$1.44 |
| Amazon Transcribe | ~$1.44 |
Voxtral costs significantly less than major cloud providers, and high-volume users save substantially. A company transcribing 1,000 hours monthly with Voxtral Realtime saves $1,080 compared to Google or Amazon; with Mini Transcribe V2 the saving rises to $1,260.
Voxtral Realtime costs more than Mini Transcribe V2 but enables on-device deployment. Organizations concerned about data privacy often find this worth the premium.
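The savings arithmetic is easy to check. A sketch using the per-hour rates from the table above:

```python
RATES_PER_HOUR = {  # USD per hour, from the comparison table above
    "voxtral-mini": 0.18,
    "voxtral-realtime": 0.36,
    "whisper-api": 0.36,
    "google-stt": 1.44,
    "amazon-transcribe": 1.44,
}

def monthly_saving(hours, ours, theirs):
    """Monthly saving from switching `theirs` -> `ours` at a given volume."""
    return round(hours * (RATES_PER_HOUR[theirs] - RATES_PER_HOUR[ours]), 2)
```

Plugging in 1,000 hours against Google or Amazon reproduces the figures quoted in this section.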
Technical Requirements
For Voxtral Realtime
- Single GPU with 16GB+ memory
- vLLM runtime environment
- BF16 precision support
- Linux operating system recommended
For Voxtral Mini Transcribe V2
- Internet connection
- Mistral API key
- Audio files in supported formats (MP3, WAV, M4A, FLAC, OGG)
The Realtime model achieves throughput exceeding 12.5 tokens per second. Since each token covers 80 milliseconds of audio, 12.5 tokens per second is exactly the real-time threshold, so capable hardware keeps up with live speech.
Best Practices
Optimize Context Bias Lists
Include only essential terms. The 100-phrase limit covers most specialized vocabulary. Prioritize proper nouns, technical terms, and domain-specific language that standard models miss.
Test context bias effectiveness. Add phrases that actually improve results. Remove entries that don't help.
Choose Appropriate Timestamp Granularity
Word-level timestamps enable subtitle generation and audio search. Segment-level timestamps work for meeting summaries. Select based on your downstream needs.
Finer granularity increases output size. Use word-level only when necessary.
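Timestamped segments map directly onto subtitle formats. Here is a sketch that renders (start, end, text) segments as standard SRT; the tuple shape is an assumption for illustration, not the API's exact response schema:

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp SRT subtitles use."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render (start_s, end_s, text) segments as an SRT subtitle string."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

Segment-level timestamps are usually enough for this; word-level timing matters only if you need karaoke-style highlighting.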
Handle Overlapping Speech
Voxtral transcribes one speaker during overlaps. Design workflows that account for this limitation. Consider using Mini Transcribe V2 for complex multi-speaker scenarios with frequent interruptions.
Test With Your Actual Audio
Benchmark performance using your real-world recordings. Industry-specific content, accents, and acoustic environments affect accuracy. Public benchmarks don't always predict performance on your data.
Balance Latency and Accuracy
Start with the recommended 480ms delay for Realtime. Adjust based on application requirements. Voice agents may need lower latency. Compliance transcription benefits from higher accuracy.
Open Source Advantage
Voxtral Realtime ships under Apache 2.0 license. You can modify the model for your needs. No licensing fees or usage restrictions apply.
This enables:
- Custom training on specialized vocabulary
- Integration with proprietary systems
- Deployment in air-gapped environments
- Modification for specific acoustic conditions
- Distribution in commercial products
The open-weights approach gives developers complete control. You're not locked into a vendor's ecosystem or pricing structure.
Future of On-Device Speech AI
Mistral positions Voxtral as foundation technology. The company aims for real-time speech-to-speech translation that feels natural. Current models focus on transcription. Future versions may handle direct translation between languages.
On-device processing becomes more important as privacy regulations expand. GDPR in Europe, CCPA in California, and sector-specific rules like HIPAA drive demand for local AI processing.
Smaller models that run efficiently on consumer hardware enable new applications. Voice agents, live translation, and meeting intelligence become accessible without powerful servers or cloud dependencies.
Conclusion
Voxtral Transcribe 2 brings enterprise-grade speech recognition to your local hardware. The roughly 4% word error rate matches leading cloud services. The $0.003-per-minute pricing undercuts major competitors. The Apache 2.0 license gives you complete control.
Choose Mini Transcribe V2 for batch processing with speaker identification. Use Voxtral Realtime for live transcription with minimal latency. Both models support 13 languages and handle challenging acoustic environments.
Start testing in Mistral Studio's audio playground. Upload your recordings and evaluate accuracy. Download Realtime from Hugging Face if you need on-device deployment. The combination of performance, price, and privacy makes Voxtral worth considering for any organization handling sensitive voice data.
