Bulk Transcribe Audio to VTT Captions Online Free
Generate WebVTT captions for HTML5 & web video — two Whisper models, up to 216x realtime.
Drop up to 50 files at once — no install, no sign-up required.
Drop Audio or Video Files Here
Encrypted · AI-Powered · Global Servers · Auto-delete 1h
How it works
- 1 · Drop your files
Drag & drop audio or video files. Supports MP3, WAV, M4A, MP4, and more. No account required.
- 2 · We generate captions
Transcribed by OpenAI Whisper (choose Fast or Quality model) with accurate timestamps. Encrypted in transit & at rest.
- 3 · Download & auto-delete
Get your VTT caption file in seconds. Files delete automatically after 1 hour.
Frequently Asked Questions
What is the VTT format?
WebVTT (Web Video Text Tracks) is the native caption format for HTML5 video.
VTT files are used with the <track> element in web browsers and support:
- CSS styling and positioning
- Metadata and cue settings
- Native browser support
It's the modern standard for web video captions.
VTT vs SRT — which should I use?
Use VTT for web video — HTML5 players, browsers, and web apps. It's the native web standard.
Use SRT for everything else — video editors, desktop players (VLC), YouTube/Vimeo uploads.
- VTT — CSS styling, positioning, modern browsers
- SRT — Broader legacy support, universal compatibility
Need legacy compatibility? Try Audio to SRT
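If you already have an SRT file, the two formats are close enough that a simple conversion is often possible. The sketch below is a minimal illustration, not this converter's implementation: it assumes plain SRT cues with no styling, adds the WEBVTT header, drops the numeric cue indexes, and swaps the comma before the milliseconds for a dot. The function name `srt_to_vtt` is hypothetical.

```python
import re

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:04,000".
_SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT (no styling or positioning)."""
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    cues = []
    for block in blocks:
        lines = block.splitlines()
        # Drop the numeric cue index that SRT puts before each timing line.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # WebVTT uses a dot, not a comma, before the milliseconds.
        lines[0] = _SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        cues.append("\n".join(lines))
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

Going the other way (VTT to SRT) loses any cue settings or voice tags, which is one reason to keep VTT as the master copy for web use.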
How do I add VTT captions to my website?
Add the VTT file using the HTML5 <track> element:
<video>
  <track src="captions.vtt" kind="captions" srclang="en">
</video>
The browser handles caption display automatically.
Works in all modern browsers: Chrome, Firefox, Safari, and Edge.
How accurate are the timestamps?
Our AI generates word-level timestamps using OpenAI Whisper — one of the most accurate speech recognition models available.
Timestamps are typically accurate to within a few hundred milliseconds.
This precision is well suited to synchronized web captions, where the text needs to track the audio closely.
What is the difference between Fast and Quality models?
Two OpenAI Whisper models — choose speed or accuracy:
| Model | Engine | Speed | Cost |
|---|---|---|---|
| Fast | Whisper V3 Large Turbo (809M) | ~216x realtime | 2 credits/min |
| Quality | Whisper V3 Large (1.55B) | ~189x realtime | 5 credits/min |
Fast is the default — great for clear audio, podcasts, and lectures.
Quality uses the full 1.55B-parameter model. Independent benchmarks (Artificial Analysis) show a word error rate (WER) of about 10% for Quality versus about 12% for Fast. Choose Quality for accented speech, noisy recordings, or technical content.
Both models support 99+ languages. Switch in the options panel above.
Sources: Groq docs, Artificial Analysis benchmark, Hugging Face model cards.
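To get a feel for what the realtime figures mean, you can divide the audio length by the speed factor. This is a rough sketch only: it treats the table's figures as best-case throughput and ignores upload time, queueing, and batching, so real turnaround will be longer. The names `REALTIME_FACTOR` and `estimated_seconds` are made up for illustration.

```python
# Approximate peak throughput from the table above (multiples of realtime).
REALTIME_FACTOR = {"fast": 216, "quality": 189}


def estimated_seconds(audio_minutes: float, model: str = "fast") -> float:
    """Best-case transcription time in seconds for a given audio length."""
    return audio_minutes * 60 / REALTIME_FACTOR[model]
```

For example, `estimated_seconds(60, "fast")` comes to roughly 17 seconds and `estimated_seconds(60, "quality")` to roughly 19 seconds of pure transcription time for a one-hour recording.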
How does Meeting Intelligence appear in VTT captions?
Meeting Intelligence adds speaker labels to your VTT captions, making it clear who's speaking in multi-speaker web videos like interviews, webinars, or training content.
When enabled, speaker names or labels appear as voice tags in the VTT:
WEBVTT

00:00:01.000 --> 00:00:04.000
<v Speaker 1>Welcome to the webinar.

00:00:04.500 --> 00:00:07.000
<v Speaker 2>Thanks for having me.

These voice tags enable CSS styling per speaker and improve accessibility for deaf and hard-of-hearing viewers.
Our AI post-processing attempts to identify speakers by name when introduced or addressed in the audio. Important: Name detection isn't perfect — it works best when speakers introduce themselves ("Hi, I'm John from Engineering...") or are called by name during the conversation. If names aren't detected, you'll see generic labels like "Speaker 1" and "Speaker 2."
Meeting Intelligence costs extra credits and is ideal for professional web content where speaker attribution improves clarity.
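If you want to produce or post-process speaker-labelled captions yourself, the structure is simple to generate. The following is a minimal sketch, not the converter's own code: it assumes cues arrive as `(start_seconds, end_seconds, speaker, text)` tuples and that the text contains no `<` or `&` characters that would need escaping. The helper names `format_timestamp` and `build_vtt` are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vtt(cues):
    """Build WebVTT text from (start, end, speaker, text) tuples."""
    parts = ["WEBVTT"]
    for start, end, speaker, text in cues:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        # <v Name> is the WebVTT voice span; a falsy speaker yields plain text.
        body = f"<v {speaker}>{text}" if speaker else text
        parts.append(f"{timing}\n{body}")
    return "\n\n".join(parts) + "\n"
```

Calling `build_vtt([(1.0, 4.0, "Speaker 1", "Welcome to the webinar.")])` yields a header, a blank line, the timing line, and the tagged cue text. In a page stylesheet, a selector along the lines of `video::cue(v[voice="Speaker 1"])` can then target that speaker's cues, though styling support varies by browser.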
What are the limits for this converter?
| Tier | Max File Size | Max Files/Batch | Parallel Processing |
|---|---|---|---|
| Guest/Free | 100 MB | 50 files | 3 at once |
| Pro | 1024 MB | 1000 files | 6 at once |
Note: File size limits are specific to this converter. Batch and parallel processing limits apply to all audio converters site-wide. See all converter limits →
How are credits calculated for this conversion?
Cost: 2 credits per minute with the Fast model (5 credits per minute with Quality)
How it works:
- Files up to 1 minute: 2 credits
- 2 minutes: 4 credits
- 3 minutes: 6 credits
- 4 minutes: 8 credits
Example: A 10-minute file = 20 credits. A 180-minute (3h) audiobook = 360 credits.
Why per-minute? Audio conversion time scales with content duration, not file size. Longer audio requires proportionally more processing.
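The pricing above can be sketched as a small calculation. One assumption to flag: the page lists whole-minute steps only, so this sketch assumes each started minute is billed in full (a 61-second file would count as 2 minutes); the actual rounding rule may differ. The function name `credits_for` is hypothetical.

```python
import math


def credits_for(duration_seconds: float, credits_per_minute: int = 2) -> int:
    """Credits for one file, billing each started minute in full (assumed)."""
    minutes = max(1, math.ceil(duration_seconds / 60))
    return minutes * credits_per_minute
```

With the default rate, `credits_for(600)` gives 20 and `credits_for(180 * 60)` gives 360, matching the examples above. Passing `credits_per_minute=5` models the Quality rate from the model comparison table.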
What are my daily and monthly credit limits?
Credit allocations vary by account tier:
| Tier | Daily Limit | Monthly Limit |
|---|---|---|
| Guest | 100 credits/day | — |
| Free | 100 credits/day | — |
| Pro | — | 12,000 credits/month |
Daily credits (Guest & Free tiers) reset every day at midnight UTC. Monthly credits (Pro) reset on your billing cycle date.
Note: At 2 credits per minute, an audio file of 1 minute or less costs 2 credits, so Pro users can convert up to 6,000 such files per month (12,000 ÷ 2).
Answers at a Glance
Quick answers to common questions.
- Are my files secure?
- How long do you keep my files?
- What metadata do you keep?
- What happens after I drop a file?
- Why are conversions so fast?
- How do you measure performance?
- What are the exact limits for each plan?
- Can I process files in bulk?
- Why did my file fail to convert?
- Do you use my files to train AI?
What's New in Audio to VTT
Latest improvements to this converter
Added Whisper V3 Large as a Quality mode for higher-accuracy transcription.
Launched Audio to VTT transcription with web captions.
Need to get more done? Pro starts from $5.
No subscription required.