Bulk Transcribe Audio to Text Online Free

Transcribe hours of audio in seconds — two Whisper models, up to 216x realtime.

Drop up to 50 files at once — no install, no sign-up required.

Drop Audio or Video Files Here

100 MB or 1 hour per file · Up to 50 files · 3 parallel conversions · 2 credits per minute

Encrypted · AI-Powered · Global Servers · Auto-delete 1h

⚡ Median transcription time (last 10k jobs): 11.2s per minute

Outputs: TXT, PDF, DOCX, MD, EPUB, SRT, VTT · Model: Fast

How it works

1 · Drop your files
Drag & drop audio or video files. Supports MP3, WAV, M4A, MP4, and more. No account required.
2 · We transcribe securely
Your files are transcribed by OpenAI Whisper (choose the Fast or Quality model), then formatted with ML paragraph segmentation. Data is encrypted in transit and at rest. Choose TXT, PDF, DOCX, MD, EPUB, SRT, or VTT output.
3 · Download & auto-delete
Get your transcript in seconds. Files are deleted automatically after 1 hour.

Frequently Asked Questions

What audio and video formats are supported?

Audio formats: MP3, WAV, M4A, FLAC, OGG, OPUS, AAC

Video formats: MP4, WebM, MKV, MOV, AVI (we extract the audio track)

What output formats are available?

Choose the format that fits your use case:

TXT — Plain text with intelligent paragraph breaks (ML-powered)
SRT — Subtitles with timestamps for video players
VTT — Web captions for HTML5 video
PDF — Professional document with AI title for sharing/archiving
Word — Editable document with AI title
Markdown — Clean format for note-taking apps
EPUB — E-book format for e-readers

What languages are supported?

Automatic language detection via OpenAI Whisper. Both Fast and Quality models support 99+ languages including:

English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Korean, Chinese, and many more.

Mixed-language content is handled automatically.

How accurate is the transcription?

We offer two OpenAI Whisper models — Fast (Whisper V3 Large Turbo) for speed, and Quality (Whisper V3 Large) for maximum accuracy.

On clear recordings, Whisper achieves near-human accuracy. Our ML post-processing then structures the transcript into clean, readable paragraphs.

Results vary with audio quality — see our FAQ on accuracy limitations below.

Is the transcription guaranteed to be perfect?

No — and no transcription service can guarantee 100% accuracy.

Whisper achieves near-human accuracy on clear recordings in major languages. However, accuracy depends on:

Audio quality and background noise
Speaker clarity and accents
Multiple overlapping speakers
Domain-specific jargon and proper nouns

We recommend reviewing transcripts for critical use cases (legal, medical, published content).

How does the paragraph formatting work?

After transcription, we run your text through a machine learning model trained specifically for paragraph segmentation.

It analyzes sentence structure and topic flow to determine natural paragraph breaks — turning a wall of continuous text into clean, readable paragraphs.

This works automatically for text and document outputs (TXT, PDF, Word, Markdown, EPUB); SRT and VTT formats retain timestamped segments for subtitle synchronization.

How long can my audio files be?

Duration limits depend on your plan:

Free: 60 minutes (1 hour)
Pro: 300 minutes (5 hours)

This accommodates everything from short clips to full podcasts and lectures.

How much does transcription cost?

Transcription costs 2 credits per minute of audio.

A 30-minute podcast costs 60 credits.

Free: 50 credits daily (~25 minutes)
Pro: 12,000+ credits monthly (~100 hours)

Which output format should I choose?

Plain Text — Raw transcript for scripts or simple reading
SRT/VTT — Subtitles for video (SRT for editors, VTT for web)
PDF — Fixed document to share or archive
Word — Editable for corrections and collaboration
Markdown — Notes apps (Obsidian, Notion)
EPUB — E-readers (Kobo, Apple Books)
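
If you download SRT and later need VTT for a web player, the two formats are close enough to convert locally. The following is a minimal Python sketch, not our converter's implementation: it relies only on the fact that WebVTT starts with a WEBVTT header line and uses a dot instead of a comma before the milliseconds. The function name srt_to_vtt and the sample text are made up for illustration.

```python
import re

# Matches SRT timestamps such as 00:01:02,345 (comma before the milliseconds).
_SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT subtitle text to WebVTT.

    WebVTT needs a "WEBVTT" header line and a dot as the millisecond
    separator. The numeric SRT counters are kept; WebVTT treats them
    as optional cue identifiers.
    """
    body = srt_text.replace("\r\n", "\n").strip()
    # Note: this also rewrites timestamp-shaped text inside a caption,
    # which is acceptable for a sketch but not for production use.
    body = _SRT_TIMESTAMP.sub(r"\1.\2", body)
    return "WEBVTT\n\n" + body + "\n"


sample_srt = """1
00:00:00,000 --> 00:00:02,500
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
Let's get started.
"""

vtt = srt_to_vtt(sample_srt)
print(vtt)
```

Styling and positioning tags are not handled here, so treat the result as a starting point for plain captions only.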

What is the difference between Fast and Quality models?

Two OpenAI Whisper models — choose speed or accuracy:

Model	Engine	Speed	Cost
Fast	Whisper V3 Large Turbo (809M)	~216x realtime	2 credits/min
Quality	Whisper V3 Large (1.55B)	~189x realtime	5 credits/min

Fast is the default — great for clear audio, podcasts, and lectures.

Quality uses the full 1.55B-parameter model. Independent benchmarks show ~10% WER for Quality vs ~12% for Fast (Artificial Analysis). Choose Quality for accented speech, noisy recordings, or technical content.

Both models support 99+ languages. Switch in the options panel above.

Sources: Groq docs, Artificial Analysis benchmark, Hugging Face model cards.
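
To compare the two models for a specific recording, the figures in the table can be turned into a rough estimate. This Python sketch assumes the realtime factor describes model compute only (upload, queueing, and formatting are not included) and that credits scale linearly with duration. The Model class and estimate function are illustrative names, not part of any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    realtime_factor: float  # minutes of audio processed per minute of compute
    credits_per_minute: int


# Figures copied from the comparison table above.
FAST = Model("Fast", 216.0, 2)
QUALITY = Model("Quality", 189.0, 5)


def estimate(model: Model, audio_minutes: float) -> tuple[float, float]:
    """Return (compute_seconds, credits) for a recording of the given length."""
    compute_seconds = audio_minutes * 60.0 / model.realtime_factor
    credits = audio_minutes * model.credits_per_minute
    return compute_seconds, credits


for model in (FAST, QUALITY):
    seconds, credits = estimate(model, audio_minutes=60)
    print(f"{model.name}: ~{seconds:.1f}s of compute, {credits:.0f} credits")
```

For a 60-minute recording this works out to roughly 16.7 seconds and 120 credits on Fast versus roughly 19.0 seconds and 300 credits on Quality, so the choice is mostly about cost and accuracy rather than waiting time.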

What is Meeting Intelligence and how does it work?

Meeting Intelligence (speaker diarization) separates and labels different speakers in your audio.

When enabled, the transcript identifies who said what using labels like Speaker 1, Speaker 2, etc. Our AI post-processing then analyzes the context to infer actual names when possible — for example, if someone says "Hi, I'm John" or is addressed as "Thanks, Sarah."

Important: Name detection is not perfect. It works best when:

Speakers introduce themselves at the start
People are addressed by name during conversation
Audio quality is clear with minimal background noise

Meeting Intelligence costs extra credits and has lower file size limits due to the additional processing required.

How accurate is Meeting Intelligence?

Speaker separation (diarization) is highly accurate when speakers have distinct voices and don't talk over each other. The AI correctly identifies different speakers in most clear recordings.

Name inference is more variable. The AI post-processing attempts to extract names from the audio context, but accuracy depends on:

Whether speakers explicitly state their names
How clearly names are pronounced
Background noise and audio quality

If name detection misses someone, you'll see generic labels like "Speaker 1" — the transcript content is still accurate, just without the name attribution.

What are the limits for this converter?

Tier	Max File Size	Max Files/Batch	Parallel Processing
Guest/Free	100 MB	50 files	3 at once
Pro	1024 MB	1000 files	6 at once

Note: File size limits are specific to this converter. Batch and parallel processing limits apply to all converters site-wide. See all converter limits →
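
If you prepare batches with a script, you can check them against these limits before uploading. The sketch below is a hypothetical local pre-check in Python using the numbers from the table; the names TierLimits, LIMITS, and check_batch are invented for illustration, and the server remains the final authority on what it accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    max_file_mb: int
    max_files_per_batch: int
    parallel: int


# Values from the limits table above ("free" covers Guest and Free).
LIMITS = {
    "free": TierLimits(max_file_mb=100, max_files_per_batch=50, parallel=3),
    "pro": TierLimits(max_file_mb=1024, max_files_per_batch=1000, parallel=6),
}


def check_batch(tier: str, file_sizes_mb: list[float]) -> list[str]:
    """Return human-readable problems; an empty list means the batch fits."""
    limits = LIMITS[tier]
    problems = []
    if len(file_sizes_mb) > limits.max_files_per_batch:
        problems.append(
            f"{len(file_sizes_mb)} files exceeds the batch limit "
            f"of {limits.max_files_per_batch}"
        )
    for index, size in enumerate(file_sizes_mb):
        if size > limits.max_file_mb:
            problems.append(
                f"file {index} is {size} MB, over the {limits.max_file_mb} MB limit"
            )
    return problems


print(check_batch("free", [12.5, 250.0]))  # one problem: the 250 MB file
print(check_batch("pro", [12.5, 250.0]))   # no problems
```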

How are credits calculated for this conversion?

Cost: 2 credits per minute

How it works:

Files up to 1 minute: 2 credits
2 minutes: 4 credits
3 minutes: 6 credits
4 minutes: 8 credits

Example: A 10-minute file = 20 credits. A 180-minute (3h) audiobook = 360 credits.

Why per-minute? Audio conversion time scales with content duration, not file size. Longer audio requires proportionally more processing.
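
The per-minute pricing above can be sketched in a few lines of Python. The rounding rule is an assumption: the "up to 1 minute" row suggests that each started minute is billed in full, but the exact behavior for partial minutes beyond the first is not stated here. The function name credits_for is hypothetical.

```python
import math

CREDITS_PER_MINUTE = 2  # Fast model rate stated above


def credits_for(duration_seconds: float, rate: int = CREDITS_PER_MINUTE) -> int:
    """Credits for one file, billing each started minute in full (assumed)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    billed_minutes = math.ceil(duration_seconds / 60)
    return billed_minutes * rate


print(credits_for(45))        # 2   (under a minute still bills one minute)
print(credits_for(10 * 60))   # 20
print(credits_for(180 * 60))  # 360
```

Pass rate=5 to estimate the same file on the Quality model.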

What are my daily and monthly credit limits?

Credit allocations vary by account tier:

Tier	Daily Limit	Monthly Limit
Guest	50 credits/day	—
Free	50 credits/day	—
Pro	—	12,000 credits/month

Daily credits (Guest & Free tiers) reset every day at midnight UTC. Monthly credits (Pro) reset on your billing cycle date.

Note: At 2 credits per minute, audio files up to 1 minute long cost 2 credits each, so Pro users can convert up to 6,000 such files per month.
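
Because daily credits reset at midnight UTC rather than in your local time zone, it can help to compute the next reset explicitly. This is a small Python sketch using only the standard library; the function name next_daily_reset is illustrative.

```python
from datetime import datetime, timedelta, timezone


def next_daily_reset(now: datetime) -> datetime:
    """Return the next midnight UTC strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


now = datetime.now(timezone.utc)
reset = next_daily_reset(now)
print(f"Daily credits reset in {reset - now}")
```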

All Transcription Formats

Dedicated converters with format-specific options:

SRT — Video subtitles
VTT — Web captions
PDF — Shareable document
Word — Editable
Markdown — Notes apps
EPUB — E-readers

What's New in Audio to Text

Latest improvements to this converter

Last updated February 27, 2026

Feb 27, 2026

Now available via the Convert.FAST REST API.

Feb 9, 2026

Added Whisper V3 Large as a Quality mode for higher-accuracy transcription.

Jan 19, 2026

Added transcription mode selection: English (Fast) for English audio, Multilingual for 99 other languages.

Need to get more done? Pro starts from $5.

1 GB files · 1,000 per batch · Priority queue · Web + API

See Pricing →

No subscription required.