Key Takeaways
- whisper.cpp is the best local STT choice for Apple Silicon. The C/C++ port leverages Core ML and Apple Metal for hardware acceleration: large-v3 at ~10× real-time on M5 Pro, with no Python dependency required.
- faster-whisper is the best local STT choice for NVIDIA GPUs and Python pipelines. CTranslate2 int8 quantization cuts VRAM by ~40% and boosts throughput by ~4× over the original OpenAI implementation: large-v3 at ~12× real-time on an RTX 4070, using only ~2.5 GB VRAM.
- Both tools use identical Whisper model weights from OpenAI. WER (word error rate) is the same for both; the difference is entirely in runtime performance and integration pathway, not transcription accuracy.
- Whisper large-v3 gives the best accuracy at 2.5% WER on English. For most production use cases, Whisper small (3.4% WER, 2 GB RAM) or medium (2.9% WER, 5 GB RAM) offers a better speed-accuracy trade-off.
- Real-time transcription is achievable with both tools: whisper.cpp via its stream mode, faster-whisper via its built-in VAD (voice activity detection) pipeline. Practical latency is 0.5–2 seconds behind live speech depending on model size.
- whisper.cpp runs on CPU, Metal, CUDA, and Vulkan, making it the only choice for cross-platform embedded use (Raspberry Pi, Windows GPU setups, ARM servers). faster-whisper supports CPU and CUDA only (no Metal on Mac).
- For Raspberry Pi and embedded Linux, whisper.cpp tiny/base on CPU is the practical ceiling: tiny at ~15× real-time on Pi 5, base at ~6× real-time. Both fit within 1 GB RAM.
Quick Facts
- Both tools: Based on OpenAI's open-source Whisper model (MIT license). Same accuracy, different runtimes.
- whisper.cpp: Written in C/C++ by ggerganov. Supports CPU (AVX2/NEON), CUDA, Metal (Apple), Vulkan. No Python required.
- faster-whisper: Python library using CTranslate2. Supports CPU (int8) and CUDA. No Apple Metal support.
- Whisper model sizes: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.55B). The same weights are available in both GGML and CTranslate2 formats.
- Best model for most use cases: Whisper small (3.4% WER, runs in 2 GB RAM, ~6× real-time on a modern CPU).
- RTX 4070 benchmark (large-v3): faster-whisper ~12× real-time; whisper.cpp CUDA ~8× real-time. faster-whisper wins on NVIDIA.
- M5 Pro benchmark (large-v3): whisper.cpp Metal ~10× real-time; faster-whisper CPU-only ~3× real-time. whisper.cpp wins on Apple.
Why Local Speech-to-Text?
Cloud STT services (Google Speech-to-Text, AWS Transcribe, Azure Speech) charge per audio minute (typically $0.006–$0.024/minute) and send your audio to remote servers. For privacy-sensitive applications (medical dictation, legal recordings, journalist interviews, corporate meetings), local transcription eliminates the data exposure entirely.
- Privacy: Audio never leaves your machine. No data-processing agreement needed for GDPR compliance, since processing happens locally.
- Cost: Zero per-minute fees. A developer transcribing 8 hours of meetings per week (~2,080 audio minutes/month) saves roughly $12–$50/month at cloud STT list prices; see the worked calculation after this list.
- Offline: Works on planes, in secure facilities, and in areas without reliable internet. No API key management.
- Latency: No upload/download round-trip. For real-time voice interfaces, local processing reduces STT latency from 300–800 ms (cloud) to 50–300 ms.
- Customization: Fine-tune on domain-specific vocabulary. Run any model size that fits your hardware.
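The cost line above is simple arithmetic; a quick sanity check in Python (list prices only — actual cloud bills vary by provider, tier, and features):

```python
# Rough monthly cost of cloud STT for 8 hours of audio per week.
hours_per_week = 8
minutes_per_month = hours_per_week * 60 * 52 / 12  # ~2,080 minutes

low_rate, high_rate = 0.006, 0.024  # USD per audio minute
print(f"~{minutes_per_month:.0f} min/month -> "
      f"${minutes_per_month * low_rate:.0f}-${minutes_per_month * high_rate:.0f}/month")
# ~2080 min/month -> $12-$50/month
```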
Whisper Model Sizes – Foundation for Both Tools
Both whisper.cpp and faster-whisper use the same Whisper model weights, converted to their respective formats (GGML for whisper.cpp, CTranslate2 for faster-whisper). Choose your model size based on your VRAM/RAM budget and accuracy requirements.
| Model | Parameters | VRAM / RAM | English WER | Speed vs real-time (RTX 4070) |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 7.6% | ~32× |
| base | 74M | ~1 GB | 5.0% | ~16× |
| small | 244M | ~2 GB | 3.4% | ~6× |
| medium | 769M | ~5 GB | 2.9% | ~2× |
| large-v3 | 1.55B | ~10 GB | 2.5% | 1× (baseline) |
| distil-large-v3 | ~756M | ~4 GB | ~2.6% | ~6× |
WER (word error rate) figures from the Whisper paper on the LibriSpeech clean test set. Lower is better. Speed factors for faster-whisper int8 on RTX 4070. distil-large-v3 figures from the Distil-Whisper paper.
Distil-Whisper: The Faster Alternative
distil-whisper/distil-large-v3 is a distilled variant of large-v3 with ~50% fewer parameters, running ~6× faster while keeping WER within ~1% of the original. It is the right choice when transcription speed matters more than squeezing out the last fraction of accuracy. distil-large-v3 works with both faster-whisper (native CTranslate2 support) and whisper.cpp (via GGML format conversion), so it integrates into whichever runtime you already use.
- Parameters: ~756M, roughly half of large-v3's 1.55B, fitting in ~4 GB VRAM instead of ~10 GB.
- Speed: ~6× real-time on RTX 4070 (vs. 1× baseline for large-v3), comparable to the medium model in speed, with large-v3-level accuracy.
- WER: ~2.6% on English, only ~0.1% higher than large-v3's 2.5%. In practice, the difference is inaudible on typical speech.
- Compatibility: Works with faster-whisper natively: `WhisperModel("distil-large-v3", device="cuda", compute_type="int8")`. For whisper.cpp, convert to GGML format using the distil-whisper GGML conversion script. A complete loading example follows this list.
- Best for: Batch transcription jobs, server deployments with limited VRAM, and any use case where you want large-v3 quality at medium-model speed.
- Not for: Multilingual transcription – distil-large-v3 is English-only. For other languages, use large-v3 or medium.
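A minimal faster-whisper loading sketch for distil-large-v3, using the same API the article shows for large-v3 (file name and beam size are illustrative):

```python
from faster_whisper import WhisperModel

# English-only distilled model: ~half the parameters of large-v3,
# roughly medium-model speed at near-large-v3 accuracy.
model = WhisperModel("distil-large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("meeting.wav", beam_size=5, language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
```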
whisper.cpp – The C/C++ Port
whisper.cpp (by Georgi Gerganov) is a pure C/C++ reimplementation of OpenAI's Whisper model, optimized for low-resource and cross-platform inference. It requires no Python, no CUDA toolkit, and runs on virtually any hardware – from Raspberry Pi to Apple M5 Pro to Windows CUDA setups.
- Platform support: CPU (AVX2, AVX512, ARM NEON), Apple Metal (Core ML), CUDA (NVIDIA), Vulkan (AMD/Intel GPU), OpenCL.
- Apple Silicon advantage: whisper.cpp exports models to Core ML format, enabling encoder inference on the Apple Neural Engine. large-v3 runs at ~10× real-time on M5 Pro via Metal – faster than any cloud round-trip.
- Installation: Clone the repo and run `make` (or use `cmake`). Pre-built binaries are available for common platforms. No Python dependency.
- Model download: `bash ./models/download-ggml-model.sh base.en` downloads the GGML-format model file (~142 MB for base).
- CLI example: `./main -m models/ggml-base.bin -f audio.wav` transcribes a WAV file to stdout. Add `-l de` for German.
- Real-time stream mode: `./stream -m models/ggml-base.bin --step 3000 --length 10000` transcribes from the microphone in 3-second chunks.
- Python wrapper: pywhispercpp provides a Python binding for whisper.cpp, enabling use in Python pipelines without sacrificing Metal acceleration – see the sketch after this list.
- Limitation: No native VAD (voice activity detection). Stream mode requires tuning the `--step` and `--length` parameters for your use case.
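A minimal pywhispercpp sketch, following the Model API shown in the pywhispercpp README (constructor arguments such as n_threads may vary across versions):

```python
# Python bindings over the whisper.cpp C API, so Metal / Core ML
# acceleration is retained on Apple Silicon.
from pywhispercpp.model import Model

model = Model("base.en", n_threads=4)   # fetches the GGML model on first use
segments = model.transcribe("audio.wav")
for segment in segments:
    print(segment.text)
```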
```bash
# Build from source (macOS / Linux)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j4

# Download a model
bash ./models/download-ggml-model.sh large-v3

# Transcribe a file
./main -m models/ggml-large-v3.bin -f recording.wav

# Enable Core ML on Apple Silicon: generate the Core ML encoder
# (requires coremltools), then rebuild with Core ML enabled
./models/generate-coreml-model.sh large-v3
make -j4 WHISPER_COREML=1
# Point -m at the GGML file as usual; whisper.cpp picks up the
# ggml-large-v3-encoder.mlmodelc next to it automatically
./main -m models/ggml-large-v3.bin -f recording.wav
```
faster-whisper – The CTranslate2 Port
faster-whisper (by SYSTRAN) is a Python library that reimplements Whisper inference using CTranslate2, a highly optimized C++ inference engine that supports int8 quantization, reducing VRAM usage and increasing throughput. On NVIDIA GPUs, faster-whisper is the fastest local Whisper implementation available.
- Platform support: CPU (int8 quantization) and NVIDIA CUDA GPU. No Apple Metal support – runs CPU-only on Mac.
- int8 advantage: CTranslate2 int8 quantization reduces VRAM by ~40% and increases inference speed by ~2× vs float16, with negligible WER impact (< 0.1% absolute).
- Installation: `pip install faster-whisper` – no compilation required. CUDA support requires CUDA 11.8+ and cuDNN 8.x.
- Built-in VAD: faster-whisper includes Silero VAD integration, which automatically skips silent audio segments – critical for real-time transcription pipelines.
- Python-native: The direct Python API makes it trivial to chain with LLMs, audio processing libraries, and web frameworks.
- Speed: large-v3 int8 on an RTX 4070 runs at ~12× real-time and uses ~2.5 GB VRAM. CPU int8 achieves ~20× real-time for the tiny model.
- Batch processing: faster-whisper supports batched inference for processing large audio archives efficiently.
- Limitation: No Metal support on Mac – runs CPU-only on Apple Silicon, achieving ~3× real-time for large-v3 vs. whisper.cpp's ~10× with Metal.
```python
from faster_whisper import WhisperModel

# Load model (downloads automatically on first run)
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Transcribe
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
```
Head-to-Head Benchmark Table
All benchmarks use the large-v3 model unless noted. Speed is measured in multiples of real-time (e.g., 10× means 60 minutes of audio transcribed in 6 minutes). VRAM figures for GPU runs; RAM figures for CPU runs.
In One Sentence
On Apple Silicon, whisper.cpp with Metal runs large-v3 at ~10× real-time; on NVIDIA GPUs, faster-whisper with int8 runs at ~12× real-time – each tool wins decisively on its target platform.
In Plain Terms
Pick whisper.cpp on Mac (it uses the Apple Neural Engine), and faster-whisper on Windows/Linux with an NVIDIA GPU (it processes audio 12× faster than real-time while using 40% less GPU memory).
| Metric | whisper.cpp (large-v3) | faster-whisper (large-v3) |
|---|---|---|
| Platform / language | C/C++ (cross-platform) | Python (CTranslate2) |
| GPU support | CUDA, Metal, Vulkan | CUDA only |
| CPU optimization | AVX2, ARM NEON | int8 quantization |
| Speed – RTX 4070, large-v3 | ~8× real-time | ~12× real-time ✅ |
| Speed – M5 Pro, large-v3 | ~10× real-time (Metal) ✅ | ~3× real-time (CPU only) |
| Speed – CPU only (x86), base | ~15× real-time | ~20× real-time ✅ |
| VRAM – large-v3, GPU | ~3 GB | ~2.5 GB (int8) ✅ |
| Python integration | Needs wrapper (pywhispercpp) | Native ✅ |
| VAD (silence detection) | Manual (--step tuning) | Built-in (Silero VAD) ✅ |
| Real-time streaming | Yes (stream example) ✅ | Yes (VAD pipeline) |
| WER accuracy (large-v3) | 2.5% (identical) | 2.5% (identical) |
| Python dependency | None ✅ | Python 3.8+ |
| Raspberry Pi / embedded | Yes, pure C binary ✅ | Limited (Python overhead) |
| Output formats | SRT, VTT, JSON, CSV, txt | Python objects (start, end, text) |
whisper.cpp writes output directly to standard subtitle and transcript file formats (SRT, VTT, JSON, CSV, txt) – ideal for subtitle workflows where you need a file on disk with no additional code. faster-whisper yields a Python generator of segment objects with start, end, and text attributes – ideal for LLM pipeline chaining, where you pass segment text directly into a downstream model without writing intermediate files. For subtitle generation, whisper.cpp is simpler; for pipelines that process segments programmatically, faster-whisper is simpler, as the SRT-writer sketch below illustrates.
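faster-whisper has no built-in file writers, so a subtitle workflow needs a few lines of glue. A minimal SRT writer over the segment attributes described above (model size and file names are illustrative):

```python
from faster_whisper import WhisperModel

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe("talk.wav")

# Each SRT cue: index, "start --> end" timing line, text, blank line.
with open("talk.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{to_srt_time(seg.start)} --> {to_srt_time(seg.end)}\n"
                f"{seg.text.strip()}\n\n")
```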
Real-Time Transcription Setup
Real-time transcription processes audio in chunks as it arrives from a microphone, producing text with a short lag behind speech. Both tools support this, but with different trade-offs.
- whisper.cpp stream mode: Run `./stream -m models/ggml-small.bin --step 3000 --length 10000 -t 4`. Processes 3-second audio chunks; ~0.5–1.5 second lag with the small model. No Python needed.
- faster-whisper VAD pipeline: Use `vad_filter=True` in `model.transcribe()`. Silero VAD automatically segments audio at silence boundaries – more natural chunks than fixed-length windows. A minimal capture loop follows this list.
- Practical latency: 0.5–2 seconds behind live speech with small or medium models. Use tiny for the lowest latency (< 0.5 seconds, but higher WER).
- Model selection for real-time: small or base is the practical sweet spot – fast enough to keep up with speech, accurate enough for clean audio. Avoid large-v3 for real-time unless you have a dedicated GPU.
- Microphone input: whisper.cpp reads raw audio via SDL2 or portaudio. faster-whisper reads audio arrays from any Python audio library (sounddevice, pyaudio, soundfile).
- Stability: whisper.cpp stream mode can produce repeated tokens ("hallucinate" short fillers) on silence. Suppress with `--suppress-blank` and `--no-speech-threshold`.
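A minimal chunked-capture sketch pairing faster-whisper's VAD filter with the sounddevice library (an assumption; any source of 16 kHz float32 mono arrays works). Fixed-length windows are the simplest starting point, not a true streaming decoder:

```python
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono
CHUNK_SECONDS = 3      # smaller chunks: lower latency, less context

model = WhisperModel("small", device="cpu", compute_type="int8")

print("Listening... Ctrl+C to stop.")
while True:
    # Record one fixed-length chunk from the default microphone.
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    # vad_filter=True skips silent chunks instead of hallucinating fillers.
    segments, _ = model.transcribe(audio.flatten(), vad_filter=True,
                                   language="en")
    for seg in segments:
        print(seg.text.strip())
```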
Apple Silicon: whisper.cpp Wins
On M1, M2, M3, M4, and M5 Macs, whisper.cpp with Core ML / Metal acceleration is the correct tool, no question. faster-whisper has no Metal support and runs CPU-only on Mac, achieving roughly 3× real-time for large-v3. whisper.cpp with Metal achieves ~10× real-time on M5 Pro – a 3× speed advantage.
- Core ML export: Run `./models/generate-coreml-model.sh large-v3` to export the encoder to Core ML format. This offloads encoder inference to the Apple Neural Engine.
- M5 Pro benchmark (large-v3, Metal): ~10× real-time; 60 minutes of audio transcribes in ~6 minutes. Note: the M5 Pro shipped March 2026, so these are early community benchmarks. Performance may improve as whisper.cpp updates optimize for the M5 Neural Engine.
- M3 MacBook Air benchmark (large-v3, Metal): ~7× real-time; 60 minutes in ~8.5 minutes.
- Memory: Unified memory means no separate VRAM – a 16 GB M5 Pro can comfortably run large-v3 (~3 GB) alongside other processes.
- faster-whisper on Mac: CPU-only, int8. large-v3 at ~3× real-time. Usable for overnight batch transcription but not for real-time or time-sensitive workflows.
- Recommendation: Use whisper.cpp for all Mac STT work. Add pywhispercpp if you need Python integration while retaining Metal acceleration.
NVIDIA GPU: faster-whisper Wins
On Windows and Linux with NVIDIA GPUs, faster-whisper is the superior choice. Its CTranslate2 CUDA backend is more optimized than whisper.cpp's CUDA path – ~12× vs. ~8× real-time for large-v3 on an RTX 4070, with lower VRAM usage.
- RTX 4070 (12 GB) benchmark (large-v3 int8): ~12× real-time, ~2.5 GB VRAM.
- RTX 3060 (12 GB) benchmark (large-v3 int8): ~8× real-time, ~2.5 GB VRAM.
- RTX 4060 (8 GB) benchmark (large-v3 int8): ~7× real-time, ~2.5 GB VRAM – easily fits.
- int8 vs float16: int8 is ~2× faster and uses ~40% less VRAM with negligible accuracy loss. Always use `compute_type="int8"` on NVIDIA.
- Batch processing: faster-whisper supports batched inference (via its BatchedInferencePipeline wrapper in recent releases), enabling parallel processing of audio segments and maximizing GPU utilization for large transcription jobs – see the sketch after this list.
- Python pipeline integration: faster-whisper slots directly into LangChain, Haystack, and custom Python pipelines. No subprocess overhead vs. wrapping whisper.cpp.
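A batch-processing sketch. Recent faster-whisper releases expose a BatchedInferencePipeline wrapper; treat the exact API and the batch_size value as version-dependent:

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")
pipeline = BatchedInferencePipeline(model=model)

# Decodes multiple audio segments in parallel on the GPU,
# trading a little latency for much higher throughput.
segments, info = pipeline.transcribe("long_recording.wav", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
```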
When to Use Which
A direct mapping from your scenario to the right tool:
In One Sentence
Use whisper.cpp on Apple Silicon and embedded/cross-platform targets; use faster-whisper on NVIDIA GPUs and Python pipelines.
In Plain Terms
If you have a Mac, pick whisper.cpp – it's 3× faster than faster-whisper on Apple hardware. If you have an NVIDIA GPU and write Python, pick faster-whisper – it's faster and needs 40% less GPU memory.
| Scenario | Best Choice | Why |
|---|---|---|
| Apple Silicon Mac (any model) | whisper.cpp | Metal / Core ML acceleration – 3× faster than faster-whisper (CPU-only on Mac) |
| NVIDIA GPU server (Linux/Windows) | faster-whisper | CTranslate2 int8 – faster and lower VRAM than whisper.cpp CUDA path |
| Python data pipeline | faster-whisper | Native Python API; no subprocess wrapper; VAD built in |
| Raspberry Pi / embedded Linux | whisper.cpp | Pure C binary; no Python runtime overhead; ARM NEON optimized |
| Real-time voice assistant | whisper.cpp | Stream mode with low overhead; works without Python on Pi / embedded |
| Batch transcription (large audio archive) | faster-whisper | Batched inference, GPU utilization, Python async integration |
| AMD GPU (Vulkan) | whisper.cpp | Vulkan backend support; faster-whisper is CUDA-only |
| CPU-only Linux server | faster-whisper | int8 quantization gives ~30% speed advantage on x86 CPU |
Beyond whisper.cpp and faster-whisper
Two additional tools extend Whisper with capabilities that neither whisper.cpp nor faster-whisper provides out of the box: speaker diarization and extreme-speed batch GPU inference.
- WhisperX: Built on top of faster-whisper, WhisperX adds word-level timestamps and speaker diarization – identifying which speaker said which words. Best for meeting transcription with speaker labels, podcast editing, and interview transcripts. Install with `pip install whisperx` and provide a Hugging Face token for the diarization model. A usage sketch follows this list.
- insanely-fast-whisper: A Hugging Face Transformers pipeline wrapper that adds Flash Attention 2 support for significantly faster GPU inference than standard faster-whisper on NVIDIA hardware. Best for batch processing large audio archives on NVIDIA GPUs. Requires a Flash Attention 2-compatible GPU (Ampere or newer: RTX 3000+, A100, H100).
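A WhisperX usage sketch along the lines of its README; function names and the diarization entry point have shifted between versions, so treat this as illustrative rather than exact:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.wav")

# 1) Transcribe with the faster-whisper backend.
model = whisperx.load_model("large-v3", device, compute_type="int8")
result = model.transcribe(audio, batch_size=16)

# 2) Align for word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3) Assign speaker labels via pyannote diarization (needs an HF token).
diarizer = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
result = whisperx.assign_word_speakers(diarizer(audio), result)

for seg in result["segments"]:
    print(f"{seg.get('speaker', '?')}: {seg['text'].strip()}")
```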
Common Issues and Fixes
The most frequent setup and runtime problems, with direct fixes:
- CUDA version mismatch: faster-whisper requires CUDA 11.8 or later. Check with `nvcc --version`. If your CUDA is older, either upgrade the driver or install faster-whisper in a conda environment with `cudatoolkit=11.8`.
- Metal model export fails: Ensure Xcode Command Line Tools are installed (run `xcode-select --install`). The Core ML export script requires the `coremltools` Python package: `pip install coremltools`.
- Hallucination on silence: Both tools can produce repeated filler tokens on silent audio segments. Use `--no-speech-threshold 0.6` in whisper.cpp stream mode, or `vad_filter=True` in faster-whisper's `model.transcribe()` to skip silent segments automatically (see the sketch after this list).
- Out of memory on large-v3: Switch to int8 quantization in faster-whisper (`compute_type="int8"`), which reduces VRAM from ~5 GB (float16) to ~2.5 GB. In whisper.cpp, use a quantized GGML variant (e.g., `ggml-large-v3-q5_0.bin`), which cuts memory to ~3–4 GB.
- Garbled output on non-English audio: Do not use `.en` model variants (tiny.en, base.en) for non-English speech – they are English-only. Use the multilingual models (base, small, medium, large-v3) and set the language explicitly: `-l de` in whisper.cpp or `language="de"` in faster-whisper.
- Slow CPU inference: Ensure your CPU supports AVX2 instructions (required for optimized CPU inference). Check with `grep avx2 /proc/cpuinfo` on Linux or `sysctl machdep.cpu.features` on Mac. CPUs without AVX2 fall back to generic SIMD and will be 2–3× slower.
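For the hallucination-on-silence fix in faster-whisper, the VAD filter can be tuned via vad_parameters, a dict forwarded to Silero VAD (as documented in the faster-whisper README; the file name is illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

# Skip silent stretches entirely so the decoder never sees them;
# min_silence_duration_ms controls how long a pause must be to split on.
segments, _ = model.transcribe(
    "noisy_interview.wav",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
for seg in segments:
    print(f"[{seg.start:.2f}s] {seg.text}")
```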
FAQ
Is the transcription accuracy the same between whisper.cpp and faster-whisper?
Yes. Both tools use the same OpenAI Whisper model weights – the model itself is identical. The difference is only in the inference runtime (C/C++ vs CTranslate2). WER on the same audio file will be within 0.1% absolute of each other, which is within normal variation from decoding settings (beam size, temperature fallback), not a difference in the model.
Can I use faster-whisper on a Mac with Apple Silicon?
Yes, but it runs CPU-only – faster-whisper has no Metal support. On an M5 Pro, faster-whisper large-v3 runs at ~3× real-time (CPU int8), compared to whisper.cpp's ~10× real-time with Metal. For most Mac users, whisper.cpp is 3× faster for the same model. The only reason to use faster-whisper on Mac is if your Python pipeline already depends on it and speed is not critical.
What Whisper model size should I use for a voice assistant?
For real-time voice interfaces, Whisper small is the standard recommendation – 3.4% WER on clean English, ~200 ms STT latency on a modern CPU or GPU, and it fits in 2 GB RAM. Use tiny if you are on very constrained hardware (Raspberry Pi Zero 2 W, older phones) and can tolerate ~7.6% WER. Use medium or large-v3 only for batch transcription where latency is not a constraint.
Does whisper.cpp support languages other than English?
Yes. All Whisper multilingual models (base, small, medium, large-v3) support 99 languages. Add `-l [language code]` to the CLI: `-l de` for German, `-l fr` for French, `-l ja` for Japanese, etc. The tiny.en and base.en models are English-only and slightly more accurate for English than their multilingual equivalents.
How do I install faster-whisper with CUDA support?
Install with `pip install faster-whisper`. CUDA support requires CUDA 11.8 or later and cuDNN 8.x installed on your system. Verify your CUDA version with `nvcc --version`. Then specify `device="cuda"` when loading the model: `WhisperModel("large-v3", device="cuda", compute_type="int8")`. If CUDA is not detected, faster-whisper falls back to CPU automatically.
Which is more accurate β whisper.cpp or faster-whisper?
Identical. Both tools use the same OpenAI Whisper model weights and produce the same WER on any given audio file. The difference between whisper.cpp and faster-whisper is speed and platform support, not transcription accuracy. Any WER difference you measure between runs comes from decoding settings (beam size, temperature fallback), not from the runtime itself.
Can I run Whisper large-v3 on 8 GB RAM?
Yes on GPU – large-v3 int8 in faster-whisper uses ~2.5 GB VRAM and runs on any 8 GB GPU. On CPU-only hardware, 8 GB RAM is tight for large-v3 (float32 uses ~10 GB). Use medium (5 GB RAM) or small (2 GB RAM) on CPU-only systems. whisper.cpp is more memory-efficient on CPU than faster-whisper due to lower runtime overhead.
How much does local Whisper cost vs cloud STT?
Zero ongoing cost. Cloud STT services charge $0.006–$0.024 per audio minute; for a developer transcribing 8 hours of meetings per week (~2,080 minutes/month), that is roughly $12–$50/month. Local Whisper runs on hardware you already own, with no per-minute fees, no API key management, and no audio data leaving your machine.
Sources
- whisper.cpp on GitHub – Source, build instructions, model download scripts, and Metal/Core ML setup guide.
- faster-whisper on GitHub – Source, Python API documentation, and benchmark results.
- distil-whisper/distil-large-v3 on Hugging Face – Model card, benchmark results, and usage instructions for the distilled Whisper variant.
- WhisperX on GitHub – Word-level timestamps and speaker diarization built on faster-whisper.
- insanely-fast-whisper on GitHub – Flash Attention 2 Whisper pipeline for maximum NVIDIA GPU throughput.
- OpenAI Whisper on GitHub – Original Whisper model, paper, and model cards for all sizes.
- OpenAI Whisper paper (Radford et al., 2022) – "Robust Speech Recognition via Large-Scale Weak Supervision." Source of WER figures.
- CTranslate2 documentation – Quantization details, hardware support, and int8 optimization rationale.