Key Takeaways
- whisper.cpp is the best local STT choice for Apple Silicon. The C/C++ port leverages Core ML and Apple Metal for hardware acceleration β large-v3 at ~10Γ real-time on M5 Pro, with no Python dependency required.
- faster-whisper is the best local STT choice for NVIDIA GPUs and Python pipelines. CTranslate2 int8 quantization cuts VRAM by ~40% and boosts throughput by ~4Γ over the original OpenAI implementation β large-v3 at ~12Γ real-time on RTX 4070, using only ~2.5 GB VRAM.
- Both tools use identical Whisper model weights from OpenAI. WER (word error rate) is the same for both β the difference is entirely in runtime performance and integration pathway, not transcription accuracy.
- Whisper large-v3 gives the best accuracy at 2.5% WER on English. For most production use cases, Whisper small (3.4% WER, 2 GB RAM) or medium (2.9% WER, 5 GB RAM) offers a better speed-accuracy trade-off.
- Real-time transcription is achievable with both tools β whisper.cpp via its
--streamflag, faster-whisper via its built-in VAD (voice activity detection) pipeline. Practical latency is 0.5β2 seconds behind live speech depending on model size. - whisper.cpp runs on CPU, Metal, CUDA, and Vulkan β making it the only choice for cross-platform embedded use (Raspberry Pi, Windows GPU setups, ARM servers). faster-whisper supports CPU and CUDA only (no Metal on Mac).
- For Raspberry Pi 5 and embedded Linux, whisper.cpp tiny and base on CPU are the real-time-capable models β tiny runs comfortably faster than real-time, base around real-time with
-t 4. The small model (3.4% WER) runs below real-time on the Pi 5 and is best for batch jobs. All inference is CPU (ARM NEON) β the Pi has no usable GPU path for Whisper.
Quick Facts
- Both tools: Based on OpenAI's open-source Whisper model (MIT license). Same accuracy β different runtimes.
- whisper.cpp: Written in C/C++ by ggerganov. Supports CPU (AVX2/NEON), CUDA, Metal (Apple), Vulkan. No Python required.
- faster-whisper: Python library using CTranslate2. Supports CPU (int8) and CUDA. No Apple Metal support.
- Whisper model sizes: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.55B). All use the same ggml / CTranslate2 format.
- Best model for most use cases: Whisper small β 3.4% WER, runs in 2 GB RAM, 6Γ real-time on modern CPU.
- RTX 4070 benchmark (large-v3): faster-whisper ~12Γ real-time; whisper.cpp CUDA ~8Γ real-time. faster-whisper wins on NVIDIA.
- M5 Pro benchmark (large-v3): whisper.cpp Metal ~10Γ real-time; faster-whisper CPU-only ~3Γ real-time. whisper.cpp wins on Apple.
Why Local Speech-to-Text?
Cloud STT services (Google Speech-to-Text, AWS Transcribe, Azure Speech) charge per audio minute β typically $0.006β$0.024/minute β and send your audio to remote servers. For privacy-sensitive applications (medical dictation, legal recordings, journalist interviews, corporate meetings), local transcription eliminates the data exposure entirely.
- Privacy: Audio never leaves your machine. No data-processing agreement needed for GDPR compliance β processing happens locally.
- Cost: Zero per-minute fees. A developer transcribing 8 hours of meetings per week saves $120β480/month at cloud STT pricing.
- Offline: Works on planes, in secure facilities, in areas without reliable internet. No API key management.
- Latency: No upload/download round-trip. For real-time voice interfaces, local processing reduces STT latency from 300β800 ms (cloud) to 50β300 ms.
- Customization: Fine-tune on domain-specific vocabulary. Run any model size that fits your hardware.
- Home assistant integration: Whisper running locally means wake words and voice commands never leave your home network. See local Whisper in Home Assistant β for the add-on setup that replaces cloud STT entirely.
Whisper Model Sizes β Foundation for Both Tools
Both whisper.cpp and faster-whisper use the same Whisper model weights, converted to their respective formats (GGML for whisper.cpp, CTranslate2 for faster-whisper). Choose your model size based on your VRAM/RAM budget and accuracy requirements.
| Model | Parameters | VRAM / RAM | English WER | Speed Factor (vs real-time on RTX 4070) |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 7.6% | ~32Γ |
| base | 74M | ~1 GB | 5.0% | ~16Γ |
| small | 244M | ~2 GB | 3.4% | ~6Γ |
| medium | 769M | ~5 GB | 2.9% | ~2Γ |
| large-v3 | 1.55B | ~10 GB | 2.5% | 1Γ (baseline) |
| distil-large-v3 | ~756M | ~4 GB | ~2.6% | ~6Γ |
WER (word error rate) figures from the Whisper paper on the LibriSpeech clean test set. Lower is better. Speed factors for faster-whisper int8 on RTX 4070. distil-large-v3 figures from the Distil-Whisper paper.
Distil-Whisper: The Faster Alternative
distil-whisper/distil-large-v3 is a distilled variant of large-v3 with ~50% fewer parameters, running ~6Γ faster while keeping WER within ~1% of the original.** It is the right choice when transcription speed matters more than squeezing out the last fraction of accuracy. distil-large-v3 works with both faster-whisper (native CTranslate2 support) and whisper.cpp (via GGML format conversion), so it integrates into whichever runtime you already use.
- Parameters: ~756M β roughly half of large-v3's 1.55B, fitting in ~4 GB VRAM instead of ~10 GB.
- Speed: ~6Γ real-time on RTX 4070 (vs. 1Γ baseline for large-v3) β comparable to the medium model in speed, with large-v3-level accuracy.
- WER: ~2.6% on English β only ~0.1% higher than large-v3's 2.5%. In practice, the difference is inaudible on typical speech.
- Compatibility: Works with faster-whisper natively (
WhisperModel("distil-large-v3", device="cuda", compute_type="int8")). For whisper.cpp, convert to GGML format using the distil-whisper GGML conversion script. - Best for: Batch transcription jobs, server deployments with limited VRAM, and any use case where you want large-v3 quality at medium-model speed.
- Not for: Multilingual transcription β distil-large-v3 is English-only. For other languages, use large-v3 or medium.
whisper.cpp β The C/C++ Port
whisper.cpp (by Georgi Gerganov) is a pure C/C++ reimplementation of OpenAI's Whisper model, optimized for low-resource and cross-platform inference. It requires no Python, no CUDA toolkit, and runs on virtually any hardware β from Raspberry Pi to Apple M5 Pro to Windows CUDA setups.
- Platform support: CPU (AVX2, AVX512, ARM NEON), Apple Metal (Core ML), CUDA (NVIDIA), Vulkan (AMD/Intel GPU), OpenCL.
- Apple Silicon advantage: whisper.cpp exports models to Core ML format, enabling inference on the Apple Neural Engine. Large-v3 runs at ~10Γ real-time on M5 Pro via Metal β faster than any cloud round-trip.
- Installation: Clone the repo, run
make(orcmake). Pre-built binaries available for common platforms. No Python dependency. - Model download:
bash ./models/download-ggml-model.sh base.enβ downloads the GGML-format model file (~142 MB for base). - CLI example:
./main -m models/ggml-base.bin -f audio.wavβ transcribes a WAV file to stdout. Add-l defor German. - Real-time stream mode:
./stream -m models/ggml-base.bin --step 3000 --length 10000β transcribes from microphone in 3-second chunks. - Python wrapper: pywhispercpp provides a Python binding for whisper.cpp, enabling use in Python pipelines without sacrificing Metal acceleration.
- Limitation: No native VAD (voice activity detection). Stream mode requires tuning
--stepand--lengthparameters for your use case.
# Build from source (macOS / Linux)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j4
# Download a model
bash ./models/download-ggml-model.sh large-v3
# Transcribe a file
./main -m models/ggml-large-v3.bin -f recording.wav
# Enable Metal on Apple Silicon (Core ML)
make -j4 WHISPER_COREML=1
./main -m models/ggml-large-v3-encoder.mlmodelc -f recording.wavwhisper.cpp Latest Version & Updates (June 2026)
The latest whisper.cpp release is v1.8.6, published June 2, 2026. It continues the v1.8.x maintenance line (v1.8.4 and v1.8.5) focused on streaming VAD, server stability, and performance β not new model support. faster-whisper's current release is v1.2.1 (October 31, 2025). Both runtimes still load the same OpenAI Whisper weights; nothing about transcription accuracy changed in 2026.
- whisper.cpp v1.8.6 (June 2, 2026): Reimplemented
ffmpeg-transcodefor clearer optional FFmpeg decoding of compressed audio (MP3, M4A) plus a CI examples-path fix. - whisper.cpp v1.8.5 (May 2026): Streaming VAD (voice activity detection) improvements, server parameter-leak and token-timestamp fixes, and memory-leak fixes in the Ruby and VAD bindings.
- whisper.cpp v1.8.4 (March 19, 2026): Latest
ggmlsync with broad performance gains, a new-g/--gpu-deviceflag (andGGML_CUDAdevice selection) to pick a GPU, and UTF-8 segment-wrapping fixes. - Native VAD is now built into whisper.cpp. Recent releases added a native voice-activity-detection path, narrowing a long-standing faster-whisper advantage β though faster-whisper's Silero VAD integration remains more turnkey.
- faster-whisper v1.2.1 (October 31, 2025): Current stable release;
pip install faster-whisperpulls it. No 2026 release as of this update β the project is stable, not stalled. - No new base Whisper model in 2026. large-v3 and distil-large-v3 remain the newest weights, so "updates" to either tool are runtime and tooling changes, not accuracy changes.
faster-whisper β The CTranslate2 Port
faster-whisper (by SYSTRAN) is a Python library that reimplements Whisper inference using CTranslate2 β a highly optimized C++ inference engine that supports int8 quantization, reducing VRAM usage and increasing throughput. On NVIDIA GPUs, faster-whisper is the fastest local Whisper implementation available.
- Platform support: CPU (int8 quantization) and NVIDIA CUDA GPU. No Apple Metal support β runs CPU-only on Mac.
- int8 advantage: CTranslate2 int8 quantization reduces VRAM by ~40% and increases inference speed by ~2Γ vs float16, with negligible WER impact (< 0.1% absolute).
- Installation:
pip install faster-whisperβ no compilation required. CUDA support requires CUDA 11.8+ and cuDNN 8.x. - Built-in VAD: faster-whisper includes Silero VAD integration, which automatically skips silent audio segments β critical for real-time transcription pipelines.
- Python-native: Direct Python API makes it trivial to chain with LLMs, audio processing libraries, and web frameworks.
- Speed: large-v3 int8 on RTX 4070 runs at ~12Γ real-time and uses ~2.5 GB VRAM. CPU int8 achieves ~20Γ real-time for the tiny model.
- Batch processing: faster-whisper supports batched inference for processing large audio archives efficiently.
- Limitation: No Metal support on Mac β runs CPU-only on Apple Silicon, achieving ~3Γ real-time for large-v3 vs. whisper.cpp's ~10Γ with Metal.
from faster_whisper import WhisperModel
# Load model (downloads automatically on first run)
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# Transcribe
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
print(f"[{segment.start:.2f}s β {segment.end:.2f}s] {segment.text}")Get ggml-base.bin via pip (Python Setup)
**To use ggml-base.bin from Python via pip, install the pywhispercpp binding β pip install pywhispercpp β then load Model("base"), which downloads ggml-base.bin automatically on first run.** There is no pip install whisper.cpp; pywhispercpp is the pip wrapper around the whisper.cpp C/C++ engine, and it keeps Metal and CUDA acceleration. The GGML model file (ggml-base.bin, ~142 MB) is downloaded separately from Hugging Face and cached locally β the pip package ships the inference engine, not the weights.
π In One Sentence
Run pip install pywhispercpp, then Model("base") in Python fetches and caches ggml-base.bin automatically on first use.
π¬ In Plain Terms
pip installs the program (pywhispercpp); the ggml-base.bin model file is a separate ~142 MB download that happens the first time you load the "base" model β or that you grab by hand with the download-ggml-model.sh script.
- Install the binding:
pip install pywhispercpp. For NVIDIA CUDA acceleration, build withGGML_CUDA=1 pip install git+https://github.com/absadiki/pywhispercpp. - Auto-download ggml-base.bin:
from pywhispercpp.model import Modelthenmodel = Model("base")downloadsggml-base.binfrom Hugging Face (ggerganov/whisper.cpp) on first run and caches it. Use"base.en"for the English-only variant. - Manual download (CLI route, no pip): From a cloned whisper.cpp repo,
bash ./models/download-ggml-model.sh basesaves the file tomodels/ggml-base.binβ the same file pywhispercpp fetches. - Direct download:
curl -L -o ggml-base.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.binpulls the model with no script. Swapbasefortiny,small,medium, orlarge-v3. - Do not confuse the packages:
pip install openai-whisperinstalls OpenAI's original PyTorch Whisper (it uses.ptcheckpoints, notggml-base.bin).pip install faster-whisperuses the CTranslate2 format. Onlypywhispercppconsumes GGMLggml-*.binfiles. - Where the file lives: pywhispercpp stores downloaded GGML models under its module cache (or a path you pass via
Model("base", models_dir="...")), so the sameggml-base.binis reused across runs β no re-download.
# 1. Install the pip binding for whisper.cpp
pip install pywhispercpp
# 2. ggml-base.bin downloads automatically on first use
# from pywhispercpp.model import Model
# model = Model("base") # fetches & caches ggml-base.bin (~142 MB)
# for seg in model.transcribe("audio.wav"):
# print(seg.text)
# --- Or get ggml-base.bin manually (no pip) ---
# From a cloned whisper.cpp checkout:
bash ./models/download-ggml-model.sh base # -> models/ggml-base.bin
# Or download the file directly from Hugging Face:
curl -L -o ggml-base.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.binHead-to-Head Benchmark Table
All benchmarks use the large-v3 model unless noted. Speed is measured in multiples of real-time (e.g., 10Γ means 60 minutes of audio transcribed in 6 minutes). VRAM figures for GPU runs; RAM figures for CPU runs.
π In One Sentence
On Apple Silicon, whisper.cpp with Metal runs large-v3 at ~10Γ real-time; on NVIDIA GPUs, faster-whisper with int8 runs at ~12Γ real-time β each tool wins decisively on its target platform.
π¬ In Plain Terms
Pick whisper.cpp on Mac (it uses the Apple Neural Engine), and faster-whisper on Windows/Linux with an NVIDIA GPU (it processes audio 12Γ faster than real-time while using 40% less GPU memory).
| Metric | whisper.cpp (large-v3) | faster-whisper (large-v3) |
|---|---|---|
| Platform / language | C/C++ (cross-platform) | Python (CTranslate2) |
| GPU support | CUDA, Metal, Vulkan | CUDA only |
| CPU optimization | AVX2, ARM NEON | int8 quantization |
| Speed β RTX 4070, large-v3 | ~8Γ real-time | ~12Γ real-time β |
| Speed β M5 Pro, large-v3 | ~10Γ real-time (Metal) β | ~3Γ real-time (CPU only) |
| Speed β CPU only (x86), base | ~15Γ real-time | ~20Γ real-time β |
| VRAM β large-v3, GPU | ~3 GB | ~2.5 GB (int8) β |
| Python integration | Needs wrapper (pywhispercpp) | Native β |
| VAD (silence detection) | Manual (--step tuning) | Built-in (Silero VAD) β |
| Real-time streaming | Yes (--stream flag) β | Yes (VAD pipeline) |
| WER accuracy (large-v3) | 2.5% (identical) | 2.5% (identical) |
| Python dependency | None β | Python 3.8+ |
| Raspberry Pi / embedded | Yes β C binary β | Limited β Python overhead |
| Output formats | SRT, VTT, JSON, CSV, txt | Python objects (start, end, text) |
whisper.cpp writes output directly to standard subtitle and transcript file formats (SRT, VTT, JSON, CSV, txt) β ideal for subtitle workflows where you need a file on disk with no additional code. faster-whisper yields a Python generator of segment objects with start, end, and text attributes β ideal for LLM pipeline chaining, where you pass segment text directly into a downstream model without writing intermediate files. For subtitle generation, whisper.cpp is simpler. For pipelines that process segments programmatically, faster-whisper is simpler.
Real-Time Transcription Setup
Real-time transcription processes audio in chunks as it arrives from a microphone, producing text with a short lag behind speech. Both tools support this, but with different trade-offs.
- whisper.cpp stream mode: Run
./stream -m models/ggml-small.bin --step 3000 --length 10000 -t 4. Processes 3-second audio chunks; ~0.5β1.5 second lag with the small model. No Python needed. - faster-whisper VAD pipeline: Use
vad_filter=Trueinmodel.transcribe(). Silero VAD automatically segments audio at silence boundaries β more natural chunks than fixed-length windows. - Practical latency: 0.5β2 seconds behind live speech with small or medium models. Use tiny for the lowest latency (< 0.5 seconds, but higher WER).
- Model selection for real-time: small or base is the practical sweet spot β fast enough to keep up with speech, accurate enough for clean audio. Avoid large-v3 for real-time unless you have a dedicated GPU.
- Microphone input: whisper.cpp reads raw audio via SDL2 or portaudio. faster-whisper reads audio arrays from any Python audio library (sounddevice, pyaudio, soundfile).
- Stability: whisper.cpp stream mode can produce repeated tokens ("hallucinate" short fillers) on silence. Suppress with
--suppress-blankand--no-speech-threshold.
Apple Silicon: whisper.cpp Wins
On M1, M2, M3, M4, and M5 Macs, whisper.cpp with Core ML / Metal acceleration is the correct tool β no question. faster-whisper has no Metal support and runs CPU-only on Mac, achieving roughly 3Γ real-time for large-v3. whisper.cpp with Metal achieves ~10Γ real-time on M5 Pro β a 3Γ speed advantage.
- Core ML export: Run
./models/generate-coreml-model.sh large-v3to export the encoder to Core ML format. This offloads encoder inference to the Apple Neural Engine. - M5 Pro benchmark (large-v3, Metal): ~10Γ real-time. 60 minutes of audio transcribes in ~6 minutes. Note: M5 Pro shipped March 2026 β these are early community benchmarks. Performance may improve with whisper.cpp updates optimizing for the M5 Neural Engine.
- M3 MacBook Air benchmark (large-v3, Metal): ~7Γ real-time. 60 minutes in ~8.5 minutes.
- Memory: Unified memory means no separate VRAM β a 16 GB M5 Pro can comfortably run large-v3 (~3 GB) alongside other processes.
- faster-whisper on Mac: CPU-only, int8. Large-v3 at ~3Γ real-time. Usable for batch overnight transcription but not for real-time or time-sensitive workflows.
- Recommendation: Use whisper.cpp for all Mac STT work. Add pywhispercpp if you need Python integration while retaining Metal acceleration.
NVIDIA GPU: faster-whisper Wins
On Windows and Linux with NVIDIA GPUs, faster-whisper is the superior choice. Its CTranslate2 CUDA backend is more optimized than whisper.cpp's CUDA path β ~12Γ vs. ~8Γ real-time for large-v3 on RTX 4070, with lower VRAM usage.
- RTX 4070 (12 GB) benchmark (large-v3 int8): ~12Γ real-time, ~2.5 GB VRAM.
- RTX 3060 (12 GB) benchmark (large-v3 int8): ~8Γ real-time, ~2.5 GB VRAM.
- RTX 4060 (8 GB) benchmark (large-v3 int8): ~7Γ real-time, ~2.5 GB VRAM β easily fits.
- int8 vs float16: int8 is ~2Γ faster and uses ~40% less VRAM with negligible accuracy loss. Always use
compute_type="int8"on NVIDIA. - Batch processing: faster-whisper's
batched=Trueparameter enables parallel processing of multiple audio files, maximizing GPU utilization for large transcription jobs. - Python pipeline integration: faster-whisper slots directly into LangChain, Haystack, and custom Python pipelines. No subprocess overhead vs. wrapping whisper.cpp.
Whisper on Raspberry Pi 5: tiny, base & small
On a Raspberry Pi 5, whisper.cpp runs the tiny and base models in real time for live transcription, while the small model runs below real time and is best kept to batch jobs. The Pi 5's quad-core Cortex-A76 has no usable GPU path for Whisper, so all inference is CPU (ARM NEON). Use the quantized GGML models (ggml-*-q5_0.bin) to cut memory use and speed inference.
π In One Sentence
On a Raspberry Pi 5, the whisper.cpp small model runs slower than real time (~0.5Γ, batch only); tiny and base are the models that keep up with live speech.
π¬ In Plain Terms
A Pi 5 can transcribe live audio with the tiny and base Whisper models, but the small model is too heavy to keep up in real time β use small for recorded files you process in the background, not for a live microphone.
- Whisper small on Pi 5: Roughly 0.4β0.6Γ real-time on CPU (a 10-minute clip takes ~17β25 minutes). Accurate at 3.4% WER but too slow for live captioning β use it for overnight/batch transcription.
- Whisper base on Pi 5: Around real-time to ~2Γ with multithreading (
-t 4). The practical sweet spot for live transcription on the Pi, with acceptable 5.0% WER. - Whisper tiny on Pi 5: Fastest option, comfortably faster than real-time for live use, at the cost of higher 7.6% WER. Best for wake-word and short-command recognition.
- Use quantized models:
bash ./models/download-ggml-model.sh small-q5_0trims RAM and speeds inference with minimal accuracy loss β important on the Pi 5's shared memory. - faster-whisper on Pi 5: Installable (
pip install faster-whisper, CPU int8) and competitive for batch jobs, but whisper.cpp's lighter C binary is the more common embedded choice. - Thermals: Sustained transcription pushes the Pi 5 CPU hard β use active cooling to avoid thermal throttling that erodes these numbers.
When to Use Which
A direct mapping from your scenario to the right tool:
π In One Sentence
Use whisper.cpp on Apple Silicon and embedded/cross-platform targets; use faster-whisper on NVIDIA GPUs and Python pipelines.
π¬ In Plain Terms
If you have a Mac, pick whisper.cpp β it's 3Γ faster than faster-whisper on Apple hardware. If you have an NVIDIA GPU and write Python, pick faster-whisper β it's faster and needs 40% less GPU memory.
| Scenario | Best Choice | Why |
|---|---|---|
| Apple Silicon Mac (any model) | whisper.cpp | Metal / Core ML acceleration β 3Γ faster than faster-whisper (CPU-only on Mac) |
| NVIDIA GPU server (Linux/Windows) | faster-whisper | CTranslate2 int8 β faster and lower VRAM than whisper.cpp CUDA path |
| Python data pipeline | faster-whisper | Native Python API; no subprocess wrapper; VAD built in |
| Raspberry Pi / embedded Linux | whisper.cpp | Pure C binary; no Python runtime overhead; ARM NEON optimized |
| Real-time voice assistant | whisper.cpp | Stream mode with low overhead; works without Python on Pi / embedded |
| Batch transcription (large audio archive) | faster-whisper | Batched inference, GPU utilization, Python async integration |
| AMD GPU (Vulkan) | whisper.cpp | Vulkan backend support; faster-whisper is CUDA-only |
| CPU-only Linux server | faster-whisper | int8 quantization gives ~30% speed advantage on x86 CPU |
Beyond whisper.cpp and faster-whisper
Two additional tools extend Whisper with capabilities that neither whisper.cpp nor faster-whisper provide out of the box: speaker diarization and extreme-speed batch GPU inference.
- WhisperX:** Built on top of faster-whisper, WhisperX adds word-level timestamps and speaker diarization β identifying which speaker said which words. Best for meeting transcription with speaker labels, podcast editing, and interview transcripts. Install with
pip install whisperxand provide a Hugging Face token for the diarization model. - insanely-fast-whisper:** A Hugging Face Transformers pipeline wrapper that adds Flash Attention 2 support for significantly faster GPU inference than standard faster-whisper on NVIDIA hardware. Best for batch processing large audio archives on NVIDIA GPUs. Requires a Flash Attention 2-compatible GPU (Ampere or newer: RTX 3000+, A100, H100).
Common Issues and Fixes
The most frequent setup and runtime problems, with direct fixes:
- CUDA version mismatch: faster-whisper requires CUDA 11.8 or later. Check with
nvcc --version. If your CUDA is older, either upgrade the driver or install faster-whisper in a conda environment withcudatoolkit=11.8. - Metal model export fails: Ensure Xcode Command Line Tools are installed β run
xcode-select --install. The Core ML export script requires thecoremltoolsPython package:pip install coremltools. - Hallucination on silence: Both tools can produce repeated filler tokens on silent audio segments. Use
--no-speech-threshold 0.6in whisper.cpp stream mode, orvad_filter=Truein faster-whisper'smodel.transcribe()to skip silent segments automatically. - Out of memory on large-v3: Switch to int8 quantization in faster-whisper (
compute_type="int8") β reduces VRAM from ~5 GB (float16) to ~2.5 GB. In whisper.cpp, use the quantized GGML variant (e.g.,ggml-large-v3-q5_0.bin) which cuts memory to ~3β4 GB. - Garbled output on non-English audio: Do not use
.enmodel variants (tiny.en, base.en) for non-English speech β they are English-only. Use the multilingual models (base, small, medium, large-v3) and set the language explicitly:-l dein whisper.cpp orlanguage="de"in faster-whisper. - Slow CPU inference: Ensure your CPU supports AVX2 instructions (required for optimized CPU inference). Check with
grep avx2 /proc/cpuinfoon Linux orsysctl machdep.cpu.featureson Mac. CPUs without AVX2 fall back to generic SIMD and will be 2β3Γ slower.
Frequently Asked Questions
Is the transcription accuracy the same between whisper.cpp and faster-whisper?
Yes. Both tools use the same OpenAI Whisper model weights β the model itself is identical. The difference is only in the inference runtime (C/C++ vs CTranslate2 Python). WER on the same audio file will be within 0.1% absolute of each other, which is within normal variation from beam search randomness.
Can I use faster-whisper on a Mac with Apple Silicon?
Yes, but it runs CPU-only β faster-whisper has no Metal support. On an M5 Pro, faster-whisper large-v3 runs at ~3Γ real-time (CPU int8), compared to whisper.cpp's ~10Γ real-time with Metal. For most Mac users, whisper.cpp is 3Γ faster for the same model. The only reason to use faster-whisper on Mac is if your Python pipeline already depends on it and speed is not critical.
What Whisper model size should I use for a voice assistant?
For real-time voice interfaces, Whisper small is the standard recommendation β 3.4% WER on clean English, ~200 ms STT latency on a modern CPU or GPU, and fits in 2 GB RAM. Use tiny if you are on very constrained hardware (Raspberry Pi Zero 2W, older phones) and can tolerate ~7.6% WER. Use medium or large-v3 only for batch transcription where latency is not a constraint.
Does whisper.cpp support languages other than English?
Yes. All Whisper multilingual models (base, small, medium, large-v3) support 99 languages. Add `-l [language code] to the CLI: -l de for German, -l fr for French, -l ja` for Japanese, etc. The tiny.en and base.en models are English-only and slightly more accurate for English than their multilingual equivalents.
How do I install faster-whisper with CUDA support?
Install with pip install faster-whisper. CUDA support requires CUDA 11.8 or later and cuDNN 8.x installed on your system. Verify your CUDA version with nvcc --version. Then specify device="cuda" when loading the model: WhisperModel("large-v3", device="cuda", compute_type="int8"). If CUDA is not detected, faster-whisper falls back to CPU automatically.
Which is more accurate β whisper.cpp or faster-whisper?
Identical. Both tools use the same OpenAI Whisper model weights and produce the same WER on any given audio file. The difference between whisper.cpp and faster-whisper is speed and platform support, not transcription accuracy. Any WER difference you measure between runs is within normal variation from beam search, not from the runtime itself.
Can I run Whisper large-v3 on 8 GB RAM?
Yes on GPU β large-v3 int8 in faster-whisper uses ~2.5 GB VRAM and runs on any 8 GB GPU. On CPU-only hardware, 8 GB RAM is tight for large-v3 (float32 uses ~10 GB). Use medium (5 GB RAM) or small (2 GB RAM) on CPU-only systems. whisper.cpp is more memory-efficient on CPU than faster-whisper due to lower runtime overhead.
How much does local Whisper cost vs cloud STT?
Zero ongoing cost. Cloud STT services charge $0.006β$0.024 per audio minute β for a developer transcribing 8 hours of meetings per week, that's $120β480/month. Local Whisper runs on hardware you already own, with no per-minute fees, no API key management, and no audio data leaving your machine.
Sources
- whisper.cpp on GitHub β Source, build instructions, model download scripts, and Metal/Core ML setup guide.
- faster-whisper on GitHub β Source, Python API documentation, and benchmark results.
- distil-whisper/distil-large-v3 on Hugging Face β Model card, benchmark results, and usage instructions for the distilled Whisper variant.
- WhisperX on GitHub β Word-level timestamps and speaker diarization built on faster-whisper.
- insanely-fast-whisper on GitHub β Flash Attention 2 Whisper pipeline for maximum NVIDIA GPU throughput.
- OpenAI Whisper on GitHub β Original Whisper model, paper, and model cards for all sizes.
- OpenAI Whisper paper (Radford et al., 2022) β "Robust Speech Recognition via Large-Scale Weak Supervision." Source of WER figures.
- CTranslate2 documentation β Quantization details, hardware support, and int8 optimization rationale.