Voice, Speech & Multimodal

Whisper.cpp vs faster-whisper 2026: Local STT Benchmarks, Setup & GPU Acceleration

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

whisper.cpp and faster-whisper – the two dominant local Whisper runtimes – each win decisively on their target platform. For Apple Silicon (M-series Macs), whisper.cpp with Metal acceleration is the fastest local STT option in 2026 – large-v3 runs at ~10× real-time on an M5 Pro. For NVIDIA GPU servers and Python pipelines, faster-whisper with CTranslate2 int8 quantization is the better choice, achieving ~12× real-time on an RTX 4070 with 2.5 GB VRAM for the large-v3 model. Both tools use the same underlying Whisper models from OpenAI (tiny through large-v3); the difference is runtime optimization and integration path. On CPU-only hardware, both are usable with the tiny and base models – faster-whisper has a slight edge on CPU via int8 (~20× vs. ~15× real-time for base).

whisper.cpp and faster-whisper are the two dominant implementations of OpenAI's Whisper speech-to-text model for local, offline transcription in 2026. whisper.cpp is a pure C/C++ port that runs on Apple Metal, CUDA, Vulkan, and CPU – making it ideal for Apple Silicon, embedded systems, and real-time voice applications. faster-whisper is a Python library using CTranslate2 that achieves ~4× the throughput of the original Whisper on NVIDIA GPUs via int8 quantization. This guide covers installation, performance benchmarks, real-time transcription setup, and a head-to-head comparison across platforms so you can pick the right tool for your pipeline.

Key Takeaways

  • whisper.cpp is the best local STT choice for Apple Silicon. The C/C++ port leverages Core ML and Apple Metal for hardware acceleration – large-v3 at ~10× real-time on M5 Pro, with no Python dependency required.
  • faster-whisper is the best local STT choice for NVIDIA GPUs and Python pipelines. CTranslate2 int8 quantization cuts VRAM by ~40% and boosts throughput by ~4× over the original OpenAI implementation – large-v3 at ~12× real-time on RTX 4070, using only ~2.5 GB VRAM.
  • Both tools use identical Whisper model weights from OpenAI. WER (word error rate) is the same for both – the difference is entirely in runtime performance and integration path, not transcription accuracy.
  • Whisper large-v3 gives the best accuracy at 2.5% WER on English. For most production use cases, Whisper small (3.4% WER, 2 GB RAM) or medium (2.9% WER, 5 GB RAM) offers a better speed-accuracy trade-off.
  • Real-time transcription is achievable with both tools – whisper.cpp via its stream tool, faster-whisper via its built-in VAD (voice activity detection) pipeline. Practical latency is 0.5–2 seconds behind live speech depending on model size.
  • whisper.cpp runs on CPU, Metal, CUDA, and Vulkan – making it the only choice for cross-platform and embedded use (Raspberry Pi, Windows GPU setups, ARM servers). faster-whisper supports CPU and CUDA only (no Metal on Mac).
  • For Raspberry Pi and embedded Linux, whisper.cpp tiny/base on CPU is the practical ceiling – tiny at ~15× real-time on Pi 5, base at ~6× real-time. Both fit within 1 GB RAM.

Quick Facts

  • Both tools: Based on OpenAI's open-source Whisper model (MIT license). Same accuracy – different runtimes.
  • whisper.cpp: Written in C/C++ by ggerganov. Supports CPU (AVX2/NEON), CUDA, Metal (Apple), Vulkan. No Python required.
  • faster-whisper: Python library using CTranslate2. Supports CPU (int8) and CUDA. No Apple Metal support.
  • Whisper model sizes: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.55B). All use the same ggml / CTranslate2 format.
  • Best model for most use cases: Whisper small – 3.4% WER, runs in 2 GB RAM, 6× real-time on a modern CPU.
  • RTX 4070 benchmark (large-v3): faster-whisper ~12× real-time; whisper.cpp CUDA ~8× real-time. faster-whisper wins on NVIDIA.
  • M5 Pro benchmark (large-v3): whisper.cpp Metal ~10× real-time; faster-whisper CPU-only ~3× real-time. whisper.cpp wins on Apple.

Why Local Speech-to-Text?

Cloud STT services (Google Speech-to-Text, AWS Transcribe, Azure Speech) charge per audio minute – typically $0.006–$0.024/minute – and send your audio to remote servers. For privacy-sensitive applications (medical dictation, legal recordings, journalist interviews, corporate meetings), local transcription eliminates the data exposure entirely.

  • Privacy: Audio never leaves your machine. No data-processing agreement needed for GDPR compliance – processing happens locally.
  • Cost: Zero per-minute fees. A developer transcribing 8 hours of meetings per week saves roughly $12–50/month at cloud STT pricing (worked out in the sketch after this list).
  • Offline: Works on planes, in secure facilities, in areas without reliable internet. No API key management.
  • Latency: No upload/download round-trip. For real-time voice interfaces, local processing reduces STT latency from 300–800 ms (cloud) to 50–300 ms.
  • Customization: Fine-tune on domain-specific vocabulary. Run any model size that fits your hardware.
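
The savings figure is straightforward arithmetic over the per-minute prices quoted above. A quick sanity check, assuming 8 hours of audio per week and ~4.33 weeks per month (nothing library-specific here):

```python
# Back-of-the-envelope cloud STT cost for 8 hours of audio per week,
# at the $0.006–$0.024 per-minute price range quoted above.
minutes_per_month = 8 * 60 * 52 / 12   # ~2,080 audio minutes per month
for price_per_min in (0.006, 0.024):
    print(f"${price_per_min}/min -> ${minutes_per_month * price_per_min:.0f}/month")
# $0.006/min -> $12/month
# $0.024/min -> $50/month
```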

Whisper Model Sizes – Foundation for Both Tools

Both whisper.cpp and faster-whisper use the same Whisper model weights, converted to their respective formats (GGML for whisper.cpp, CTranslate2 for faster-whisper). Choose your model size based on your VRAM/RAM budget and accuracy requirements.

| Model | Parameters | VRAM / RAM | English WER | Speed factor (vs. real-time on RTX 4070) |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 7.6% | ~32× |
| base | 74M | ~1 GB | 5.0% | ~16× |
| small | 244M | ~2 GB | 3.4% | ~6× |
| medium | 769M | ~5 GB | 2.9% | ~2× |
| large-v3 | 1.55B | ~10 GB | 2.5% | 1× (baseline) |
| distil-large-v3 | ~756M | ~4 GB | ~2.6% | ~6× |

WER (word error rate) figures from the Whisper paper on the LibriSpeech clean test set. Lower is better. Speed factors for faster-whisper int8 on RTX 4070. distil-large-v3 figures from the Distil-Whisper paper.
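
If you pick the model programmatically rather than by hand, the table above collapses into a small lookup. The helper below is purely illustrative (it is not part of whisper.cpp or faster-whisper); the memory figures come straight from the table:

```python
# Illustrative helper (not part of either library): pick the largest
# Whisper model whose approximate memory footprint fits a given budget.
# Figures are the VRAM/RAM estimates from the table above.
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

def pick_model(budget_gb: float) -> str:
    fitting = [name for name, gb in MODEL_MEMORY_GB.items() if gb <= budget_gb]
    return fitting[-1] if fitting else "tiny"  # dicts preserve insertion order

print(pick_model(4))   # -> small
print(pick_model(16))  # -> large-v3
```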

Distil-Whisper: The Faster Alternative

distil-whisper/distil-large-v3 is a distilled variant of large-v3 with ~50% fewer parameters, running ~6× faster while keeping WER within ~1% of the original. It is the right choice when transcription speed matters more than squeezing out the last fraction of accuracy. distil-large-v3 works with both faster-whisper (native CTranslate2 support) and whisper.cpp (via GGML format conversion), so it integrates into whichever runtime you already use.

  • Parameters: ~756M – roughly half of large-v3's 1.55B, fitting in ~4 GB VRAM instead of ~10 GB.
  • Speed: ~6× real-time on RTX 4070 (vs. 1× baseline for large-v3) – comparable to the medium model in speed, with large-v3-level accuracy.
  • WER: ~2.6% on English – only ~0.1% higher than large-v3's 2.5%. In practice, the difference is inaudible on typical speech.
  • Compatibility: Works with faster-whisper natively (WhisperModel("distil-large-v3", device="cuda", compute_type="int8") – see the sketch after this list). For whisper.cpp, convert to GGML format using the distil-whisper GGML conversion script.
  • Best for: Batch transcription jobs, server deployments with limited VRAM, and any use case where you want large-v3 quality at medium-model speed.
  • Not for: Multilingual transcription – distil-large-v3 is English-only. For other languages, use large-v3 or medium.
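
In faster-whisper, switching to the distilled model is a one-line change from the large-v3 example later in this guide. A minimal sketch, assuming the model name resolves to the published CTranslate2 conversion (it does in recent faster-whisper releases):

```python
from faster_whisper import WhisperModel

# distil-large-v3 loads by name just like the full-size models (English-only).
model = WhisperModel("distil-large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```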

whisper.cpp – The C/C++ Port

whisper.cpp (by Georgi Gerganov) is a pure C/C++ reimplementation of OpenAI's Whisper model, optimized for low-resource and cross-platform inference. It requires no Python, no CUDA toolkit, and runs on virtually any hardware – from Raspberry Pi to Apple M5 Pro to Windows CUDA setups.

  • Platform support: CPU (AVX2, AVX512, ARM NEON), Apple Metal (Core ML), CUDA (NVIDIA), Vulkan (AMD/Intel GPU), OpenCL.
  • Apple Silicon advantage: whisper.cpp exports models to Core ML format, enabling encoder inference on the Apple Neural Engine. large-v3 runs at ~10× real-time on M5 Pro via Metal – faster than any cloud round-trip.
  • Installation: Clone the repo, run make (or cmake). Pre-built binaries available for common platforms. No Python dependency.
  • Model download: bash ./models/download-ggml-model.sh base.en – downloads the GGML-format model file (~142 MB for base).
  • CLI example: ./main -m models/ggml-base.bin -f audio.wav – transcribes a WAV file to stdout. Add -l de for German.
  • Real-time stream mode: ./stream -m models/ggml-base.bin --step 3000 --length 10000 – transcribes from microphone in 3-second chunks.
  • Python wrapper: pywhispercpp provides a Python binding for whisper.cpp, enabling use in Python pipelines without sacrificing Metal acceleration (see the sketch after the build commands below).
  • Limitation: No native VAD (voice activity detection). Stream mode requires tuning --step and --length parameters for your use case.
```bash
# Build from source (macOS / Linux)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j4

# Download a model
bash ./models/download-ggml-model.sh large-v3

# Transcribe a file
./main -m models/ggml-large-v3.bin -f recording.wav

# Optional: rebuild with Core ML support (Apple Neural Engine)
./models/generate-coreml-model.sh large-v3
make clean && make -j4 WHISPER_COREML=1
# Still pass the GGML model; the Core ML encoder
# (models/ggml-large-v3-encoder.mlmodelc) is loaded automatically
# when it sits next to the GGML file.
./main -m models/ggml-large-v3.bin -f recording.wav
```
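
If you want whisper.cpp's Metal acceleration from Python, pywhispercpp wraps the same GGML models. A minimal sketch based on its documented Model API (assumes pip install pywhispercpp; exact parameters may vary by release):

```python
from pywhispercpp.model import Model

# Runs the whisper.cpp backend (Metal-accelerated on Apple Silicon builds).
model = Model("base.en", n_threads=4)

segments = model.transcribe("recording.wav")
for segment in segments:
    print(segment.text)
```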

faster-whisper – The CTranslate2 Port

faster-whisper (by SYSTRAN) is a Python library that reimplements Whisper inference using CTranslate2 – a highly optimized C++ inference engine that supports int8 quantization, reducing VRAM usage and increasing throughput. On NVIDIA GPUs, faster-whisper is the fastest local Whisper implementation available.

  • Platform support: CPU (int8 quantization) and NVIDIA CUDA GPU. No Apple Metal support – runs CPU-only on Mac.
  • int8 advantage: CTranslate2 int8 quantization reduces VRAM by ~40% and increases inference speed by ~2× vs. float16, with negligible WER impact (< 0.1% absolute).
  • Installation: pip install faster-whisper – no compilation required. CUDA support requires CUDA 11.8+ and cuDNN 8.x.
  • Built-in VAD: faster-whisper includes Silero VAD integration, which automatically skips silent audio segments – critical for real-time transcription pipelines.
  • Python-native: Direct Python API makes it trivial to chain with LLMs, audio processing libraries, and web frameworks.
  • Speed: large-v3 int8 on RTX 4070 runs at ~12× real-time and uses ~2.5 GB VRAM. CPU int8 achieves ~20× real-time for the tiny model.
  • Batch processing: faster-whisper supports batched inference for processing large audio archives efficiently.
  • Limitation: No Metal support on Mac – runs CPU-only on Apple Silicon, achieving ~3× real-time for large-v3 vs. whisper.cpp's ~10× with Metal.
```python
from faster_whisper import WhisperModel

# Load model (downloads automatically on first run)
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Transcribe
segments, info = model.transcribe("audio.wav", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
```

Head-to-Head Benchmark Table

All benchmarks use the large-v3 model unless noted. Speed is measured in multiples of real-time (e.g., 10× means 60 minutes of audio transcribed in 6 minutes). VRAM figures for GPU runs; RAM figures for CPU runs.

πŸ“ In One Sentence

On Apple Silicon, whisper.cpp with Metal runs large-v3 at ~10× real-time; on NVIDIA GPUs, faster-whisper with int8 runs at ~12× real-time – each tool wins decisively on its target platform.

💬 In Plain Terms

Pick whisper.cpp on Mac (it uses the Apple Neural Engine), and faster-whisper on Windows/Linux with an NVIDIA GPU (it processes audio 12× faster than real-time while using 40% less GPU memory).

| Metric | whisper.cpp (large-v3) | faster-whisper (large-v3) |
|---|---|---|
| Platform / language | C/C++ (cross-platform) | Python (CTranslate2) |
| GPU support | CUDA, Metal, Vulkan | CUDA only |
| CPU optimization | AVX2, ARM NEON | int8 quantization |
| Speed – RTX 4070, large-v3 | ~8× real-time | ~12× real-time ✓ |
| Speed – M5 Pro, large-v3 | ~10× real-time (Metal) ✓ | ~3× real-time (CPU only) |
| Speed – CPU only (x86), base | ~15× real-time | ~20× real-time ✓ |
| VRAM – large-v3, GPU | ~3 GB | ~2.5 GB (int8) ✓ |
| Python integration | Needs wrapper (pywhispercpp) | Native ✓ |
| VAD (silence detection) | Manual (--step tuning) | Built-in (Silero VAD) ✓ |
| Real-time streaming | Yes (stream tool) ✓ | Yes (VAD pipeline) |
| WER accuracy (large-v3) | 2.5% (identical) | 2.5% (identical) |
| Python dependency | None ✓ | Python 3.8+ |
| Raspberry Pi / embedded | Yes – C binary ✓ | Limited – Python overhead |
| Output formats | SRT, VTT, JSON, CSV, txt | Python objects (start, end, text) |

whisper.cpp writes output directly to standard subtitle and transcript file formats (SRT, VTT, JSON, CSV, txt) – ideal for subtitle workflows where you need a file on disk with no additional code. faster-whisper yields a Python generator of segment objects with start, end, and text attributes – ideal for LLM pipeline chaining, where you pass segment text directly into a downstream model without writing intermediate files. For subtitle generation, whisper.cpp is simpler. For pipelines that process segments programmatically, faster-whisper is simpler – and if you need subtitle files anyway, converting segments to SRT takes only a few lines, as sketched below.
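
A hypothetical write_srt helper (the function names and structure are ours, not part of faster-whisper) that serializes segments with only the standard library:

```python
# Hypothetical helper: serialize faster-whisper segments to an SRT file.
def fmt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{fmt_time(seg.start)} --> {fmt_time(seg.end)}\n")
            f.write(f"{seg.text.strip()}\n\n")
```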

Real-Time Transcription Setup

Real-time transcription processes audio in chunks as it arrives from a microphone, producing text with a short lag behind speech. Both tools support this, but with different trade-offs.

  • whisper.cpp stream mode: Run ./stream -m models/ggml-small.bin --step 3000 --length 10000 -t 4. Processes 3-second audio chunks; ~0.5–1.5 second lag with the small model. No Python needed.
  • faster-whisper VAD pipeline: Use vad_filter=True in model.transcribe(). Silero VAD automatically segments audio at silence boundaries – more natural chunks than fixed-length windows (see the sketch after this list).
  • Practical latency: 0.5–2 seconds behind live speech with small or medium models. Use tiny for the lowest latency (< 0.5 seconds, but higher WER).
  • Model selection for real-time: small or base is the practical sweet spot – fast enough to keep up with speech, accurate enough for clean audio. Avoid large-v3 for real-time unless you have a dedicated GPU.
  • Microphone input: whisper.cpp reads raw audio via SDL2 or portaudio. faster-whisper reads audio arrays from any Python audio library (sounddevice, pyaudio, soundfile).
  • Stability: whisper.cpp stream mode can produce repeated tokens ("hallucinate" short fillers) on silence. Suppress with --suppress-blank and --no-speech-threshold.
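
A minimal microphone loop with faster-whisper's VAD filter might look like the sketch below. It records fixed 5-second chunks via the sounddevice library (our choice for illustration; any library that yields 16 kHz mono float32 arrays works) and is deliberately naive – a production pipeline would record and transcribe concurrently instead of alternating:

```python
# Naive capture-then-transcribe loop: record 5-second chunks from the
# default microphone, then run each through faster-whisper with VAD.
# Assumes: pip install faster-whisper sounddevice
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000        # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 5

model = WhisperModel("small", device="cpu", compute_type="int8")

while True:
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the chunk is fully recorded
    # transcribe() accepts a float32 NumPy array directly; VAD skips silence
    segments, _ = model.transcribe(audio.flatten(), vad_filter=True)
    for segment in segments:
        print(segment.text.strip())
```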

Apple Silicon: whisper.cpp Wins

On M1, M2, M3, M4, and M5 Macs, whisper.cpp with Core ML / Metal acceleration is the correct tool – no question. faster-whisper has no Metal support and runs CPU-only on Mac, achieving roughly 3× real-time for large-v3. whisper.cpp with Metal achieves ~10× real-time on M5 Pro – a 3× speed advantage.

  • Core ML export: Run ./models/generate-coreml-model.sh large-v3 to export the encoder to Core ML format. This offloads encoder inference to the Apple Neural Engine.
  • M5 Pro benchmark (large-v3, Metal): ~10× real-time. 60 minutes of audio transcribes in ~6 minutes. Note: the M5 Pro shipped March 2026 – these are early community benchmarks, and performance may improve as whisper.cpp optimizes for the M5 Neural Engine.
  • M3 MacBook Air benchmark (large-v3, Metal): ~7× real-time. 60 minutes in ~8.5 minutes.
  • Memory: Unified memory means no separate VRAM – a 16 GB M5 Pro can comfortably run large-v3 (~3 GB) alongside other processes.
  • faster-whisper on Mac: CPU-only, int8. large-v3 at ~3× real-time. Usable for batch overnight transcription but not for real-time or time-sensitive workflows.
  • Recommendation: Use whisper.cpp for all Mac STT work. Add pywhispercpp if you need Python integration while retaining Metal acceleration.

NVIDIA GPU: faster-whisper Wins

On Windows and Linux with NVIDIA GPUs, faster-whisper is the superior choice. Its CTranslate2 CUDA backend is more optimized than whisper.cpp's CUDA path – ~12× vs. ~8× real-time for large-v3 on RTX 4070, with lower VRAM usage.

  • RTX 4070 (12 GB) benchmark (large-v3 int8): ~12× real-time, ~2.5 GB VRAM.
  • RTX 3060 (12 GB) benchmark (large-v3 int8): ~8× real-time, ~2.5 GB VRAM.
  • RTX 4060 (8 GB) benchmark (large-v3 int8): ~7× real-time, ~2.5 GB VRAM – easily fits.
  • int8 vs float16: int8 is ~2× faster and uses ~40% less VRAM with negligible accuracy loss. Always use compute_type="int8" on NVIDIA.
  • Batch processing: faster-whisper's BatchedInferencePipeline wrapper decodes audio chunks in parallel, maximizing GPU utilization for large transcription jobs (see the sketch after this list).
  • Python pipeline integration: faster-whisper slots directly into LangChain, Haystack, and custom Python pipelines. No subprocess overhead vs. wrapping whisper.cpp.
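
A sketch of the batched path, assuming faster-whisper ≥ 1.0 (the release line that ships BatchedInferencePipeline):

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

# batch_size controls how many audio chunks are decoded in parallel on the GPU.
segments, info = batched.transcribe("long_recording.wav", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```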

When to Use Which

A direct mapping from your scenario to the right tool:

πŸ“ In One Sentence

Use whisper.cpp on Apple Silicon and embedded/cross-platform targets; use faster-whisper on NVIDIA GPUs and Python pipelines.

💬 In Plain Terms

If you have a Mac, pick whisper.cpp – it's 3× faster than faster-whisper on Apple hardware. If you have an NVIDIA GPU and write Python, pick faster-whisper – it's faster and needs 40% less GPU memory.

| Scenario | Best Choice | Why |
|---|---|---|
| Apple Silicon Mac (any model) | whisper.cpp | Metal / Core ML acceleration – 3× faster than faster-whisper (CPU-only on Mac) |
| NVIDIA GPU server (Linux/Windows) | faster-whisper | CTranslate2 int8 – faster and lower VRAM than whisper.cpp CUDA path |
| Python data pipeline | faster-whisper | Native Python API; no subprocess wrapper; VAD built in |
| Raspberry Pi / embedded Linux | whisper.cpp | Pure C binary; no Python runtime overhead; ARM NEON optimized |
| Real-time voice assistant | whisper.cpp | Stream mode with low overhead; works without Python on Pi / embedded |
| Batch transcription (large audio archive) | faster-whisper | Batched inference, GPU utilization, Python async integration |
| AMD GPU (Vulkan) | whisper.cpp | Vulkan backend support; faster-whisper is CUDA-only |
| CPU-only Linux server | faster-whisper | int8 quantization gives ~30% speed advantage on x86 CPU |

Beyond whisper.cpp and faster-whisper

Two additional tools extend Whisper with capabilities that neither whisper.cpp nor faster-whisper provide out of the box: speaker diarization and extreme-speed batch GPU inference.

  • WhisperX: Built on top of faster-whisper, WhisperX adds word-level timestamps and speaker diarization – identifying which speaker said which words. Best for meeting transcription with speaker labels, podcast editing, and interview transcripts. Install with pip install whisperx and provide a Hugging Face token for the diarization model (see the sketch after this list).
  • insanely-fast-whisper: A Hugging Face Transformers pipeline wrapper that adds Flash Attention 2 support for significantly faster GPU inference than standard faster-whisper on NVIDIA hardware. Best for batch processing large audio archives on NVIDIA GPUs. Requires a Flash Attention 2-compatible GPU (Ampere or newer: RTX 3000+, A100, H100).
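
For orientation, a condensed sketch of the WhisperX flow (API names follow its README as of recent releases; the Hugging Face token is a placeholder):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.wav")

# 1. Transcribe with the faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type="int8")
result = model.transcribe(audio, batch_size=16)

# 2. Align for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and attach speaker labels (needs a Hugging Face token)
diarize = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
result = whisperx.assign_word_speakers(diarize(audio), result)
```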

Common Issues and Fixes

The most frequent setup and runtime problems, with direct fixes:

  • CUDA version mismatch: faster-whisper requires CUDA 11.8 or later. Check with nvcc --version. If your CUDA is older, either upgrade the driver or install faster-whisper in a conda environment with cudatoolkit=11.8.
  • Metal model export fails: Ensure Xcode Command Line Tools are installed – run xcode-select --install. The Core ML export script requires the coremltools Python package: pip install coremltools.
  • Hallucination on silence: Both tools can produce repeated filler tokens on silent audio segments. Use --no-speech-threshold 0.6 in whisper.cpp stream mode, or vad_filter=True in faster-whisper's model.transcribe() to skip silent segments automatically.
  • Out of memory on large-v3: Switch to int8 quantization in faster-whisper (compute_type="int8") – reduces VRAM from ~5 GB (float16) to ~2.5 GB. In whisper.cpp, use a quantized GGML variant (e.g., ggml-large-v3-q5_0.bin), which cuts memory to ~3–4 GB.
  • Garbled output on non-English audio: Do not use .en model variants (tiny.en, base.en) for non-English speech – they are English-only. Use the multilingual models (base, small, medium, large-v3) and set the language explicitly: -l de in whisper.cpp or language="de" in faster-whisper.
  • Slow CPU inference: Ensure your CPU supports AVX2 instructions (required for optimized CPU inference). Check with grep avx2 /proc/cpuinfo on Linux or sysctl machdep.cpu.features on Mac. CPUs without AVX2 fall back to generic SIMD and will be 2–3× slower.

FAQ

Is the transcription accuracy the same between whisper.cpp and faster-whisper?

Yes. Both tools use the same OpenAI Whisper model weights – the model itself is identical. The difference is only in the inference runtime (C/C++ vs. CTranslate2 Python). WER on the same audio file will be within 0.1% absolute of each other, which is within normal variation from decoding settings and quantization.

Can I use faster-whisper on a Mac with Apple Silicon?

Yes, but it runs CPU-only – faster-whisper has no Metal support. On an M5 Pro, faster-whisper large-v3 runs at ~3× real-time (CPU int8), compared to whisper.cpp's ~10× real-time with Metal. For most Mac users, whisper.cpp is 3× faster for the same model. The only reason to use faster-whisper on Mac is if your Python pipeline already depends on it and speed is not critical.

What Whisper model size should I use for a voice assistant?

For real-time voice interfaces, Whisper small is the standard recommendation – 3.4% WER on clean English, ~200 ms STT latency on a modern CPU or GPU, and fits in 2 GB RAM. Use tiny if you are on very constrained hardware (Raspberry Pi Zero 2 W, older phones) and can tolerate ~7.6% WER. Use medium or large-v3 only for batch transcription where latency is not a constraint.

Does whisper.cpp support languages other than English?

Yes. All Whisper multilingual models (base, small, medium, large-v3) support 99 languages. Add -l [language code] to the CLI: -l de for German, -l fr for French, -l ja for Japanese, etc. The tiny.en and base.en models are English-only and slightly more accurate for English than their multilingual equivalents.

How do I install faster-whisper with CUDA support?

Install with pip install faster-whisper. CUDA support requires CUDA 11.8 or later and cuDNN 8.x installed on your system. Verify your CUDA version with nvcc --version. Then specify device="cuda" when loading the model: WhisperModel("large-v3", device="cuda", compute_type="int8"). With the default device="auto", faster-whisper picks CUDA when available and falls back to CPU otherwise; an explicit device="cuda" errors out instead of silently degrading.
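
If you prefer the fallback to be explicit rather than relying on device="auto", a small guard works. A sketch, assuming a failed CUDA initialization raises at model-load time:

```python
from faster_whisper import WhisperModel

# Explicit fallback: try CUDA first, drop to CPU int8 if the CUDA
# backend cannot be initialized (missing driver, cuDNN, etc.).
try:
    model = WhisperModel("large-v3", device="cuda", compute_type="int8")
except Exception:
    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
```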

Which is more accurate – whisper.cpp or faster-whisper?

Identical. Both tools use the same OpenAI Whisper model weights and produce the same WER on any given audio file. The difference between whisper.cpp and faster-whisper is speed and platform support, not transcription accuracy. Any small WER difference you measure between runs comes from decoding settings and quantization, not from the runtime itself.

Can I run Whisper large-v3 on 8 GB RAM?

Yes on GPU – large-v3 int8 in faster-whisper uses ~2.5 GB VRAM and runs on any 8 GB GPU. On CPU-only hardware, 8 GB RAM is tight for large-v3 (float32 uses ~10 GB). Use medium (5 GB RAM) or small (2 GB RAM) on CPU-only systems. whisper.cpp is more memory-efficient on CPU than faster-whisper due to lower runtime overhead.

How much does local Whisper cost vs cloud STT?

Zero ongoing cost. Cloud STT services charge $0.006–$0.024 per audio minute – for a developer transcribing 8 hours of meetings per week, that's roughly $12–50/month. Local Whisper runs on hardware you already own, with no per-minute fees, no API key management, and no audio data leaving your machine.
