Full Benchmark Table: Whisper Performance on Apple Silicon (M1–M5)
| Chip | Tiny | Base | Small | Medium | Large-v3 |
|---|---|---|---|---|---|
| M1 | 32× | 20× | 12× | 5× | — |
| M1 Pro | 38× | 24× | 16× | 7× | — |
| M1 Max | 45× | 30× | 22× | 10× | — |
| M1 Ultra | 55× | 38× | 28× | 14× | — |
| M2 | 36× | 23× | 14× | 6× | — |
| M2 Pro | 42× | 28× | 20× | 9× | — |
| M2 Max | 50× | 35× | 26× | 12× | — |
| M2 Ultra | 60× | 42× | 32× | 17× | — |
| M3 | 40× | 26× | 16× | 7× | — |
| M3 Pro | 46× | 32× | 22× | 10× | — |
| M3 Max | 55× | 40× | 30× | 14× | — |
| M4 | 44× | 30× | 18× | 8× | — |
| M4 Pro | 50× | 36× | 26× | 12× | — |
| M4 Max | 60× | 44× | 34× | 16× | — |
| M5 (base) | 48× | 34× | 22× | 10× | — |
| M5 Pro | 55× | 40× | 30× | 14× | — |
| M5 Max | 65× | 48× | 38× | 18× | — |
N× real-time = N seconds of audio transcribed in 1 second; at 10× real-time, for example, a 60-minute recording transcribes in 6 minutes. Benchmarks via whisper.cpp with Metal acceleration. Every chip from the M1 Pro up runs large-v3 at real-time or faster.
Whisper Model Sizes: Which One Should You Use?
| Model | Parameters | Disk Size | RAM Usage | English WER | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~75 MB | ~0.3 GB | — | quick drafts; battery-constrained laptops |
| base | 74M | ~142 MB | ~0.4 GB | — | light real-time use on battery |
| small | 244M | 466 MB | ~0.9 GB | — | real-time transcription on base M-series chips |
| medium | 769M | ~1.5 GB | ~2.1 GB | — | middle ground when the large models don't fit |
| large-v3 | 1.55B | ~3 GB | ~3.9 GB | — | maximum accuracy; batch transcription |
| large-v3-turbo | 809M | 1.6 GB | — | — | real-time with near-large-v3 quality |
| distil-large-v3 | 756M | ~1.5 GB | — | — | real-time with near-large-v3 quality |
WER (Word Error Rate) is measured on the English LibriSpeech test set; sizes are for the ggml files used by whisper.cpp. Large-v3-turbo and distil-large-v3 are the sweet spot for real-time use on most Macs: near-large-v3 quality at 4–6× the speed.
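To check which model your own Mac sustains in real time, time a transcription and divide the clip length by the wall-clock time. A minimal sketch with faster-whisper (the 30-second test clip and its path are placeholders):

```python
import time

from faster_whisper import WhisperModel

AUDIO = "test_30s.wav"   # placeholder: any 16 kHz mono clip
AUDIO_SECONDS = 30.0     # duration of the test clip

for name in ["small", "large-v3-turbo", "large-v3"]:
    model = WhisperModel(name, device="cpu", compute_type="int8")
    start = time.perf_counter()
    segments, _ = model.transcribe(AUDIO, beam_size=5)
    _ = list(segments)   # transcription is lazy; consume to force the work
    elapsed = time.perf_counter() - start
    print(f"{name}: {AUDIO_SECONDS / elapsed:.1f}x real-time")
```

Note that faster-whisper runs on the CPU (its CTranslate2 backend has no Metal path), so these numbers will trail the whisper.cpp figures in the table above.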
Metal vs Core ML vs Apple Neural Engine: Which Backend?
Apple Silicon offers three acceleration paths for Whisper. Each has tradeoffs.
Metal (via whisper.cpp), recommended: uses Apple's Metal GPU framework, works on all M-series chips, reaches 10–12× real-time on large-v3 (M5 Pro), and builds with make WHISPER_METAL=1. Best for most users: easiest setup, proven performance.
Core ML (via Apple's Core ML format), advanced: uses Apple's machine-learning framework and can target the Neural Engine (ANE) for some operations. 15–20% faster on some workloads, but requires a one-time model conversion (10–15 min setup). Best for power users wanting maximum speed.
Apple Neural Engine (ANE), limited use: a dedicated AI accelerator on all M-series chips, not directly accessible (you must go through Core ML). Whisper doesn't fully utilize the ANE due to an architecture mismatch, so it works best with small models (tiny, base). Best for tiny/base Whisper on battery-constrained laptops.
Decision Matrix:
- First-time setup → Metal (whisper.cpp)
- Maximum speed on large-v3 → Metal (whisper.cpp)
- Battery-powered laptop, base model → Core ML with ANE
- Production server → Metal (proven, reliable)
- Real-time transcription → Metal with streaming mode
- Cloud deployment to Mac instances → Metal (containerizable)
- Metal (whisper.cpp): fast, widely compatible, simplest setup
- Core ML: Neural Engine optimization, 15–20% speed gain on some workloads (requires conversion; see the sketch after this list)
- Apple Neural Engine: limited benefit for large models, best for tiny/base on laptops
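To try the Core ML path, whisper.cpp ships a conversion script for the encoder. The steps below follow the project's Core ML instructions, but script names and build flags can shift between versions, so treat this as a sketch and check the README in your checkout:

```bash
# One-time: generate a Core ML encoder for the chosen model
# (needs Python with coremltools and the openai-whisper package).
./models/generate-coreml-model.sh base.en

# Rebuild whisper.cpp with Core ML support.
make clean
make WHISPER_COREML=1

# Run as usual; the Core ML encoder is picked up automatically when the
# generated .mlmodelc directory sits next to the ggml model file.
./main -m models/ggml-base.en.bin -f audio.wav
```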
Setup: whisper.cpp with Metal Acceleration
1. Install dependencies:

```bash
xcode-select --install   # Xcode command-line tools
brew install ffmpeg      # audio conversion
```

2. Clone and build whisper.cpp with Metal:

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make WHISPER_METAL=1
./main -h | grep -i metal   # confirm Metal support was compiled in
```

3. Download a model:

```bash
bash ./models/download-ggml-model.sh small           # 466 MB, real-time
bash ./models/download-ggml-model.sh large-v3        # 3 GB, best quality
bash ./models/download-ggml-model.sh large-v3-turbo  # 1.6 GB, balanced
```

4. Transcribe an audio file:

```bash
./main -m models/ggml-large-v3.bin -f /path/to/audio.wav
./main -m models/ggml-large-v3.bin -f audio.wav -oj    # JSON output
./main -m models/ggml-large-v3.bin -f audio.wav -l en  # specify language
```

5. Convert non-WAV audio first (whisper.cpp expects 16 kHz mono PCM WAV):

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
./main -m models/ggml-large-v3.bin -f output.wav
```
Real-Time Streaming Transcription (Live Microphone)
For live transcription from a microphone: voice assistants, meeting transcription, accessibility tools.
Option 1: whisper.cpp stream mode
```bash
./stream -m models/ggml-small.bin --step 500 --length 5000
# --step 500: transcribe a new chunk every 500 ms
# --length 5000: keep the last 5 seconds of audio as context
```
Option 2: Python with faster-whisper (see code block below)
Latency on M5 Pro: small model ~200 ms, large-v3-turbo ~400–600 ms, large-v3 ~800 ms–1.2 s behind real-time.
```python
# pip install sounddevice faster-whisper numpy
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# CTranslate2 (faster-whisper's backend) has no Metal support, so it runs
# on the CPU on macOS; int8 quantization keeps it near real-time.
model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

buffer = []
chunk_duration = 3    # seconds of audio per transcription pass
sample_rate = 16000   # Whisper models expect 16 kHz mono input

def callback(indata, frames, time, status):
    buffer.append(indata.copy())
    # Buffered seconds = total buffered frames / sample rate.
    if sum(len(chunk) for chunk in buffer) / sample_rate >= chunk_duration:
        audio = np.concatenate(buffer).flatten().astype(np.float32)
        segments, _ = model.transcribe(audio, beam_size=5)
        for segment in segments:
            print(segment.text)
        buffer.clear()

# Note: transcribing inside the audio callback is a simplification; a
# production version would hand chunks to a worker thread so capture
# is never blocked.
with sd.InputStream(callback=callback, channels=1, samplerate=sample_rate):
    print("Listening... (Ctrl+C to stop)")
    while True:
        sd.sleep(1000)
```

Voice Assistant Pipeline: Whisper + Ollama + Piper TTS
Complete code for a local voice assistant running entirely on Apple Silicon.
```python
# pip install sounddevice faster-whisper numpy requests
import subprocess

import numpy as np
import requests
import sounddevice as sd
from faster_whisper import WhisperModel

WHISPER_MODEL = "large-v3-turbo"
OLLAMA_URL = "http://localhost:11434/api/chat"
LLM_MODEL = "llama3.1:8b"
SAMPLE_RATE = 16000

whisper = WhisperModel(WHISPER_MODEL, device="cpu", compute_type="int8")

def record_audio(duration=5):
    """Record a fixed-length utterance from the default microphone."""
    print("Listening...")
    audio = sd.rec(int(duration * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=1,
                   dtype=np.float32)
    sd.wait()
    return audio.flatten()

def transcribe(audio):
    segments, _ = whisper.transcribe(audio, beam_size=5)
    return " ".join(seg.text for seg in segments)

def llm_respond(user_text):
    response = requests.post(OLLAMA_URL, json={
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    })
    response.raise_for_status()
    return response.json()["message"]["content"]

def speak(text):
    # Piper synthesizes to a file (or raw stdout) but does not play audio
    # itself, so write a WAV and play it with macOS's built-in afplay.
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium.onnx",
         "--output_file", "/tmp/reply.wav"],
        input=text.encode(),
        check=True,
    )
    subprocess.run(["afplay", "/tmp/reply.wav"], check=True)

while True:
    audio = record_audio(duration=5)
    user_text = transcribe(audio)
    print(f"You: {user_text}")
    if not user_text.strip():
        continue
    response = llm_respond(user_text)
    print(f"AI: {response}")
    speak(response)
```

Best Whisper Configuration by Mac Model
| Mac Config | Recommended Model | Real-time Multiple | Use Case |
|---|---|---|---|
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
For a real-time voice assistant, use small or large-v3-turbo for the lowest latency. For meeting/podcast transcription, use large-v3 for maximum accuracy (a 1–2 second delay is acceptable); a typical command is sketched below.
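For the meeting/podcast case, whisper.cpp can write transcripts and subtitles directly; a typical batch invocation looks like this (output flags as listed by ./main -h, so verify against your build):

```bash
# Maximum-accuracy batch transcription with text and subtitle output.
./main -m models/ggml-large-v3.bin -f meeting.wav -otxt -osrt
# Produces meeting.wav.txt (plain transcript) and meeting.wav.srt
# (subtitles with timestamps).
```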
Local Whisper vs Cloud Speech-to-Text Services
| Metric | Whisper Local (M5 Pro) | Google Speech-to-Text | OpenAI Whisper API | AssemblyAI |
|---|---|---|---|---|
| Cost per hour of audio | ~$0.01 (electricity) | ~$1.44 | ~$0.36 | ~$0.65 |
| Accuracy (English WER) | ~2.5% (large-v3) | — | — | — |
| Latency | ~100 ms, on-device | 100–500 ms + network | 100–500 ms + network | 100–500 ms + network |
| Privacy | audio never leaves the Mac | audio sent to Google | audio sent to OpenAI | audio sent to AssemblyAI |
| Offline capable | Yes | No | No | No |
| Languages | 99, auto-detect | 125+ | 99 | — |
| Setup | one-time build (~15 min) | API key + billing | API key + billing | API key + billing |
Monthly cost at 8 hours of audio per day (~240 hours/month): Whisper local ~$3 in electricity, Google ~$345, OpenAI ~$86 (240 h × $0.36/h), AssemblyAI ~$156. For privacy-sensitive work (medical, legal, journalism), local Whisper is the only option that keeps audio on-device. If you are spending $100+/month on cloud transcription, a local Mac pays for itself within about 12 months.
Is Whisper faster than cloud APIs?
Local on an M5 Pro: 10× real-time, with ~100 ms of processing latency. Cloud APIs add 100–500 ms of network latency on top. Local is both faster and free per request.
Can Whisper handle multiple speakers?
Not by itself: Whisper produces timestamps but no speaker labels. Pair its segment timestamps with a diarization tool (such as pyannote.audio) to attribute text to speakers; a sketch follows below.
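A common pattern is to run diarization and transcription separately, then assign each Whisper segment to the speaker whose turn overlaps it most. A rough sketch with pyannote.audio and faster-whisper (the audio path and Hugging Face token are placeholders; pyannote's gated models require accepting their license terms):

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"  # placeholder path

# 1) Diarization: who spoke when.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)]

# 2) Transcription: what was said, with segment timestamps.
model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, _ = model.transcribe(AUDIO, beam_size=5)

# 3) Label each segment with the most-overlapping speaker turn.
def best_speaker(seg):
    overlaps = [(min(seg.end, end) - max(seg.start, start), spk)
                for start, end, spk in turns]
    overlap, spk = max(overlaps, default=(0.0, "unknown"))
    return spk if overlap > 0 else "unknown"

for seg in segments:
    print(f"[{best_speaker(seg)}] {seg.text.strip()}")
```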
What languages are supported?
99 languages with auto-detect. Accuracy varies by language: English sits around 2.5% WER, while most other languages fall in the 5–15% range.
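faster-whisper exposes the detected language and its confidence, which is an easy way to sanity-check auto-detection (the file path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, info = model.transcribe("clip.wav")  # no language specified
print(f"Detected: {info.language} (probability {info.language_probability:.2f})")
```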
Which Whisper model has the best speed-to-quality ratio?
Large-v3-turbo or distil-large-v3. Both achieve ~95% of large-v3 accuracy at 4–6× the speed. Recommended for most real-time use cases.
Can Whisper handle accented English or non-native speakers?
Yes, but WER increases. Native English: ~2.5%. Strong accent/non-native: 5–12%. Large-v3 handles accents better than smaller models.
Does Whisper work for podcasts and music transcription?
Podcasts: yes, excellent for spoken-word audio. Music with lyrics: poor, since Whisper is trained on speech. Use specialized models for music.
How accurate is Whisper for technical terminology?
Variable. Common technical terms: good. Highly specialized terms may be transcribed incorrectly. Seed the decoder with expected vocabulary via the --prompt flag to improve accuracy, as in the example below.
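With whisper.cpp the vocabulary hint goes in --prompt; faster-whisper's equivalent is the initial_prompt argument to transcribe(). The term list here is illustrative:

```bash
# Bias decoding toward expected domain vocabulary.
./main -m models/ggml-large-v3.bin -f lecture.wav \
       --prompt "Kubernetes, kubectl, etcd, Istio, containerd"
```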
Can I run multiple Whisper instances on one Mac?
Yes; the limit is memory. An M5 Pro with 36 GB handles two simultaneous large-v3 instances; an M5 Max with 128 GB handles 4–6 instances, or one instance alongside an LLM and TTS.
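Because each whisper.cpp instance is an independent process, running several is plain shell job control; memory scales with the instance count (file names here are placeholders):

```bash
# Transcribe two files in parallel; each process loads its own copy
# of the model (~3 GB each for large-v3).
./main -m models/ggml-large-v3.bin -f call_a.wav -otxt &
./main -m models/ggml-large-v3.bin -f call_b.wav -otxt &
wait   # block until both finish
```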