
Whisper on Apple Silicon 2026: Metal Benchmarks, Core ML Setup, M1–M5 Speed Guide

14 min read · By Hans Kuepper, Founder of PromptQuorum, a multi-model AI dispatch tool

Whisper large-v3 on M5 Pro: 10–12× real-time. Metal GPU acceleration is automatic. Large-v3-turbo balances speed and accuracy at 14–18×. Zero cost, fully offline.

Whisper speech-to-text on Apple Silicon: Metal and Core ML benchmarks for M1 through M5 Max. Setup guide, model selection, real-time transcription.

Full Benchmark Table: Whisper Performance on Apple Silicon (M1–M5)

| Chip | Tiny | Base | Small | Medium | Large-v3 |
|---|---|---|---|---|---|
| M1 | 32× | 20× | 12× | 5× | — |
| M1 Pro | 38× | 24× | 16× | 7× | — |
| M1 Max | 45× | 30× | 22× | 10× | — |
| M1 Ultra | 55× | 38× | 28× | 14× | — |
| M2 | 36× | 23× | 14× | 6× | — |
| M2 Pro | 42× | 28× | 20× | 9× | — |
| M2 Max | 50× | 35× | 26× | 12× | — |
| M2 Ultra | 60× | 42× | 32× | 17× | — |
| M3 | 40× | 26× | 16× | 7× | — |
| M3 Pro | 46× | 32× | 22× | 10× | — |
| M3 Max | 55× | 40× | 30× | 14× | — |
| M4 | 44× | 30× | 18× | 8× | — |
| M4 Pro | 50× | 36× | 26× | 12× | — |
| M4 Max | 60× | 44× | 34× | 16× | — |
| M5 (base) | 48× | 34× | 22× | 10× | — |
| M5 Pro | 55× | 40× | 30× | 14× | 10–12× |
| M5 Max | 65× | 48× | 38× | 18× | — |

×N real-time = N seconds of audio transcribed in 1 second. Benchmarks via whisper.cpp with Metal acceleration. All M1 Pro and later chips can run large-v3 in real time or faster.
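
For example, at 14× real-time a 60-minute recording transcribes in roughly 4.3 minutes (60 ÷ 14 ≈ 4.3).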

Whisper Model Sizes: Which One Should You Use?

| Model | Parameters | Disk Size | RAM Usage | English WER | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~75 MB | — | — | Battery-constrained laptops (ANE) |
| base | 74M | ~142 MB | — | — | Battery-constrained laptops (ANE) |
| small | 244M | 466 MB | — | — | Real-time transcription |
| medium | 769M | ~1.5 GB | — | — | — |
| large-v3 | 1,550M | 3 GB | — | ~2.5% | Maximum accuracy |
| large-v3-turbo | 809M | 1.6 GB | — | — | Speed/accuracy balance |
| distil-large-v3 | 756M | ~1.5 GB | — | — | Real-time, near-large-v3 quality |

WER (Word Error Rate) on the English LibriSpeech test set. Large-v3-turbo and distil-large-v3 are the sweet spot for real-time on most Macs: near-large-v3 quality at 4–6× the speed.

Metal vs Core ML vs Apple Neural Engine: Which Backend?

Apple Silicon offers three acceleration paths for Whisper. Each has tradeoffs.

Metal (via whisper.cpp) – Recommended: uses Apple's Metal GPU framework, compatible with all M-series chips, 10–12× real-time on large-v3 (M5 Pro), setup via make WHISPER_METAL=1. Best for: most users; easiest setup, proven performance.

Core ML (via Apple's Core ML format) – Advanced: uses Apple's machine-learning framework, can target the Neural Engine (ANE) for some operations, 15–20% faster on some workloads, requires model conversion (10–15 min setup). Best for: power users wanting maximum speed.
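
As a sketch of that conversion flow, based on the steps documented in the whisper.cpp repository (package names and flags can shift between releases, so verify against the current README):

```bash
# Python packages used by the conversion script
pip install ane_transformers openai-whisper coremltools

# Generate a Core ML encoder for the chosen model
./models/generate-coreml-model.sh base.en

# Rebuild whisper.cpp with Core ML enabled
make clean
WHISPER_COREML=1 make -j
```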

Apple Neural Engine (ANE) – Limited use: a dedicated AI accelerator on all M-series chips, not directly accessible (must go through Core ML). Whisper doesn't fully utilize the ANE due to an architecture mismatch; it works best with small models (tiny, base). Best for: tiny/base Whisper on battery-constrained laptops.

Decision matrix:

  • First-time setup → Metal (whisper.cpp)
  • Maximum speed on large-v3 → Metal (whisper.cpp)
  • Battery-powered laptop, base model → Core ML with ANE
  • Production server → Metal (proven, reliable)
  • Real-time transcription → Metal with streaming mode
  • Cloud deployment to Mac instances → Metal (containerizable)

  • Metal (whisper.cpp): Faster, widely compatible, simplest setup
  • Core ML: Neural Engine optimization, 15–20% speed gain on some workloads (requires conversion)
  • Apple Neural Engine: Limited benefit for large models, best for tiny/base on laptops
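
To confirm the Metal path is actually active at runtime, run a short transcription and grep the log for the Metal backend initialization (this assumes you have built whisper.cpp and downloaded the small model as in the setup below; jfk.wav ships with the repository):

```bash
./main -m models/ggml-small.bin -f samples/jfk.wav 2>&1 | grep -i metal
# Expect initialization lines such as ggml_metal_init if the GPU backend is in use
```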

Setup: whisper.cpp with Metal Acceleration

1. Install dependencies:

```bash
xcode-select --install   # Xcode command-line tools
brew install ffmpeg      # audio format conversion
```

2. Clone and build whisper.cpp with Metal:

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make WHISPER_METAL=1
./main -h | grep -i metal   # confirm the binary was built with Metal
```

3. Download a model:

```bash
bash ./models/download-ggml-model.sh small            # 466 MB, real-time
bash ./models/download-ggml-model.sh large-v3         # 3 GB, best quality
bash ./models/download-ggml-model.sh large-v3-turbo   # 1.6 GB, balanced
```

4. Transcribe an audio file:

```bash
./main -m models/ggml-large-v3.bin -f /path/to/audio.wav
./main -m models/ggml-large-v3.bin -f audio.wav -oj     # JSON output
./main -m models/ggml-large-v3.bin -f audio.wav -l en   # specify language
```

5. Convert non-WAV audio first, since whisper.cpp expects 16 kHz mono WAV (a batch version combining steps 4 and 5 follows below):

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
./main -m models/ggml-large-v3.bin -f output.wav
```
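
Steps 4 and 5 combine naturally into a batch job. A minimal sketch, assuming a folder of MP3s at ~/audio (the path is illustrative; -otxt is whisper.cpp's plain-text output flag):

```bash
for f in ~/audio/*.mp3; do
  wav="${f%.mp3}.wav"
  ffmpeg -i "$f" -ar 16000 -ac 1 -c:a pcm_s16le "$wav"   # 16 kHz mono WAV
  ./main -m models/ggml-large-v3.bin -f "$wav" -otxt     # writes the transcript next to the file
done
```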

Real-Time Streaming Transcription (Live Microphone)

For live transcription from a microphone: voice assistants, meeting transcription, accessibility tools.

Option 1: whisper.cpp stream mode

```bash
./stream -m models/ggml-small.bin --step 500 --length 5000
# --step 500: process every 500 ms
# --length 5000: keep the last 5 seconds of context
```

Option 2: Python with faster-whisper (see code block below)

Latency on M5 Pro: small model ~200ms, large-v3-turbo ~400–600ms, large-v3 ~800ms–1.2s behind real-time.

```python
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# faster-whisper's CTranslate2 backend has no Metal support, so run on CPU with int8
model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
audio_queue = queue.Queue()
chunk_duration = 3    # seconds of audio per transcription pass
sample_rate = 16000

def callback(indata, frames, time, status):
    # Keep the audio callback lightweight: hand each block to the main thread
    audio_queue.put(indata.copy())

buffer, buffered_samples = [], 0
with sd.InputStream(callback=callback, channels=1, samplerate=sample_rate):
    print("Listening... (Ctrl+C to stop)")
    while True:
        block = audio_queue.get()
        buffer.append(block)
        buffered_samples += len(block)
        if buffered_samples >= chunk_duration * sample_rate:
            audio = np.concatenate(buffer).flatten().astype(np.float32)
            segments, _ = model.transcribe(audio, beam_size=5)
            for segment in segments:
                print(segment.text)
            buffer, buffered_samples = [], 0
```

Voice Assistant Pipeline: Whisper + Ollama + Piper TTS

Complete code for a local voice assistant running entirely on Apple Silicon.

```python
import subprocess

import numpy as np
import requests
import sounddevice as sd
from faster_whisper import WhisperModel

WHISPER_MODEL = "large-v3-turbo"
OLLAMA_URL = "http://localhost:11434/api/chat"
LLM_MODEL = "llama3.1:8b"
SAMPLE_RATE = 16000

whisper = WhisperModel(WHISPER_MODEL, device="cpu", compute_type="int8")

def record_audio(duration=5):
    print("Listening...")
    audio = sd.rec(int(duration * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=1,
                   dtype=np.float32)
    sd.wait()
    return audio.flatten()

def transcribe(audio):
    segments, _ = whisper.transcribe(audio, beam_size=5)
    return " ".join(seg.text for seg in segments)

def llm_respond(user_text):
    response = requests.post(OLLAMA_URL, json={
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["message"]["content"]

def speak(text):
    # Piper reads text from stdin and writes a WAV file; play it with macOS afplay
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium.onnx", "--output_file", "reply.wav"],
        input=text.encode(),
        check=True,
    )
    subprocess.run(["afplay", "reply.wav"], check=True)

while True:
    audio = record_audio(duration=5)
    user_text = transcribe(audio)
    print(f"You: {user_text}")
    if not user_text.strip():
        continue
    response = llm_respond(user_text)
    print(f"AI: {response}")
    speak(response)
```
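
Before running the loop, the pieces it assumes must be in place: an Ollama server with the model pulled, the Python packages, and a Piper voice file next to the script. Roughly (the piper-tts package name and voice file are assumptions; check the Piper docs):

```bash
ollama pull llama3.1:8b        # LLM served at localhost:11434
pip install faster-whisper sounddevice numpy requests piper-tts
# Download en_US-amy-medium.onnx (plus its .json config) from the Piper voices releases
```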

Best Whisper Configuration by Mac Model

| Mac Config | Recommended Model | Real-time Multiple | Use Case |
|---|---|---|---|
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |
| — | — | — | — |

For a real-time voice assistant: use small or large-v3-turbo for the lowest latency. For meeting or podcast transcription: use large-v3 for maximum accuracy (a 1–2 second delay is acceptable).

Local Whisper vs Cloud Speech-to-Text Services

| Metric | Whisper Local (M5 Pro) | Google Speech-to-Text | OpenAI Whisper API | AssemblyAI |
|---|---|---|---|---|
| Cost per hour of audio | ~$0.01 (electricity) | ~$1.44 | ~$0.36 | ~$0.65 |
| Accuracy (English WER) | ~2.5% | — | — | — |
| Latency | ~100 ms, no network | 100–500 ms network overhead | 100–500 ms network overhead | 100–500 ms network overhead |
| Privacy | Audio never leaves the Mac | Uploaded to cloud | Uploaded to cloud | Uploaded to cloud |
| Offline capable | Yes | No | No | No |
| Languages | 99, auto-detect | — | — | — |
| Setup | Build whisper.cpp + download model | API key | API key | API key |

Monthly cost (8 hours/day): Whisper local ~$3 in electricity, Google $345, OpenAI $86, AssemblyAI $156. For privacy-sensitive work (medical, legal, journalism), local Whisper is the only option. For high-volume transcription ($100+/month in cloud fees), a local Mac pays for itself within about 12 months.
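
Worked out: 8 hours/day is roughly 240 hours/month, so those bills correspond to about $1.44/hour (Google), $0.36/hour (OpenAI), and $0.65/hour (AssemblyAI), versus roughly $0.01/hour of electricity locally, which is where the per-hour figures in the table above come from.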

Is Whisper faster than cloud APIs?

Local on M5 Pro: 10× real-time (~100 ms latency). Cloud APIs: 100–500 ms latency due to the network round-trip. Local is faster and free.

Can Whisper handle multiple speakers?

Whisper transcribes multi-speaker audio and provides timestamps, but it does not label who is speaking. Use a diarization tool in post-processing to attribute segments to speakers.
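
For example, a minimal diarization pass with pyannote-audio (our suggestion, not part of the article's pipeline; the pretrained pipeline is gated and needs a Hugging Face access token):

```python
from pyannote.audio import Pipeline

# Pretrained speaker-diarization pipeline (requires an HF token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Match these speaker turns against Whisper's segment timestamps
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```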

What language support?

99 languages with auto-detect. Accuracy varies by language: English is 2.5% WER, other languages 5–15% WER.
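
With whisper.cpp, auto-detection is opt-in via the -l auto flag (the file name here is illustrative):

```bash
./main -m models/ggml-large-v3.bin -f interview.wav -l auto
```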

Which Whisper model has the best speed-to-quality ratio?

Large-v3-turbo or distil-large-v3. Both achieve ~95% of large-v3 accuracy at 4–6× the speed. Recommended for most real-time use cases.

Can Whisper handle accented English or non-native speakers?

Yes, but WER increases. Native English: ~2.5%. Strong accent/non-native: 5–12%. Large-v3 handles accents better than smaller models.

Does Whisper work for podcasts and music transcription?

Podcasts: yes, excellent for spoken word. Music with lyrics: poor, since Whisper is trained on speech. Use specialized models for music transcription.

How accurate is Whisper for technical terminology?

It varies. Common technical terms are handled well; highly specialized terms may be transcribed incorrectly. Use the --prompt flag with expected vocabulary to improve accuracy.
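
For example (the vocabulary list is hypothetical; adapt it to your domain):

```bash
./main -m models/ggml-large-v3.bin -f meeting.wav \
  --prompt "Kubernetes, kubectl, Istio, etcd, CRD"
```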

Can I run multiple Whisper instances on one Mac?

Yes; the limit is memory. An M5 Pro with 36 GB runs 2 simultaneous large-v3 instances; an M5 Max with 128 GB runs 4–6 instances, or one instance alongside an LLM and TTS.
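
As a sketch, two background whisper.cpp jobs on one machine (file names are illustrative; each process loads its own copy of the model):

```bash
./main -m models/ggml-large-v3.bin -f call1.wav -otxt &
./main -m models/ggml-large-v3.bin -f call2.wav -otxt &
wait   # block until both transcriptions finish
```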

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Join the PromptQuorum Waitlist →
