Key Takeaways
- GPU (NVIDIA RTX 5090): 200 tokens/sec for 8B models. Best performance, $2,000.
- GPU (NVIDIA RTX 4090): 150 tokens/sec for 8B models. Best value: RTX 4070 Ti at 80 tok/sec for $600.
- Apple Silicon M2 Ultra: 60 tokens/sec for 8B, 35 tok/sec for 70B *natively* (no offloading). Unique advantage: the Mac Studio is the only consumer hardware that runs 70B models without quality loss.
- CPU (Intel i9): 5-6 tokens/sec. Impractical for real-time chat (5-10 second latency).
- For serious work: GPU wins on speed (30-40× faster than CPU due to memory bandwidth). Apple M2 Ultra wins on large models (native 70B execution).
Performance Comparison: Speed and Throughput
*With offloading to RAM: significant quality degradation.
| Hardware | Llama 3.2 8B | Llama 3.3 70B | Qwen2.5 32B | Cost |
|---|---|---|---|---|
| RTX 5090 (GPU, 32 GB) | 200 tok/sec | 50 tok/sec | 70 tok/sec | $2,000 |
| RTX 4090 (GPU, 24 GB) | 150 tok/sec | 10 tok/sec* | 50 tok/sec | $1,800 |
| RTX 4070 Ti (GPU, 12 GB) | 80 tok/sec | Not possible | 25 tok/sec | $600 |
| Mac Studio M2 Ultra (192 GB) | 60 tok/sec | 35 tok/sec | 45 tok/sec | $4,000 |
| MacBook Pro M4 Max (128 GB) | 35 tok/sec | 8 tok/sec* | 22 tok/sec | $4,000 |
| MacBook Pro M5 Max (96 GB) | 25 tok/sec | 5 tok/sec* | 15 tok/sec | $3,500 |
| Intel i9 14900K (CPU only) | 5 tok/sec | 1 tok/sec | 2 tok/sec | $600 |
| AMD Ryzen 9 7950X (CPU only) | 6 tok/sec | 1 tok/sec | 2 tok/sec | $650 |
NVIDIA GPU: The Performance King
NVIDIA GPUs (RTX 40/50 series) are the current best for local LLMs in April 2026. Dominance is due to:
- CUDA ecosystem: 20+ years of AI-specific optimization. Most models optimized for CUDA first.
- Tensor cores: Specialized hardware for matrix operations (the core of LLM inference).
- Memory bandwidth: RTX 5090 has 1,792 GB/sec (GDDR7); RTX 4090 has 1,008 GB/sec; far exceeds unified memory systems.
- Mature software: vLLM, llama.cpp, LM Studio all optimized for NVIDIA. Best inference performance at native precision.
- RTX 5090 (2025 flagship): 200 tok/sec on Llama 3.2 8B, can handle 70B at 50 tok/sec.
Trade-off: High upfront cost ($600-$2000), power consumption (350-575W), requires good cooling and 1200W PSU.
CPU-Only: When and Why to Avoid
CPUs can run LLMs but are impractical for real-time inference:
- Latency: 5-10 seconds per response for 7B models. Unusable for chat.
- Power: CPUs under full load can draw 200W+ (inefficient for inference).
- Context: CPUs scale poorly with long contexts (key-value cache).
CPU is suitable only for batch processing offline (e.g., process documents overnight without real-time response).
Apple Silicon: Unique Strength in Large Models
Apple M-series (M2 Ultra, M3/M4 Max) excel at running large models natively, a unique advantage:
- Unified memory: CPU and GPU share memory pool, eliminating transfer overhead.
- Large model capability: Mac Studio M2 Ultra (192 GB) runs Llama 3.3 70B at 35 tok/sec natively, no offloading. Unique to Apple Silicon.
- Per-watt efficiency: M5 Max handles 7B at 25 tok/sec at just 25W. M4 Max is faster (~35 tok/sec).
- Integration: Native to macOS, no driver issues, works out of box.
- Limitation for GPU: Shared memory means no discrete VRAM upgrade. Model size ≤ system RAM.
Mac Studio M2 Ultra (192 GB): 60 tok/sec on 8B, 35 tok/sec on 70B. It is the only consumer hardware with this capability. Research teams running 70B+ should consider Mac Studio.
MacBook Pro: M4 Max (128 GB) at 35 tok/sec for 8B is solid for mobile. M5 Max (96 GB) at 25 tok/sec works for lighter needs.
**For specific M5 Pro and M5 Max benchmarks for local LLMs, see our dedicated Apple Silicon M5 comparison.**
Memory Bandwidth: The Real Speed Bottleneck
LLM inference is memory-bound, not compute-bound. Token generation speed is limited by how fast you can load model weights from memory. Higher memory bandwidth = faster token generation.
The formula: Inference speed ≈ Memory bandwidth ÷ Model weights in memory
- This bandwidth gap explains why GPUs are 30-40× faster than CPU for inference.
- Apple Silicon unified memory has lower bandwidth than NVIDIA GDDR7/GDDR6X, but is still 9× faster than DDR5 RAM.
- Unified memory advantage: No CPU-to-GPU transfer overhead. Model stays in one memory pool.
- GPU disadvantage for large models: Limited VRAM (24 GB max for RTX 4090). Offloading to system RAM (89 GB/s) creates a 10× speed penalty.
- Why Mac Studio M2 Ultra (192 GB unified) is unique: It can fit 70B models natively with 800 GB/s bandwidth, so there is no offloading penalty and no speed cliff.
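The formula above can be turned into a rough ceiling on generation speed. The following is a minimal sketch, not a benchmark: it assumes every generated token reads all model weights once, and it assumes an 8B model at 4-bit quantization occupies about 4.7 GB (an illustrative figure, not taken from this article). The function name `max_tokens_per_sec` is hypothetical.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical ceiling: memory bandwidth divided by the weights read per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed footprint of an 8B model at 4-bit quantization (illustrative only).
WEIGHTS_8B_Q4_GB = 4.7

# Bandwidth figures quoted in this article, in GB/s.
platforms = {
    "RTX 5090": 1792,
    "RTX 4090": 1008,
    "Mac Studio M2 Ultra": 800,
    "DDR5-5600 (CPU only)": 89,
}

for name, bandwidth in platforms.items():
    ceiling = max_tokens_per_sec(bandwidth, WEIGHTS_8B_Q4_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/sec")
```

Measured speeds land below this ceiling because real inference also spends time on compute, KV-cache reads, and kernel launch overhead, so treat the result as an upper bound rather than a prediction.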
| Platform | Memory Bandwidth | Effective Speed (8B) |
|---|---|---|
| RTX 5090 (GDDR7) | 1,792 GB/s | 200 tok/sec |
| RTX 4090 (GDDR6X) | 1,008 GB/s | 150 tok/sec |
| RTX 4070 Ti (GDDR6X) | 504 GB/s | 80 tok/sec |
| Mac Studio M2 Ultra (unified) | 800 GB/s | 60 tok/sec |
| MacBook Pro M4 Max (unified) | 546 GB/s | 35 tok/sec |
| MacBook Pro M5 Max (unified) | 400 GB/s | 25 tok/sec |
| DDR5-5600 RAM (CPU only) | 89 GB/s | 5 tok/sec |
| DDR4-3200 RAM (CPU only) | 51 GB/s | 3 tok/sec |
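The offloading penalty described in this section can be approximated by splitting the weights between VRAM and system RAM and adding up the time to read each part. This is a simplified sketch under stated assumptions: a 70B model at 4-bit quantization is taken to be about 40 GB (illustrative), all VRAM is available for weights, and PCIe transfer limits are ignored. The function name `offloaded_tokens_per_sec` is hypothetical.

```python
def offloaded_tokens_per_sec(
    weights_gb: float,
    vram_gb: float,
    vram_bw_gb_s: float,
    ram_bw_gb_s: float,
) -> float:
    """Estimate decode speed when weights that do not fit in VRAM spill to system RAM."""
    in_vram = min(weights_gb, vram_gb)   # portion served from fast GPU memory
    in_ram = weights_gb - in_vram        # remainder served from slower system RAM
    seconds_per_token = in_vram / vram_bw_gb_s + in_ram / ram_bw_gb_s
    return 1.0 / seconds_per_token


# Assumed: 70B model at 4-bit quantization is roughly 40 GB (illustrative only).
WEIGHTS_70B_Q4_GB = 40.0

# RTX 4090 (24 GB, 1,008 GB/s) spilling into DDR5-5600 (89 GB/s).
rtx_4090 = offloaded_tokens_per_sec(WEIGHTS_70B_Q4_GB, 24, 1008, 89)

# Mac Studio M2 Ultra (192 GB unified, 800 GB/s): everything fits, nothing spills.
m2_ultra = offloaded_tokens_per_sec(WEIGHTS_70B_Q4_GB, 192, 800, 89)

print(f"RTX 4090 with offloading: ~{rtx_4090:.1f} tok/sec ceiling")
print(f"M2 Ultra, fully resident: ~{m2_ultra:.1f} tok/sec ceiling")
```

Even though only part of the model spills, the slow portion dominates the time per token, which is why a partial offload produces a cliff rather than a gentle slowdown.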
Cost Per Token: True Cost Analysis
Consider the total cost of inference (hardware amortized over time):
| Hardware | Initial Cost | Tokens/Sec | Tokens/Year (24/7) | Long-term Cost |
|---|---|---|---|---|
| RTX 4090 (3-year life) | $1,800 | 150 | 4.7B | ~$0.13 per 1M tokens |
| RTX 4070 Ti (3-year) | $600 | 80 | 2.5B | ~$0.08 per 1M tokens |
| M5 Max Mac (already owned) | $0 | 25 | 0.79B | $0 per 1M tokens |
| OpenAI API ($0.01 per 1K tokens) | Pay-per-use | Unlimited | Unlimited | $10 per 1M tokens |
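Amortized hardware cost can be computed from the purchase price, throughput, and lifetime. The sketch below covers hardware cost only: it assumes continuous 24/7 generation by default and leaves out electricity, cooling, and resale value. The function name and the `utilization` parameter are hypothetical additions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def hardware_cost_per_million_tokens(
    price_usd: float,
    tokens_per_sec: float,
    years: float,
    utilization: float = 1.0,
) -> float:
    """Purchase price spread over every token generated during the hardware's lifetime."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_tokens = tokens_per_sec * SECONDS_PER_YEAR * years * utilization
    return price_usd / total_tokens * 1_000_000


# Prices, speeds, and the 3-year lifetime are the figures used in this article.
for name, price, speed in [("RTX 4090", 1800, 150), ("RTX 4070 Ti", 600, 80)]:
    full_time = hardware_cost_per_million_tokens(price, speed, years=3)
    part_time = hardware_cost_per_million_tokens(price, speed, years=3, utilization=2 / 24)
    print(f"{name}: ${full_time:.3f}/1M tokens at 24/7, ${part_time:.2f}/1M at 2 h/day")
```

Utilization matters as much as price: a card used two hours a day costs twelve times more per token than the same card running around the clock.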
When to Choose Each Platform?
Decision framework:
- Choose GPU: You need real-time chat (<1 sec latency), running models 24/7, or batch processing large datasets.
- Choose CPU-only: You are offline, need to batch process documents overnight, or want zero hardware investment.
- Choose Apple Silicon: You own a Mac, value low power consumption, or need to run 70B models natively (Mac Studio M2 Ultra with 192 GB).
Common Mistakes in Hardware Choice
- Thinking CPU is viable for chat. 5-second latency per response is not practical. User experience is unusable.
- Buying an older-generation GPU expecting similar performance. RTX 2080 is 10× slower than RTX 4070 Ti due to architecture improvements.
- Assuming M5 Max can handle 70B models. It cannot, even at extreme quantization. Limited by unified memory architecture.
- Ignoring power and cooling requirements. RTX 4090 needs 1200W PSU and good case ventilation, not just a "GPU slot".
FAQ
Is GPU or CPU better for running local LLMs?
GPU is significantly better for real-time inference. NVIDIA RTX 4090 runs 7B models at 150 tokens/sec; a high-end CPU like Intel i9 runs the same model at 3-5 tokens/sec. CPU inference produces 5-10 second response latency, making it impractical for interactive chat.
Can Apple Silicon run local LLMs?
Yes. Apple M-series (M3, M4) run 7B models at 25-30 tokens/sec using unified memory, significantly better than CPU-only x86 systems but slower than discrete NVIDIA GPUs. Model size is capped by system RAM, so 70B models need a high-memory configuration such as the Mac Studio M2 Ultra (192 GB).
What is the minimum GPU VRAM for local LLMs?
6 GB VRAM runs 7B models at Q4 quantization (4.1 GB used). 8 GB is the practical minimum for a smooth experience with 7B models at Q5. 16+ GB VRAM is needed for 13B models at full quality. 24 GB handles 30B models.
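VRAM needs can be estimated from parameter count and bits per weight. This is a rough sketch: the 4.5 bits per weight used for a 4-bit format (4-bit values plus per-block scales) and the 1.5 GB allowance for KV cache and runtime buffers are assumptions, and real usage varies with quantization scheme and context length. Both function names are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    params_billions: float,
    bits_per_weight: float,
    memory_gb: float,
    overhead_gb: float = 1.5,  # assumed allowance for KV cache and runtime buffers
) -> bool:
    """True if the weights plus the assumed overhead fit in the given memory."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= memory_gb


# 7B at ~4.5 bits/weight on a 6 GB card, and 70B on a 24 GB card vs 192 GB unified memory.
print(f"7B weights:  {weights_gb(7, 4.5):.1f} GB, fits in 6 GB:  {fits_in_memory(7, 4.5, 6)}")
print(f"70B weights: {weights_gb(70, 4.5):.1f} GB, fits in 24 GB: {fits_in_memory(70, 4.5, 24)}")
print(f"70B weights fit in 192 GB unified memory: {fits_in_memory(70, 4.5, 192)}")
```

Longer contexts grow the KV cache well beyond the fixed allowance used here, so leave extra headroom when planning for long-context work.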
How much faster is GPU vs CPU for LLM inference?
NVIDIA GPUs are 30-100× faster than CPUs for LLM inference. RTX 4090 generates 150 tokens/sec for 7B models; Intel i9 generates 3-5 tokens/sec. The speed gap comes from CUDA parallel processing and dedicated tensor cores, not just clock speed.
Is it worth buying a GPU just for local LLMs?
RTX 4070 Ti (12 GB VRAM, ~$600) amortized over 3 years costs less than OpenAI API fees for heavy users running 2+ hours per day. At 80 tokens/sec it handles real-time chat, coding assistance, and document summarization. Light users (under 30 min/day) are better served by API.
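The break-even point against an API can be estimated from the purchase price, the API price per token, and local throughput. The sketch below uses the $10 per 1M tokens API figure from the cost table and ignores electricity; it also assumes the GPU generates tokens for the whole time it is in use, which overstates real chat workloads. The function name `break_even_hours` is hypothetical.

```python
def break_even_hours(
    hardware_usd: float,
    api_usd_per_million: float,
    tokens_per_sec: float,
) -> float:
    """Hours of continuous generation after which the hardware has paid for itself."""
    tokens_to_break_even = hardware_usd / api_usd_per_million * 1_000_000
    return tokens_to_break_even / tokens_per_sec / 3600


hours = break_even_hours(hardware_usd=600, api_usd_per_million=10, tokens_per_sec=80)

# Convert the hours of generation into calendar days for two usage patterns.
for label, hours_per_day in [("heavy user (2 h/day)", 2.0), ("light user (0.5 h/day)", 0.5)]:
    print(f"{label}: break-even after ~{hours / hours_per_day:.0f} days")
```

Under these assumptions the heavy user recovers the purchase price well inside the 3-year lifetime, while the light user takes four times as long, and longer still once idle time between prompts is counted.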
Can I use multiple CPU cores to speed up LLM inference?
More CPU cores help marginally. llama.cpp uses all available threads, but the bottleneck is memory bandwidth (50-100 GB/sec for system RAM vs 1,000+ GB/sec for high-end GPU VRAM). More cores do not solve the bandwidth problem; only a GPU or Apple M-series unified memory architecture does.
What is memory bandwidth and why does it matter for LLMs?
LLM inference is memory-bound, not compute-bound. Token generation speed depends on how fast you load model weights from memory. RTX 5090 has 1,792 GB/s (GDDR7); DDR5 RAM has 89 GB/s. This bandwidth gap explains why GPUs are 30-40× faster than CPU for inference.
Which Apple Silicon chip is best for local LLMs?
Mac Studio M2 Ultra (192 GB) for running 70B models natively at 35 tok/sec, a unique advantage no consumer GPU can match. MacBook Pro M4 Max (128 GB) for portable use at 35 tok/sec on 8B models. M5 Max (96 GB) works for 7-13B models. Avoid base M4/M3 (8 GB RAM) for serious LLM work.
Can Apple Silicon run 70B models?
Mac Studio M2 Ultra with 192 GB unified memory runs Llama 3.3 70B at 35 tok/sec natively, without offloading. This is unique: no consumer GPU can do this. Smaller Mac models (M5 Max, M4 Max) partially offload to RAM, creating a 5-10× speed penalty. Full 70B quality is available only on Mac Studio M2 Ultra.
Is RTX 5090 worth the $2,000 for local LLMs?
Only if running 70B models regularly or production workloads. RTX 5090 (200 tok/sec on 8B) is about 1.3× faster than RTX 4090 ($1,800) on 8B models and 2.5× faster than RTX 4070 Ti. Better value: RTX 4070 Ti ($600, 80 tok/sec) for 8B-32B models; Mac Studio M2 Ultra ($4,000) if you need native 70B support.
Sources
- NVIDIA GPU Specifications: RTX 40/50 series GPU specs, VRAM, memory bandwidth.
- Apple M3 Performance: M5 Max unified memory architecture and inference performance.
- vLLM Benchmarks: Production LLM inference throughput benchmarks.
- Different hardware produces different token rates, but all inference benefits from structured prompts. Long-context requests require different techniques than short ones: context windows explained covers strategies for any hardware.