Key points
- GPU (NVIDIA RTX 4090): 150 tokens/sec for 7B models. Best performance, highest cost ($1800).
- CPU (Intel i9): 3–5 tokens/sec for 7B models. Free (you have one), unusable latency.
- Apple Silicon M3 Max: 25β30 tokens/sec for 7B models. Good middle ground, optimized for Mac architecture.
- For any serious use, GPU is non-negotiable. CPU-only is impractical (5–10 second latency per response).
- As of April 2026, NVIDIA dominates. Apple Silicon is catching up but still trails.
Performance Comparison: Speed and Throughput
| Hardware | Llama 7B | Llama 13B | Qwen 32B | Cost |
|---|---|---|---|---|
| RTX 4090 (GPU) | 150 tok/sec | 100 tok/sec | 50 tok/sec | $1800 |
| RTX 4080 (GPU) | 100 tok/sec | 70 tok/sec | 35 tok/sec | $1200 |
| RTX 4070 Ti (GPU) | 80 tok/sec | 50 tok/sec | 25 tok/sec | $600 |
| M3 Max Mac (GPU) | 25 tok/sec | 15 tok/sec | N/A | Included in Mac |
| Intel i9 (CPU only) | 5 tok/sec | 2 tok/sec | 1 tok/sec | Included |
| AMD Ryzen 9 (CPU only) | 4 tok/sec | 2 tok/sec | 0.5 tok/sec | Included |
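To make the table concrete, here is a minimal sketch converting decode speed into per-response latency. The throughput figures come from the table; the 250-token reply length is an illustrative assumption.

```python
# Convert steady decode speed (tok/sec) into chat-response latency.
# Throughput figures are from the table above; the 250-token reply
# length is an illustrative assumption, not a benchmark.
def response_latency(reply_tokens: int, tok_per_sec: float) -> float:
    """Seconds to generate `reply_tokens` at a constant decode rate."""
    return reply_tokens / tok_per_sec

for hw, speed in [("RTX 4090", 150), ("M3 Max", 25), ("Intel i9 (CPU)", 5)]:
    print(f"{hw}: {response_latency(250, speed):.1f} s")
```

At these speeds a 250-token reply takes about 1.7 s on the RTX 4090 but 50 s on the i9, which is the latency gap the rest of this section is about.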
NVIDIA GPU: The Performance King
NVIDIA GPUs (RTX 40/50 series) are currently the best choice for local LLMs. Their dominance comes from:
- CUDA ecosystem: nearly two decades of AI-specific optimization (CUDA first shipped in 2007).
- Tensor cores: Specialized hardware for matrix operations (the core of LLM inference).
- Memory bandwidth: ~1,000 GB/sec on the RTX 4090 (critical, since decoding streams the model weights for every generated token).
- Mature software: vLLM, llama.cpp, all optimized for NVIDIA.
Trade-off: high upfront cost ($600–$1800), high power draw (350–575 W), and a need for good cooling.
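The memory-bandwidth point can be sanity-checked with a common back-of-envelope model: decoding is bandwidth-bound because each new token streams roughly the full set of weights once, so peak tok/sec ≈ bandwidth / model size. A sketch, where the bandwidth and quantization figures are assumptions:

```python
# Bandwidth-bound upper bound on decode speed:
#   tok/sec ≈ memory_bandwidth / model_size_in_bytes
# Real throughput is lower (kernel overhead, KV-cache reads).
def peak_tok_per_sec(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_per_sec: float) -> float:
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_per_sec / model_gb

# 7B model at 4-bit (~0.5 bytes/param) on ~1,000 GB/s of VRAM bandwidth
print(round(peak_tok_per_sec(7, 0.5, 1000)))
```

The ceiling this gives (~286 tok/sec) sits comfortably above the measured 150 tok/sec, which is typical: real decoders reach only a fraction of the bandwidth bound.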
CPU-Only: When and Why to Avoid
CPUs can run LLMs but are impractical for real-time inference:
- Latency: 5–10 seconds per response for 7B models. Unusable for chat.
- Power: CPUs under full load can draw 200W+ (inefficient for inference).
- Context: CPUs scale poorly with long contexts, because the growing key-value cache is constrained by limited memory bandwidth.
CPU is suitable only for offline batch processing (e.g., working through documents overnight, when no one is waiting on a real-time response).
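A quick way to check whether a CPU-only overnight batch is feasible. The workload numbers below are illustrative assumptions, not figures from the text:

```python
# Will an offline batch job finish overnight on CPU-only hardware?
# Document count and tokens-per-document are illustrative assumptions.
def batch_hours(num_docs: int, tokens_per_doc: int, tok_per_sec: float) -> float:
    """Wall-clock hours to generate tokens for every document sequentially."""
    return num_docs * tokens_per_doc / tok_per_sec / 3600

# 300 documents, ~400 generated tokens each, Intel i9 at 5 tok/sec
print(f"{batch_hours(300, 400, 5):.1f} h")
```

About 6.7 hours: fine overnight, hopeless interactively, which is exactly the split this section describes.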
Apple Silicon: Good for Mac, but GPU Wins Overall
Apple M-series chips (M3, M4) are surprisingly capable for integrated hardware:
- Unified memory: CPU and GPU share memory, eliminating transfers.
- Per-watt efficiency: M3 Max handles 7B models decently (~25 tok/sec) at low power (25W for model inference).
- Integration: Native to macOS, no driver issues.
- Limitation: no discrete GPU or VRAM upgrade path; the largest runnable model is capped by system RAM.
The M3 Max is excellent for Mac users running 7–13B models. For 70B models, a Mac is not an option.
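The unified-memory ceiling can be sketched as a simple fit check. The 36 GB configuration and the 8 GB reserve for macOS and the KV cache are assumptions:

```python
def fits_in_unified_memory(params_billion: float, bytes_per_param: float,
                           ram_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the quantized weights fit in system RAM, leaving
    `reserve_gb` for macOS, the KV cache, and other processes
    (the reserve figure is an assumption)."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb <= ram_gb - reserve_gb

# On a 36 GB M3 Max: a 4-bit 13B model (~6.5 GB of weights) fits...
print(fits_in_unified_memory(13, 0.5, 36))
# ...but a 4-bit 70B model (~35 GB of weights alone) does not.
print(fits_in_unified_memory(70, 0.5, 36))
```

The same check explains why the 7–13B range is the Mac sweet spot: weights plus KV cache stay well under typical unified-memory sizes.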
Cost Per Token: True Cost Analysis
Consider the total cost of inference (hardware amortized over time):
| Hardware | Initial Cost | Tokens/Sec | Tokens/Year (24/7) | Long-term Cost |
|---|---|---|---|---|
| RTX 4090 (3-year life) | $1800 | 150 | ~4.7B | ~$600/year (~$0.13 per 1M tokens) |
| RTX 4070 Ti (3-year life) | $600 | 80 | ~2.5B | ~$200/year (~$0.08 per 1M tokens) |
| M3 Max Mac (already owned) | $0 | 25 | ~0.8B | $0 (sunk cost) |
| OpenAI API ($0.01 per 1K tokens) | $0 | Unlimited | Usage-dependent | $10 per 1M tokens |

Tokens/year assumes continuous 24/7 decoding of a 7B model at the speeds listed above; electricity is excluded.
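The amortization behind the table can be reproduced in a few lines, under the same assumptions (24/7 duty cycle, electricity excluded):

```python
# Amortized local-inference cost per million tokens, assuming the
# hardware decodes 24/7 for its whole service life. Electricity excluded.
def cost_per_million_tokens(hw_cost: float, life_years: float,
                            tok_per_sec: float) -> float:
    lifetime_tokens = tok_per_sec * life_years * 365 * 24 * 3600
    return hw_cost / lifetime_tokens * 1e6

# RTX 4090: $1800 over 3 years at 150 tok/sec
print(f"${cost_per_million_tokens(1800, 3, 150):.2f} per 1M tokens")
# For comparison, an API at $0.01 per 1K tokens costs $10.00 per 1M tokens.
```

At full utilization the GPU is roughly two orders of magnitude cheaper per token than the API; at low utilization the comparison flips, because the hardware cost is fixed regardless of how much you generate.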
When to Choose Each Platform
Decision framework:
- Choose GPU: you need real-time chat (<1 sec latency), run models 24/7, or batch-process large datasets.
- Choose CPU-only: You are offline, need to batch process documents overnight, or want zero hardware investment.
- Choose Apple Silicon: You own a Mac, run only 7B models, and value low power consumption.
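The framework above reduces to a small decision function. This is a sketch whose thresholds follow the bullets, not a definitive rule:

```python
# The decision framework above as a sketch. Thresholds mirror the
# bullets: real-time chat or models beyond ~13B push you to an NVIDIA GPU.
def pick_platform(realtime_chat: bool, owns_mac: bool,
                  largest_model_b: int) -> str:
    if realtime_chat or largest_model_b > 13:
        return "NVIDIA GPU"
    if owns_mac:
        return "Apple Silicon"
    return "CPU-only (offline batch)"

print(pick_platform(realtime_chat=True, owns_mac=True, largest_model_b=7))
print(pick_platform(realtime_chat=False, owns_mac=True, largest_model_b=7))
print(pick_platform(realtime_chat=False, owns_mac=False, largest_model_b=7))
```

Note that a Mac owner who needs real-time chat still lands on the GPU branch: latency requirements dominate hardware ownership in this framework.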
Common Mistakes in Hardware Choice
- Assuming a CPU is viable for chat. Five-plus seconds of latency per response makes the user experience unusable.
- Buying an older-generation GPU and expecting similar performance. An RTX 2080 is roughly 10× slower than an RTX 4070 Ti due to architectural improvements.
- Assuming an M3 Max can handle 70B models. It cannot, even at extreme quantization; it is capped by unified memory capacity.
- Ignoring power and cooling requirements. An RTX 4090 calls for an 850 W+ PSU and good case ventilation, not just a free PCIe slot.
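The power-and-cooling point can be budgeted with a common headroom rule of thumb. The 250 W rest-of-system figure and the 40% headroom factor are assumptions, not vendor specs:

```python
# Rough PSU sizing: GPU board power plus the rest of the system,
# times a headroom factor. All three numbers are assumptions to tune
# for your build; vendor PSU recommendations remain the authoritative source.
def min_psu_watts(gpu_watts: float, rest_of_system_watts: float = 250,
                  headroom: float = 1.4) -> int:
    return round((gpu_watts + rest_of_system_watts) * headroom)

print(min_psu_watts(450))  # RTX 4090-class board power
```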
Sources
- NVIDIA GPU Specifications β nvidia.com/en-us/geforce
- Apple M3 Performance β apple.com/mac/m3
- vLLM Benchmarks β github.com/vllm-project/vllm/tree/main/benchmarks