Key Takeaways
- GPU (RTX 3060 12GB): Mistral 7B Q4 at 15 tok/sec. Best speed/quality balance.
- GPU (RTX 2060 4GB): Mistral 7B Q2 (2-bit) at 20 tok/sec. Acceptable quality, fast.
- CPU (older laptop): Phi 2.7B Q4 at 3 tok/sec. Usable for chat, slow for coding.
- CPU + GPU disabled (battery): TinyLlama 1.1B Q4 at 2 tok/sec. Chat only.
- Speed ranking (fastest to slowest): GPU (RTX) > GPU (iGPU) > CPU (AVX) > CPU (scalar).
- Quality ranking: Mistral 7B > Phi 2.7B > TinyLlama 1.1B.
- Optimal strategy: quantize larger models (Mistral Q2) rather than use tiny models. Q2 Mistral > Q4 TinyLlama.
- Cost: All free (open source) vs. ChatGPT API (~$0.002 per 1K tokens).
GPU vs CPU Inference Trade-offs
GPU inference: 15–20 tok/sec on RTX 3060. Requires CUDA setup. Fast, best quality.
iGPU (integrated): 5–8 tok/sec on Intel Iris. No setup needed. Slower than discrete GPU.
CPU inference: 1–5 tok/sec on modern multi-core. Runs everywhere. Slowest.
Rule: If you have any GPU (even integrated), use it. CPU is last resort.
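To apply that rule in a script, here is a minimal sketch of an availability check; `has_nvidia_gpu` is a hypothetical helper that just automates the `nvidia-smi` check, assuming the standard NVIDIA driver tools:

```python
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is installed and reports at least one GPU.

    Hypothetical helper: automates the manual `nvidia-smi` check.
    Integrated Intel/AMD GPUs will NOT be detected by this.
    """
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver/tools on PATH
    try:
        # `nvidia-smi -L` prints one line per detected GPU
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except OSError:
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("NVIDIA GPU available:", has_nvidia_gpu())
```

For integrated GPUs, you still need to check `lspci` (Linux) or vendor documentation by hand, as noted in the CPU-only tricks below.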
Best Models by Hardware Constraint
| Hardware | Best Model | Speed | Quality | Notes |
|---|---|---|---|---|
| RTX 3060 12GB | Mistral 7B Q4 | 15 tok/sec | High | Best overall balance |
| RTX 2060 4GB | Mistral 7B Q2 | 20 tok/sec | Acceptable | Aggressive quantization to fit VRAM |
| iGPU (Intel Iris) | Mistral 7B Q4 | 5–8 tok/sec | High | No CUDA setup needed |
| Modern multi-core CPU | Phi 2.7B Q4 | 3 tok/sec | Good | Usable for chat, slow for coding |
| Old laptop CPU / battery | TinyLlama 1.1B Q4 | 0.5–2 tok/sec | Low | Chat only; last resort |
Quantization: Trading Quality for Speed
Q4 (4-bit): ~1% quality loss, 50% VRAM savings. Standard choice.
Q3 (3-bit): ~3% quality loss, 62% VRAM savings. Acceptable for chat.
Q2 (2-bit): ~10% quality loss, 75% VRAM savings. Risky; use only if OOM.
Speed impact: Q2 runs ~30% faster than Q4 because inference is memory-bandwidth bound; smaller weights mean less data to move per token, not less computation.
Strategy: Quantize larger models (Mistral 7B Q2) rather than use tiny models (TinyLlama).
Mistral 7B Q2 > TinyLlama 1.1B Q4 in both speed and quality.
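A back-of-envelope size estimate makes the strategy concrete. This is a rough sketch only: it multiplies parameter count by bits per weight, while real GGUF files add per-block scales and metadata, so actual files run somewhat larger:

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB for a quantized model.

    Rule of thumb only: ignores quantization scales and metadata,
    which add some overhead in real GGUF files.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Q2 of a 7B model is smaller than you might expect,
# and not much bigger than Q4 of a 1.1B model:
print(f"Mistral 7B Q4:     ~{model_size_gb(7.0, 4):.2f} GB")
print(f"Mistral 7B Q2:     ~{model_size_gb(7.0, 2):.2f} GB")
print(f"TinyLlama 1.1B Q4: ~{model_size_gb(1.1, 4):.2f} GB")
```

At ~1.75 GB of weights, Mistral 7B Q2 leaves room for the KV cache even on small GPUs, which is why it can beat TinyLlama on both axes.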
CPU-Only Optimization Tricks
- Enable AVX-512: If CPU supports it, use `LLAMACPP_AVX512=1 ollama run phi`. ~20% speedup.
- Reduce context window: Shorter context = faster. Use `--ctx-size 1024` instead of 4096.
- Use llama.cpp instead of Ollama: Slightly faster on CPU (~10% gain) due to less overhead.
- Limit thread count: Counter-intuitive, but on weak CPUs fewer threads (even a single thread) can be faster, because thread-synchronization overhead outweighs the parallelism gain.
- Offload to iGPU: Even weak integrated GPU beats CPU. Check `lspci` for GPU availability.
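The context-window tip works because the KV cache grows linearly with context length. A rough estimate, assuming a Mistral-7B-style architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) — these numbers are assumptions for illustration:

```python
def kv_cache_mb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in MB: K and V tensors per layer,
    per KV head, per context position, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e6

# Mistral 7B-style architecture: shrinking --ctx-size from 4096 to 1024
# cuts KV-cache memory by 4x.
for ctx in (1024, 4096):
    print(f"ctx={ctx}: ~{kv_cache_mb(32, 8, 128, ctx):.0f} MB KV cache")
```

This is also why squeezing a quantized 7B model into VRAM with zero headroom fails: the weights fit, but the KV cache does not.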
Performance Benchmarks
Real measurements on various hardware (April 2026):
- RTX 3060 12GB + Mistral 7B Q4: 15 tok/sec.
- RTX 2060 4GB + Mistral 7B Q2: 20 tok/sec (aggressive quantization).
- Intel Iris iGPU + Mistral 7B Q4: 8 tok/sec.
- Ryzen 5 5600X CPU + Phi 2.7B Q4: 3 tok/sec.
- Celeron N3050 (old laptop) + TinyLlama 1.1B Q4: 0.5 tok/sec (unusable).
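To translate those throughput numbers into felt latency, here is a trivial conversion for a typical ~100-token chat reply (labels abbreviate the benchmark rows above):

```python
def seconds_per_reply(tok_per_sec: float, n_tokens: int = 100) -> float:
    """Wall-clock time to generate n_tokens at a given speed."""
    return n_tokens / tok_per_sec

# Wait time for a 100-token reply on each benchmarked setup
for label, tps in [("RTX 2060 + Mistral Q2", 20),
                   ("RTX 3060 + Mistral Q4", 15),
                   ("Iris iGPU + Mistral Q4", 8),
                   ("Ryzen CPU + Phi Q4", 3),
                   ("Celeron + TinyLlama Q4", 0.5)]:
    print(f"{label}: {seconds_per_reply(tps):.1f} s")
```

Anything above roughly 30 seconds per reply starts to feel unusable for interactive chat, which matches the "unusable" verdict on the Celeron.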
Common Mistakes
- Using TinyLlama on CPU thinking "it's small so it'll be fast". Mistral Q2 is faster and better quality.
- Not enabling CPU acceleration flags (AVX, NEON). 20% speedup available for free.
- Quantizing to Q2 just to squeeze a 7B model into VRAM with no headroom. Often crashes because the KV cache needs VRAM too. Leave headroom or use a 3B model instead.
FAQ
Can I run Mistral 7B on a 4GB GPU?
At Q2, yes. At Q4, no (OOM). Q2 has acceptable quality loss (~5–10%).
Is CPU inference usable for chatbots?
Yes, for low-throughput use. At 3 tok/sec, a 100-token reply takes ~33 seconds. Not ideal, but it works.
Should I use Phi 2.7B or TinyLlama 1.1B on CPU?
Phi. It is only ~0.5 tok/sec slower but has much better quality. TinyLlama is the last-resort model.
How do I check if my GPU supports CUDA?
Run `nvidia-smi`. No output = no NVIDIA GPU. Check Intel/AMD documentation for integrated GPU.
Can I use quantization below Q2?
Technically yes (Q1), but quality degrades catastrophically. Not recommended.
Is CPU + GPU hybrid inference supported?
Yes, via layer offloading. Llama.cpp: `--n-gpu-layers 10`. Splits model across CPU/GPU.
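To pick a value for `--n-gpu-layers`, a crude heuristic is to divide usable VRAM by the per-layer weight size. This sketch assumes layers are roughly equal in size and reserves some VRAM for KV cache and runtime overhead; both assumptions are approximations:

```python
def max_offload_layers(vram_gb: float, model_gb: float, n_layers: int,
                       reserve_gb: float = 1.0) -> int:
    """Rough count of transformer layers that fit in VRAM.

    Heuristic only: assumes equal-sized layers and reserves
    `reserve_gb` for the KV cache and runtime overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~3.5 GB Mistral 7B Q4 file with 32 layers on a 4 GB card
layers = max_offload_layers(vram_gb=4.0, model_gb=3.5, n_layers=32)
print(f"--n-gpu-layers {layers}")
```

In practice you would start near this estimate and back off if the runtime reports out-of-memory.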
Sources
- Phi 2.7B model card (Microsoft Research)
- TinyLlama 1.1B documentation (TinyLlama project)
- Llama.cpp optimization guide: CPU acceleration flags