
Fastest Local LLMs for Low-End PCs

8 min · By Hans Kuepper, founder of PromptQuorum, a multi-model dispatch tool

On sub-8GB GPUs or CPU-only systems, Mistral 7B Q4, Phi 2.7B, and TinyLlama 1.1B are the practical picks, tuned for speed over quality. As of April 2026, CPU inference is 5–10× slower than GPU but still viable for chat where some waiting is acceptable. Quantizing to Q2 or Q3 fits 3–4B models on 4GB VRAM with acceptable speed.

Key Points

  • GPU (RTX 3060 8GB): Mistral 7B Q4 at 15 tok/sec. Best speed/quality.
  • GPU (RTX 2060 4GB): Mistral 7B Q2 (2-bit) at 20 tok/sec. Acceptable quality, fast.
  • CPU (older laptop): Phi 2.7B Q4 at 3 tok/sec. Usable for chat, slow for coding.
  • CPU + GPU disabled (battery): TinyLlama 1.1B Q4 at 2 tok/sec. Chat only.
  • Speed ranking (fastest to slowest): GPU (RTX) > GPU (iGPU) > CPU (AVX) > CPU (scalar).
  • Quality ranking: Mistral 7B > Phi 2.7B > TinyLlama 1.1B.
  • Optimal: Quantize larger models (Mistral Q2) over using tiny models. Q2 Mistral > Q4 TinyLlama.
  • Cost: All free (open source) vs. ChatGPT API (~$0.002 per 1K tokens).

GPU vs CPU Inference Trade-offs

GPU inference: 15–20 tok/sec on RTX 3060. Requires CUDA setup. Fast, best quality.

iGPU (integrated): 5–8 tok/sec on Intel Iris. No setup needed. Slower than discrete GPU.

CPU inference: 1–5 tok/sec on modern multi-core. Runs everywhere. Slowest.

Rule: If you have any GPU (even integrated), use it. CPU is last resort.
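The backend-selection rule above can be sketched as a small priority lookup. The backend names and the helper function are hypothetical (not a real library API); the tok/sec ranges are the rough figures quoted in this section.

```python
# Rough throughput figures from the section above (tok/sec ranges).
THROUGHPUT = {
    "gpu_discrete": (15, 20),    # RTX-class card with CUDA
    "gpu_integrated": (5, 8),    # e.g. Intel Iris
    "cpu_avx": (1, 5),           # modern multi-core CPU with AVX
    "cpu_scalar": (0.5, 1),      # old CPU, no usable SIMD
}

def pick_backend(available: list[str]) -> str:
    """Apply the article's rule: any GPU, even integrated, beats CPU.
    Returns the fastest backend present in `available`."""
    order = ["gpu_discrete", "gpu_integrated", "cpu_avx", "cpu_scalar"]
    for backend in order:
        if backend in available:
            return backend
    raise ValueError("no known backend available")

print(pick_backend(["cpu_avx", "gpu_integrated"]))  # gpu_integrated
```

Even when the integrated GPU's range (5–8 tok/sec) overlaps a fast CPU's upper bound, the rule still prefers it, matching the "CPU is last resort" advice.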

Best Models by Hardware Constraint

| Hardware | Best Model | Speed | Quality | Notes |
| --- | --- | --- | --- | --- |
| RTX 3060 8GB | Mistral 7B Q4 | 15 tok/sec | High | Best speed/quality balance |
| RTX 2060 4GB | Mistral 7B Q2 | 20 tok/sec | Acceptable | Aggressive 2-bit quantization |
| Integrated GPU (e.g. Intel Iris) | Mistral 7B Q4 | 5–8 tok/sec | High | No CUDA setup needed |
| Modern CPU (e.g. Ryzen 5 5600X) | Phi 2.7B Q4 | 3 tok/sec | Medium | Usable for chat, slow for coding |
| Old laptop CPU / on battery | TinyLlama 1.1B Q4 | 0.5–2 tok/sec | Low | Chat only, last resort |

Quantization: Trading Quality for Speed

Q4 (4-bit): ~1% quality loss, 50% VRAM savings. Standard choice.

Q3 (3-bit): ~3% quality loss, 62% VRAM savings. Acceptable for chat.

Q2 (2-bit): ~10% quality loss, 75% VRAM savings. Risky; use only if OOM.

Speed impact: Q2 is ~30% faster than Q4 because inference is bound by memory bandwidth, not computation; smaller weights mean less data streamed per token.

Strategy: Quantize larger models (Mistral 7B Q2) rather than use tiny models (TinyLlama).

Mistral 7B Q2 > TinyLlama 1.1B Q4 in both speed and quality.
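The VRAM side of this trade-off is simple arithmetic: weight memory is roughly parameters × bits ÷ 8. A minimal sketch (the helper name is ours; real loads also need KV cache and runtime overhead on top of these figures):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters * bits / 8.
    Ignores KV cache and runtime overhead, so real usage is higher."""
    return params_billion * 1e9 * bits / 8 / 1e9

for label, params, bits in [("Mistral 7B Q4", 7.0, 4),
                            ("Mistral 7B Q2", 7.0, 2),
                            ("TinyLlama 1.1B Q4", 1.1, 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.2f} GB")
# Mistral 7B Q4: ~3.50 GB
# Mistral 7B Q2: ~1.75 GB
# TinyLlama 1.1B Q4: ~0.55 GB
```

This is why 7B at Q4 overflows a 4GB card while Q2 fits with room for the KV cache, and why the article's "quantize bigger rather than shrink" strategy is even an option.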

CPU-Only Optimization Tricks

  • Enable AVX-512: If CPU supports it, use `LLAMACPP_AVX512=1 ollama run phi`. ~20% speedup.
  • Reduce context window: Shorter context = faster. Use `--ctx-size 1024` instead of 4096.
  • Use llama.cpp instead of Ollama: Slightly faster on CPU (~10% gain) due to less overhead.
  • Disable multithreading: Counter-intuitive, but on weak CPUs, single-threaded is faster (no thread overhead).
  • Offload to iGPU: Even weak integrated GPU beats CPU. Check `lspci` for GPU availability.
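To check which SIMD extensions your CPU actually exposes (for the acceleration flags above), you can parse `/proc/cpuinfo` on Linux. A minimal sketch, assuming the Linux `flags:`/`Features:` line format; the helper is ours, not a standard API:

```python
def cpu_simd_flags(cpuinfo_text: str) -> set[str]:
    """Parse the 'flags' (x86) or 'Features' (ARM) line of
    /proc/cpuinfo and report which SIMD extensions are present."""
    wanted = {"avx", "avx2", "avx512f", "neon"}
    for line in cpuinfo_text.splitlines():
        if line.startswith(("flags", "Features")):
            present = set(line.split(":", 1)[1].split())
            return wanted & present
    return set()

# Usage on a real Linux machine (assumption: /proc/cpuinfo exists):
# with open("/proc/cpuinfo") as f:
#     print(sorted(cpu_simd_flags(f.read())))

sample = "flags\t: fpu sse sse2 avx avx2"
print(sorted(cpu_simd_flags(sample)))  # ['avx', 'avx2']
```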

Performance Benchmarks

Real measurements on various hardware (April 2026):

  • RTX 3060 12GB + Mistral 7B Q4: 15 tok/sec.
  • RTX 2060 4GB + Mistral 7B Q2: 20 tok/sec (aggressive quantization).
  • Apple M1 integrated GPU (MacBook Air) + Mistral 7B Q4: 8 tok/sec.
  • Ryzen 5 5600X CPU + Phi 2.7B Q4: 3 tok/sec.
  • Celeron N3050 (old laptop) + TinyLlama 1.1B Q4: 0.5 tok/sec (unusable).
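To turn these tok/sec figures into felt latency, divide reply length by throughput. A quick sketch using the benchmark numbers above (prompt-processing time is ignored, so real waits are somewhat longer):

```python
def response_seconds(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time to generate a reply of `tokens` tokens,
    ignoring prompt processing."""
    return tokens / tok_per_sec

# A 100-token reply on the benchmarked setups:
for label, rate in [("RTX 2060 + Mistral 7B Q2", 20),
                    ("Ryzen 5 5600X + Phi 2.7B Q4", 3),
                    ("Celeron N3050 + TinyLlama Q4", 0.5)]:
    print(f"{label}: {response_seconds(100, rate):.0f} s")
# RTX 2060 + Mistral 7B Q2: 5 s
# Ryzen 5 5600X + Phi 2.7B Q4: 33 s
# Celeron N3050 + TinyLlama Q4: 200 s
```

The 200-second reply is why the Celeron result is labeled unusable.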

Common Mistakes

  • Using TinyLlama on CPU thinking "it's small so it'll be fast". Mistral Q2 is faster and better quality.
  • Not enabling CPU acceleration flags (AVX, NEON). 20% speedup available for free.
  • Quantizing to Q2 to force a 7B model into 4GB without budgeting for KV cache overhead. Long contexts can still trigger OOM; shrink the context window or fall back to a 3B model.

FAQ

Can I run Mistral 7B on a 4GB GPU?

At Q2, yes. At Q4, no: the weights alone need ~3.5 GB, which leaves no room for the KV cache (OOM). Q2's ~10% quality loss is acceptable for chat.

Is CPU inference usable for chatbots?

Yes, for low-throughput use. At 3 tok/sec, a 100-token reply takes ~33 seconds. Not ideal, but it works.

Should I use Phi 2.7B or TinyLlama 1.1B on CPU?

Phi. It is only ~0.5 tok/sec slower but much better quality. TinyLlama is the last-resort model.

How do I check if my GPU supports CUDA?

Run `nvidia-smi`. If the command fails or is not found, you have no working NVIDIA GPU/driver. For integrated GPUs, check Intel/AMD documentation instead.
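The same check can be scripted. A minimal sketch (the helper name is ours; it only detects NVIDIA hardware with a working driver, and proves nothing about integrated GPUs):

```python
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    """Heuristic: `nvidia-smi` on PATH and exiting 0 implies a
    working NVIDIA driver and GPU. Does not detect Intel/AMD iGPUs."""
    path = shutil.which("nvidia-smi")
    if path is None:
        return False
    return subprocess.run([path], capture_output=True).returncode == 0

print(has_nvidia_gpu())
```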

Can I use quantization below Q2?

Technically yes (Q1), but quality degrades catastrophically. Not recommended.

Is CPU + GPU hybrid inference supported?

Yes, via layer offloading. Llama.cpp: `--n-gpu-layers 10`. Splits model across CPU/GPU.
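A reasonable starting value for `--n-gpu-layers` can be estimated from VRAM and model size. A rough sketch, assuming Mistral 7B Q4's ~32 transformer layers are about equal-sized (they are not exactly) and reserving some VRAM for the KV cache; the defaults and the helper are illustrative, not llama.cpp's own logic:

```python
def layers_that_fit(vram_gb: float, n_layers: int = 32,
                    model_gb: float = 3.5, reserve_gb: float = 0.5) -> int:
    """Estimate how many layers fit in VRAM for llama.cpp's
    --n-gpu-layers, reserving `reserve_gb` for KV cache/overhead."""
    per_layer = model_gb / n_layers
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / per_layer))

print(layers_that_fit(2.0))  # partial offload on a 2 GB GPU
print(layers_that_fit(8.0))  # whole model fits on an 8 GB GPU
```

If generation OOMs at long contexts, lower the layer count (or the context size) and retry.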

Sources

  • Phi 2.7B model card (Microsoft Research)
  • TinyLlama 1.1B documentation (TinyLlama project)
  • Llama.cpp optimization guide: CPU acceleration flags

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum for free →

