Key Takeaways
- 4 GB RAM, CPU only: Qwen3 1.7B Q4_K_M — 25–40 tok/s. Fastest response on minimal hardware.
- 8 GB RAM, CPU only (sweet spot): Phi-4-mini 3.8B Q4_K_M — 15–25 tok/s. Coding and reasoning on old laptops.
- 8 GB RAM + Intel Iris iGPU: Qwen3 4B — 12–20 tok/s with partial GPU offload.
- 16 GB RAM, CPU only: Qwen3 8B Q4_K_M — 8–15 tok/s. Strong quality, no GPU needed.
- 16 GB RAM + iGPU: Llama 3.2 3B or Qwen3 4B — 20–35 tok/s with layer offload.
- Winner verdict: For most low-end PCs, Phi-4-mini (3.8B) at Q4_K_M is the sweet spot: it fits in 8 GB RAM and runs at 15–25 tok/s on CPU. Drop to Qwen3 1.7B for the absolute fastest response.
- Cost: All free (open source) vs. ChatGPT API (~$0.002 per 1K tokens).
📍 In One Sentence
On a CPU-only PC with 8 GB RAM, Phi-4-mini 3.8B Q4_K_M runs at 15–25 tok/s and handles coding and reasoning; on 4 GB RAM, Qwen3 1.7B Q4_K_M hits 25–40 tok/s.
💬 In Plain Terms
You don't need a gaming GPU to run a local AI. These models run entirely on your CPU and regular RAM. Smaller models (1–4B parameters) are surprisingly capable for everyday tasks, and they're fast enough for a real conversation.
What is the Fastest Model for Your Hardware?
Match your hardware to the right model — the wrong choice leaves 4–10× speed on the table. All tiers below are CPU-only unless noted.
| Your Hardware | Recommended Model | Ollama Command | Expected Speed |
|---|---|---|---|
| 4 GB RAM, CPU only | Qwen3 1.7B Q4_K_M | ollama run qwen3:1.7b | 25–40 tok/s |
| 8 GB RAM, CPU only | Phi-4-mini 3.8B Q4_K_M | ollama run phi4-mini | 15–25 tok/s |
| 8 GB RAM + Intel Iris iGPU | Qwen3 4B Q4_K_M | ollama run qwen3:4b | 12–20 tok/s |
| 16 GB RAM, CPU only | Qwen3 8B Q4_K_M | ollama run qwen3:8b | 8–15 tok/s |
| 16 GB RAM + iGPU | Llama 3.2 3B Q4_K_M | ollama run llama3.2:3b | 20–35 tok/s |
Which Model Should You Use?
Match your situation to the right model — this is the single most important decision:
- 8 GB RAM laptop (no discrete GPU): Phi-4-mini (3.8B) at Q4_K_M — 15–25 tok/s, handles coding and reasoning. `ollama run phi4-mini`
- 4 GB RAM, very old PC: Qwen3 1.7B Q4_K_M — 25–40 tok/s, fastest response on minimal RAM. `ollama run qwen3:1.7b`
- 16 GB RAM, no GPU: Qwen3 8B Q4_K_M — 8–15 tok/s, strong quality. `ollama run qwen3:8b`
- 8 GB RAM + Intel Iris iGPU: Qwen3 4B — use `OLLAMA_NUM_GPU=1` for partial offload, 12–20 tok/s. `ollama run qwen3:4b`
- Want multilingual (128K context): Qwen3 4B or Llama 3.2 3B — both support 128K context on Ollama.
- For per-RAM-tier picks and thermals, see how to run a local LLM on a laptop.
Which Local LLM Should You Run on Your Hardware?
All tiers below are CPU-only or iGPU. Choose the largest model that fits your RAM at Q4_K_M — quantization degrades quality less than dropping to a smaller model.
| Hardware | Model | Quant | Speed | Experience |
|---|---|---|---|---|
| 4 GB RAM, CPU only | Qwen3 1.7B | Q4_K_M | 25–40 t/s | fast, usable quality |
| 8 GB RAM, CPU only | Phi-4-mini 3.8B | Q4_K_M | 15–25 t/s | coding + reasoning |
| 8 GB RAM + Iris iGPU | Qwen3 4B | Q4_K_M | 12–20 t/s | partial GPU offload |
| 16 GB RAM, CPU only | Qwen3 8B | Q4_K_M | 8–15 t/s | strong quality |
| 16 GB RAM + iGPU | Llama 3.2 3B | Q4_K_M | 20–35 t/s | smooth on iGPU |
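The table above can be read as a lookup from hardware to model. The sketch below encodes it that way; the rows are copied from the table, while the function name `recommend` and the "largest RAM tier that fits" rule are assumptions made for illustration.

```python
# Rows copied from the hardware table above. The function name and the
# "largest RAM tier that fits" rule are illustrative assumptions.
TIERS = [
    # (min_ram_gb, uses_igpu, model, ollama_command)
    (4, False, "Qwen3 1.7B", "ollama run qwen3:1.7b"),
    (8, False, "Phi-4-mini 3.8B", "ollama run phi4-mini"),
    (8, True, "Qwen3 4B", "ollama run qwen3:4b"),
    (16, False, "Qwen3 8B", "ollama run qwen3:8b"),
    (16, True, "Llama 3.2 3B", "ollama run llama3.2:3b"),
]


def recommend(ram_gb, has_igpu=False):
    """Return (model, command) for the largest tier that fits, or None."""
    rows = [t for t in TIERS if t[0] <= ram_gb and t[1] == has_igpu]
    if not rows and has_igpu:
        # No iGPU row exists for this RAM size (e.g. 4 GB): use CPU-only rows.
        rows = [t for t in TIERS if t[0] <= ram_gb and not t[1]]
    if not rows:
        return None  # less RAM than the smallest tier in the table
    best = max(rows, key=lambda t: t[0])
    return best[2], best[3]


print(recommend(8))
print(recommend(16, has_igpu=True))
```

A machine that sits between tiers (say 12 GB) falls back to the next tier down, which matches the advice to pick the largest model that fits.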
GPU vs CPU for Local LLMs: Which Is Faster on Low-End Hardware?
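The figures below give a relative ordering of devices. No model size is attached to them, so they are not directly comparable to the per-tier numbers above.
GPU inference: 15–20 tok/sec on an RTX 3060. Requires CUDA setup. Fastest of the three options; output quality depends on the model and quantization, not on the device. See budget GPU guide for cost-effective options.
iGPU (integrated): 5–8 tok/sec on Intel Iris. No setup needed. Slower than a discrete GPU.
CPU inference: 1–5 tok/sec on a modern multi-core CPU. Runs everywhere. Slowest.
Rule: If you have any GPU (even integrated), use it. CPU is the last resort.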
Why Smaller Models Are Faster on Low-End PCs
Model size directly determines speed. A 1B–3B model fits entirely in system RAM, so the CPU or GPU can stream weights continuously. A model that does not fit forces memory swapping (moving data between RAM and disk), which slows generation by 10–100× because the bottleneck becomes disk I/O, not compute.
The hardware tables above reflect this principle. TinyLlama 1.1B, for example, reaches 5–10 tok/sec on old CPUs, while 13B+ models are impractical on low-end hardware because swapping dominates.
- 1B–3B models: Fit in 4–8 GB RAM → fastest generation → acceptable quality
- 7B models: Borderline on 8 GB systems → slower due to memory pressure → high quality
- 13B+ models: Need 8–10 GB for the weights alone at Q4, so in practice 16+ GB of memory, or they swap heavily → too slow for interactive use
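Whether a model swaps comes down to simple arithmetic: weight size is roughly parameters × bits per weight ÷ 8. The sketch below applies that estimate. The 4.5 bits per weight used for Q4_K_M and the 2.5 GB reserved for the OS are assumptions, and the KV cache is ignored, so treat the result as an optimistic first check.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion, bits_per_weight, ram_gb, reserved_gb=2.5):
    """True if the weights fit after reserving memory for the OS.

    reserved_gb is an assumed allowance; the KV cache is not modelled.
    """
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - reserved_gb


# 4.5 bits/weight is an assumed effective rate for a Q4-class quantization.
for name, params, ram in [("1.7B on 4 GB", 1.7, 4),
                          ("3.8B on 8 GB", 3.8, 8),
                          ("13B on 8 GB", 13, 8)]:
    print(name, round(weights_gb(params, 4.5), 2), "GB",
          "fits" if fits_in_ram(params, 4.5, ram) else "swaps")
```

A model that only just fits by this estimate will still come under memory pressure once the context grows, which is the "borderline" case in the list above.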
How Fast Are Local LLMs on Low-End PCs?
On CPU-only systems, expect:
- 3B models → 15–40 tokens/sec (older CPUs: 10–15, newer CPUs with optimization: 30–40)
- 7B models → 10–25 tokens/sec (depends on CPU cores and quantization; with aggressive optimization some reach 30+)
- This is slower than cloud APIs (ChatGPT 4o: 80–150 tok/sec) but sufficient for interactive use. A 3B model at 25 tok/sec generates a 500-token response in 20 seconds — acceptable for non-time-critical tasks like code review, summarization, and creative writing.
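The 20-second figure above is tokens divided by throughput. A small helper makes it easy to check other combinations; the function names are illustrative, not part of any tool mentioned here.

```python
def response_seconds(tokens, tok_per_sec):
    """Time to generate `tokens` at a steady `tok_per_sec` (prompt processing ignored)."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens / tok_per_sec


def ms_per_token(tok_per_sec):
    """Average gap between tokens, in milliseconds."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return 1000 / tok_per_sec


print(response_seconds(500, 25))  # the 500-token example from the text
print(ms_per_token(25))
```

The estimate covers generation only. Time spent reading a long prompt comes on top of it.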
How Does Quantization Affect Speed on Low-End PCs?
Q4 (4-bit): ~1% quality loss, 50% memory savings versus 8-bit. Standard choice. For details on all quantization levels and how they work, see the full guide.
Q3 (3-bit): ~3% quality loss, 62% memory savings versus 8-bit. Acceptable for chat.
Q2 (2-bit): ~10% quality loss, 75% memory savings versus 8-bit. Risky; use only if Q4 runs out of memory (OOM).
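The savings percentages above line up with a comparison against 8-bit weights. That baseline is inferred from the numbers rather than stated in the source. A minimal sketch of the arithmetic, with a 16-bit baseline shown for contrast:

```python
def memory_savings(bits, baseline_bits=8):
    """Fraction of weight memory saved by storing `bits` per weight
    instead of `baseline_bits`. The 8-bit default is an assumption."""
    return 1 - bits / baseline_bits


for bits in (4, 3, 2):
    print(f"Q{bits}: {memory_savings(bits):.1%} vs 8-bit, "
          f"{memory_savings(bits, 16):.1%} vs 16-bit")
```

Real quantization formats store some extra metadata per block, so actual file sizes come out slightly larger than this ideal figure.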
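Speed impact: Q2 is ~30% faster than Q4 because each token reads fewer bytes from memory, not because it needs less computation.
Strategy: Prefer a larger model at a lower quantization (Mistral Small Q2) over a tiny model (TinyLlama).
Mistral Small Q2 beats TinyLlama 1.1B Q4 on quality. It does not beat it on speed: the larger model still reads more bytes per token.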
Faster models trade quality for speed — but tuning temperature and top-p recovers much of that quality loss. Lower temperature (0.1–0.3) on fast models produces more consistent output than default settings. See temperature and top-p explained for the exact settings.
How Do You Speed Up CPU-Only Inference?
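- Check AVX2/AVX-512 support: Ollama and llama.cpp detect the CPU's SIMD extensions on their own, and `LLAMACPP_AVX512` is not a documented setting in either project. If you build llama.cpp from source, build it on the machine that will run it so the native instruction set is used. AVX-512 is worth ~20% where the CPU supports it.
- Reduce context window: Shorter context = faster. Use `--ctx-size 1024` instead of 4096 in llama.cpp, or set `num_ctx` to 1024 in Ollama.
- Use llama.cpp instead of Ollama: Slightly faster on CPU (~10% gain) due to less overhead.
- Tune the thread count: Counter-intuitive, but more threads is not always faster. On weak CPUs, try fewer threads than logical cores (llama.cpp `--threads`), down to a single thread, and keep whichever setting measures fastest.
- Offload to iGPU: Even a weak integrated GPU often beats the CPU. On Linux, check `lspci` for GPU availability.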
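Several of the tips above map onto llama.cpp command-line flags. The sketch below assembles a `llama-cli` invocation from them. The model path and default values are hypothetical, and the helper only builds the command; it does not run it.

```python
import shlex


def llama_cli_command(model_path, prompt, ctx_size=1024, threads=4, gpu_layers=0):
    """Build an argument list for llama.cpp's llama-cli.

    The defaults are illustrative assumptions, not tuned values.
    """
    return [
        "llama-cli",
        "--model", model_path,
        "--ctx-size", str(ctx_size),          # shorter context, less memory
        "--threads", str(threads),            # benchmark a few values
        "--n-gpu-layers", str(gpu_layers),    # 0 keeps everything on the CPU
        "--prompt", prompt,
    ]


# Hypothetical model file name, for illustration only.
cmd = llama_cli_command("models/phi-4-mini-q4_k_m.gguf", "Summarize this file.")
print(shlex.join(cmd))
```

Timing the same prompt at a few `threads` values is the quickest way to find the setting that suits a given CPU.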
How Fast Are These Models? Real Benchmarks (June 2026)
Real measurements by hardware tier, June 2026. All running Ollama with default settings, no tuning. All CPU-only or iGPU — no discrete GPU:
- 4 GB RAM, CPU only (Intel N100 mini PC) + Qwen3 1.7B Q4_K_M: 25–35 tok/s. `ollama run qwen3:1.7b`
- 8 GB RAM, CPU only (Core i5-1235U) + Phi-4-mini Q4_K_M: 15–22 tok/s. `ollama run phi4-mini`
- 8 GB RAM + Radeon iGPU (Ryzen 5 5600G) + Qwen3 4B Q4_K_M: 18–25 tok/s with layer offload.
- 8 GB RAM + Intel Iris Xe (12th gen i5) + Qwen3 4B Q4_K_M: 12–18 tok/s. `ollama run qwen3:4b`
- 16 GB RAM, CPU only (Ryzen 7 7700X) + Qwen3 8B Q4_K_M: 8–13 tok/s. `ollama run qwen3:8b`
- 16 GB RAM + Iris Xe iGPU + Llama 3.2 3B Q4_K_M: 20–30 tok/s. `ollama run llama3.2:3b`
What is Actually "Fast" for Local LLMs?
Speed feels different depending on the task. Use this as your reference:
- Below 10 tok/sec → feels broken. Words appear one at a time with noticeable pauses. Unusable for interactive chat.
- 15–25 tok/sec → acceptable. Readable speed for most users. Good for Q&A, summaries, and coding help.
- 30+ tok/sec → smooth. Feels like a real assistant. Comfortable for all interactive tasks.
- 60+ tok/sec → instant. Faster than you can read. Ideal for real-time autocomplete and rapid iteration.
If your model runs below 15 tok/sec, downgrade model size (7B → 3B) or drop one quantization level (Q5 → Q4) before buying new hardware.
What to Avoid on Low-End PCs
- Do not run 13B+ models: they exceed RAM limits. A 13B model at Q4 requires 8–10 GB of memory, beyond practical low-end PC capacity. Even with aggressive Q2 quantization, 13B models require 5–6 GB, leaving insufficient headroom for the OS and GPU scheduling overhead. Stick to 7B and below.
- Avoid Q8 quantization: slower with minimal quality gain. Q8 uses about 1.5× the memory of Q4 (8 GB vs 5.5 GB for Mistral Small) while delivering only ~2% quality improvement. For 4 GB systems, Q8 is impractical; for 8 GB systems, Q4 remains optimal. Q3 is the only trade-off worth considering when Q4 runs out of memory.
- Do not expect real-time autocomplete performance. At 3 tok/sec on CPU, generating 50 tokens takes about 17 seconds. Interactive autocomplete requires ≥20 tok/sec. Local LLMs on low-end CPUs work for batch chat, drafting, and review, not for live autocomplete or code-as-you-type scenarios.
- Do not use CPU-only inference for production chatbots. Acceptable for internal tools, prototypes, and offline batch work. Cloud APIs (15–20 ms per token) outperform low-end CPUs (300+ ms per token) for user-facing services. Use local inference for privacy-critical or offline scenarios, not speed-critical ones.
Common Mistakes
- Mistake: Using TinyLlama on CPU for better speed. Problem: TinyLlama belongs on the 4 GB VRAM tier, not on CPU. Phi-4-mini 3.8B produces far better output on CPU-only hardware and is still fast enough for chat. Fix: Run Phi-4-mini 3.8B on CPU; keep TinyLlama Q5 for 4 GB VRAM.
- Mistake: Not enabling CPU acceleration. Problem: A build without AVX/NEON support forfeits a ~20% speedup that costs nothing. Fix: Use a current Ollama or llama.cpp build, which detects AVX2, AVX-512, and NEON on its own; if you compile llama.cpp yourself, compile it on the machine that will run it.
- Mistake: Quantizing to Q2 to force a 7B model into 4 GB. Problem: Even when the Q2 weights fit, the KV cache grows during inference and often triggers out-of-memory crashes. Fix: Use a 3B model at Q4 instead.
- Mistake: Assuming newer hardware always means faster inference. Problem: Token generation is limited by memory bandwidth, so a desktop Ryzen is not automatically faster per token than a mobile ARM chip with faster memory. Fix: Benchmark your actual hardware.
- Mistake: Using the wrong Ollama slug for your model. Problem: `ollama run phi` loads Phi-2, not Phi-4 Mini. Fix: Use `ollama run phi4-mini` for the latest Phi model. Always check ollama.com/library for exact model tags.
Local LLMs on Low-End PCs: Regional Context
EU / GDPR: With local inference on low-end hardware, no inference data leaves the device. For many SMEs and freelancers this is a technically straightforward way to avoid Art. 44 GDPR transfer risks. The EU AI Act (in force since August 2024, with the first obligations applying from February 2025) does not impose documentation requirements on personal-use inference. For German SMEs using local LLMs for internal business tasks, BSI-Grundschutz recommends local inference for sensitive document processing. Overall data protection compliance still depends on your full operational setup, not the inference architecture alone.
Japan: METI AI Governance Guidelines encourage data minimization. CPU inference on low-end hardware, while slow, satisfies the strictest data sovereignty requirements — no API calls, no logging, no third-party data access. For Japanese users running Qwen3 on CPU for Japanese-language tasks, throughput of 1–3 tok/sec is acceptable for non-time-critical document summarization.
China: Local inference on consumer hardware is common for Qwen3 and DeepSeek-R1 deployments in China, where cloud API access to non-Chinese models is restricted. Qwen3 1.7B and 4B run on CPU-only hardware, providing a functional alternative to cloud APIs for users with constrained hardware.
Common Questions About Running Local LLMs on Low-End PCs
What qualifies as a low-end PC for running local LLMs?
A low-end PC for local LLMs is any machine with less than 8GB of dedicated VRAM, or a CPU-only system. This includes most laptops with Intel Iris or AMD Radeon integrated graphics, desktop PCs with GTX 1060 or older GPUs, and Chromebooks. The key constraint is not the CPU speed but the memory available to hold model weights.
Can I run Mistral Small on a 4GB GPU?
At Q2 quantization, yes. At Q4, no (OOM crash). Q2 has acceptable quality loss (~5-10% lower MMLU score), but speed increases by 30%. This is a practical trade-off for users with limited VRAM.
Is CPU inference usable for chatbots?
Yes, for low-throughput async scenarios. At 3 tok/sec, a 100-token response takes ~33 seconds. This is too slow for interactive conversation but acceptable for overnight batch processing or non-real-time tasks like email drafting.
Should I use Phi-4 Mini or TinyLlama 1.1B on CPU?
Phi-4 Mini 3.8B is the better choice for CPU-only systems — it hits 5–15 tok/sec and produces significantly better output quality than TinyLlama. TinyLlama 1.1B Q5 is optimized for 4 GB VRAM (20–40 tok/sec), not for CPU-only inference.
How do I check if my GPU supports CUDA?
Run `nvidia-smi` in terminal. If it prints GPU info, you have CUDA support. If it returns "command not found" or "no NVIDIA GPU", check Intel/AMD documentation for integrated GPU drivers.
How does quantization affect inference speed?
Quantization primarily reduces memory bandwidth requirements, not computation. Q2 (2-bit) is about 30% faster than Q4 (4-bit) because the model loads fewer bytes per forward pass. However, Q2 carries a ~10% quality penalty. The practical rule: use Q4 as default, drop to Q2 only if you cannot fit the model in available VRAM at Q4.
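A common rule of thumb follows from this: if every generated token reads all the weights once, throughput cannot exceed memory bandwidth divided by weight size. The sketch below computes that ceiling. The 40 GB/s bandwidth is a hypothetical figure and the one-read-per-token model is an assumption; compute, caches, and the KV cache are ignored.

```python
def max_tok_per_sec(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on tokens/sec for bandwidth-bound generation.

    Assumes each token streams the full set of weights from memory once.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb


BANDWIDTH = 40  # GB/s, hypothetical figure for a dual-channel laptop
print(max_tok_per_sec(BANDWIDTH, 4, 4))  # 4B model at 4 bits
print(max_tok_per_sec(BANDWIDTH, 4, 2))  # same model at 2 bits
```

The bound doubles when the bit width halves. Measured gains, such as the ~30% quoted above, are smaller because the bound leaves out everything except weight traffic.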
Can I use quantization below Q2?
Technically yes (Q1), but quality degrades catastrophically — up to 30% loss in accuracy. Not recommended for any practical use case.
Is CPU + GPU hybrid inference supported?
Yes, via layer offloading. With llama.cpp you can use `--n-gpu-layers 10` to offload the first 10 layers to GPU while keeping the rest on CPU. This hybrid approach gives you speed closer to GPU on limited VRAM.
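A rough way to pick the `--n-gpu-layers` value is to divide the usable VRAM by the size of one layer. The sketch below does that under simplifying assumptions: all layers are the same size, the embedding and output tensors are ignored, and `reserve_gb` is a guessed allowance for the KV cache and driver overhead. The model size and layer count in the example are hypothetical.

```python
def gpu_layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Estimate how many layers fit in VRAM, assuming equal-size layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 4.5 GB model with 32 layers on a GPU with 2 GB of VRAM.
layers = gpu_layers_that_fit(4.5, 32, vram_gb=2.0, reserve_gb=0.5)
print(f"--n-gpu-layers {layers}")
```

If the run still hits an out-of-memory error, lower the number and retry; the estimate is only a starting point.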
What is the fastest local LLM?
The fastest models are 1B–3B parameter models like Llama 3.2 3B, which can reach 15–40 tokens/sec on optimized modern CPUs and up to 40–60 tok/sec with GPU acceleration. Speed depends more on hardware than model choice — a 7B on GPU (25–40 tok/sec) outpaces a 3B on CPU (10–25 tok/sec).
Can I run a local LLM on 4 GB RAM?
Yes — 1B models run comfortably on 4 GB systems (1–1.3 GB per model + 2–3 GB for OS and headroom). Larger models require more: 3B needs 2–3 GB, 7B needs 5.5–8 GB at Q4. For 4 GB systems, Llama 3.2 1B or TinyLlama 1.1B are practical choices, but quality is limited.
Is GPU required for speed?
No, but GPUs significantly increase speed. CPU-only systems can reach 10–25 tok/sec for 3B models with optimization; GPUs reach 25–60 tok/sec. For CPU-only users, smaller models (1B–3B) are essential. GPU is required only if you need interactive speeds on 7B+ models.
Sources
- Phi-4 Mini Model Card — Microsoft Research. 68% MMLU, 70% HumanEval. Released 2025.
- Gemma 3 Model Card — Google DeepMind. Gemma 3 4B with 128K context window. Released 2025.
- Llama 4 Scout (17B active / 109B total parameters) — Meta. 10M context window, released April 2025.
- TinyLlama 1.1B Repository — TinyLlama project. Training completed 2024. Model stable, no longer receiving updates. Still recommended for 4 GB VRAM tier.
- llama.cpp CPU Optimization Guide — CPU acceleration flags including AVX-512, NEON, and thread configuration.