Fastest Local LLMs for Low-End PCs 2026: CPU-Only Guide

8 min · By Hans Kuepper, founder of PromptQuorum (multi-model AI dispatch tool)

The fastest local LLM on a low-end PC (CPU only, 8 GB RAM) is Qwen3 1.7B at Q4_K_M — 25-40 tok/s on a modern i5/Ryzen 5. For usable quality on 8 GB, Phi-4-mini (3.8B) runs at 15-25 tok/s and handles coding and reasoning well. Every model in this guide runs without a GPU.

As of June 2026, you can run a capable local LLM on an old laptop with no discrete GPU. This guide covers hardware tiers from 4 GB CPU-only up to 16 GB with an Intel Iris iGPU, with Ollama commands for each.

Fastest Local LLMs for Low-End PCs (2026)

Speed depends on your hardware tier. Match your CPU/RAM to the right model — the wrong choice leaves 4–10× speed on the table.

  • 4 GB RAM, CPU only: Qwen3 1.7B Q4_K_M — 25–40 tok/s on modern i5/Ryzen 5. `ollama run qwen3:1.7b`
  • 8 GB RAM, CPU only: Phi-4-mini (3.8B) — 15–25 tok/s, handles coding and reasoning. `ollama run phi4-mini`
  • 8 GB RAM + Intel Iris iGPU: Qwen3 4B — 12–20 tok/s with partial offload. `ollama run qwen3:4b`
  • 16 GB RAM, CPU only: Qwen3 8B Q4_K_M — 8–15 tok/s, strong quality. `ollama run qwen3:8b`

For most low-end PCs, Phi-4-mini (3.8B) at Q4_K_M is the sweet spot — fits 8 GB RAM, 15-25 tok/s on CPU. Drop to Qwen3 1.7B for the absolute fastest response.

Key Takeaways

  • 4 GB RAM, CPU only: Qwen3 1.7B Q4_K_M — 25–40 tok/s. Fastest response on minimal hardware.
  • 8 GB RAM, CPU only (sweet spot): Phi-4-mini 3.8B Q4_K_M — 15–25 tok/s. Coding and reasoning on old laptops.
  • 8 GB RAM + Intel Iris iGPU: Qwen3 4B — 12–20 tok/s with partial GPU offload.
  • 16 GB RAM, CPU only: Qwen3 8B Q4_K_M — 8–15 tok/s. Strong quality, no GPU needed.
  • 16 GB RAM + iGPU: Llama 3.2 3B or Qwen3 4B — 20–35 tok/s with layer offload.
  • Winner verdict: For most low-end PCs, Phi-4-mini (3.8B) at Q4_K_M is the sweet spot — fits 8 GB RAM, 15-25 tok/s on CPU. Drop to Qwen3 1.7B for the absolute fastest response.
  • Cost: All free (open source) vs. ChatGPT API (~$0.002 per 1K tokens).

📍 In One Sentence

On a CPU-only PC with 8 GB RAM, Phi-4-mini 3.8B Q4_K_M runs at 15–25 tok/s and handles coding and reasoning; on 4 GB RAM, Qwen3 1.7B Q4_K_M hits 25–40 tok/s.

💬 In Plain Terms

You don't need a gaming GPU to run a local AI. These models run entirely on your CPU and regular RAM. Smaller models (1–4B parameters) are surprisingly capable for everyday tasks, and they're fast enough for a real conversation.

What is the Fastest Model for Your Hardware?

Match your hardware to the right model — the wrong choice leaves 4–10× speed on the table. All tiers below are CPU-only unless noted.

| Your Hardware | Recommended Model | Ollama Command | Expected Speed |
| --- | --- | --- | --- |
| 4 GB RAM, CPU only | Qwen3 1.7B Q4_K_M | `ollama run qwen3:1.7b` | 25–40 tok/s |
| 8 GB RAM, CPU only | Phi-4-mini 3.8B Q4_K_M | `ollama run phi4-mini` | 15–25 tok/s |
| 8 GB RAM + Intel Iris iGPU | Qwen3 4B Q4_K_M | `ollama run qwen3:4b` | 12–20 tok/s |
| 16 GB RAM, CPU only | Qwen3 8B Q4_K_M | `ollama run qwen3:8b` | 8–15 tok/s |
| 16 GB RAM + iGPU | Llama 3.2 3B Q4_K_M | `ollama run llama3.2:3b` | 20–35 tok/s |
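
The table can also be read as a simple lookup. The sketch below encodes its five rows and picks the largest tier that fits a given machine. The names `Pick`, `TABLE`, and `recommend` are illustrative, and falling back to the CPU-only row when a tier has no iGPU entry is an assumption, not something the table states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    model: str
    command: str
    speed_tok_s: tuple[int, int]  # (low, high) expected tokens per second


# Rows copied from the hardware table; keys are (RAM tier in GB, has iGPU).
TABLE = {
    (4, False): Pick("Qwen3 1.7B Q4_K_M", "ollama run qwen3:1.7b", (25, 40)),
    (8, False): Pick("Phi-4-mini 3.8B Q4_K_M", "ollama run phi4-mini", (15, 25)),
    (8, True): Pick("Qwen3 4B Q4_K_M", "ollama run qwen3:4b", (12, 20)),
    (16, False): Pick("Qwen3 8B Q4_K_M", "ollama run qwen3:8b", (8, 15)),
    (16, True): Pick("Llama 3.2 3B Q4_K_M", "ollama run llama3.2:3b", (20, 35)),
}


def recommend(ram_gb: float, has_igpu: bool = False) -> Pick:
    """Return the table row for the largest RAM tier that fits the machine."""
    if ram_gb < 4:
        raise ValueError("the table starts at 4 GB of RAM")
    tier = max(t for t in (4, 8, 16) if t <= ram_gb)
    # Assumption: a tier without an iGPU row (4 GB) uses its CPU-only row.
    return TABLE.get((tier, has_igpu), TABLE[(tier, False)])


print(recommend(8).command)  # ollama run phi4-mini
```

A machine with 12 GB of RAM is treated as the 8 GB tier here, which errs on the side of a model that leaves headroom.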

Which Model Should You Use?

Match your situation to the right model — this is the single most important decision:

  • 8 GB RAM laptop (no discrete GPU): Phi-4-mini (3.8B) at Q4_K_M — 15–25 tok/s, handles coding and reasoning. `ollama run phi4-mini`
  • 4 GB RAM, very old PC: Qwen3 1.7B Q4_K_M — 25–40 tok/s, fastest response on minimal RAM. `ollama run qwen3:1.7b`
  • 16 GB RAM, no GPU: Qwen3 8B Q4_K_M — 8–15 tok/s, strong quality. `ollama run qwen3:8b`
  • 8 GB RAM + Intel Iris iGPU: Qwen3 4B — use `OLLAMA_NUM_GPU=1` for partial offload, 12–20 tok/s. `ollama run qwen3:4b`
  • Need multilingual support or long context (128K): Qwen3 4B or Llama 3.2 3B; both support 128K context on Ollama.
  • For per-RAM-tier picks and thermals, see how to run a local LLM on a laptop.

Which Local LLM Should You Run on Your Hardware?

All tiers below are CPU-only or iGPU. Choose the largest model that fits your RAM at Q4_K_M — quantization degrades quality less than dropping to a smaller model.

| Hardware | Model | Quant | Speed | Experience |
| --- | --- | --- | --- | --- |
| 4 GB RAM, CPU only | Qwen3 1.7B | Q4_K_M | 25–40 t/s | fast, usable quality |
| 8 GB RAM, CPU only | Phi-4-mini 3.8B | Q4_K_M | 15–25 t/s | coding + reasoning |
| 8 GB RAM + Iris iGPU | Qwen3 4B | Q4_K_M | 12–20 t/s | partial GPU offload |
| 16 GB RAM, CPU only | Qwen3 8B | Q4_K_M | 8–15 t/s | strong quality |
| 16 GB RAM + iGPU | Llama 3.2 3B | Q4_K_M | 20–35 t/s | smooth on iGPU |

Local LLM speed by hardware tier (CPU-only and iGPU): 4 GB RAM (25–40 tok/s, Qwen3 1.7B), 8 GB RAM CPU (15–25 tok/s, Phi-4-mini), 8 GB + Iris iGPU (12–20 tok/s), 16 GB CPU (8–15 tok/s), 16 GB + iGPU (20–35 tok/s). June 2026 benchmarks.

GPU vs CPU for Local LLMs: Which Is Faster on Low-End Hardware?

GPU inference: 25–60 tok/sec on an RTX 3060 (8 GB). Requires CUDA setup. Fastest option, with room for larger models. See budget GPU guide for cost-effective options.

iGPU (integrated): 12–20 tok/sec on Intel Iris Xe with Qwen3 4B and partial offload. Slower than a discrete GPU.

CPU inference: 8–40 tok/sec on a modern multi-core CPU, depending on model size (Qwen3 8B at the low end, Qwen3 1.7B at the high end). Runs everywhere.

Rule: If you have any GPU (even integrated), use it. CPU-only inference is the fallback.

CPU vs GPU speed comparison for local LLMs: CPU-only reaches 10–25 tok/sec (3B models) and 15–40 tok/sec. GPU (RTX 3060, 8 GB) hits 25–60 tok/sec — 4–10× faster than CPU-only inference.

Why Smaller Models Are Faster on Low-End PCs

Model size directly determines speed. A 1B–3B model fits entirely in system RAM, allowing the CPU or GPU to stream data continuously. Larger models require memory swapping — moving data between RAM and disk — which slows generation by 10–100× (the bottleneck is disk I/O, not compute).

This is why the hardware table above favors small models: a 1.1B model such as TinyLlama reaches 5–10 tok/sec even on old CPUs, while 13B+ models are impractical on low-end hardware because swapping dominates.

  • 1B–3B models: Fit in 4–8 GB RAM → fastest generation → acceptable quality
  • 7B models: Borderline on 8 GB systems → slower due to memory pressure → high quality
  • 13B+ models: Require 16+ GB of RAM (or VRAM) or swap heavily → too slow for interactive use
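
A rough way to check the "fits in RAM" condition is to estimate the weight size from the parameter count. The sketch below assumes about 4.5 bits per weight for a Q4-style quantization and reserves 2.5 GB for the OS, in line with the 2–3 GB of headroom mentioned in the FAQ. Both numbers are assumptions, and the KV cache is ignored, so treat the result as optimistic.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB.

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB.
    """
    return params_billion * bits_per_weight / 8


def fits_in_ram(
    params_billion: float,
    ram_gb: float,
    bits_per_weight: float = 4.5,  # assumption: nominal Q4-style average
    reserved_gb: float = 2.5,      # assumption: OS and application headroom
) -> bool:
    """True if the weights fit in RAM after reserving headroom (KV cache ignored)."""
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - reserved_gb


for params, ram in ((1.7, 4), (3.8, 8), (13, 8)):
    size = weights_gb(params)
    print(f"{params}B on {ram} GB: {size:.2f} GB of weights, fits={fits_in_ram(params, ram)}")
```

Under these assumptions a 13B model needs about 7.3 GB for weights alone, which is why it swaps on an 8 GB machine.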

How Fast Are Local LLMs on Low-End PCs?

On CPU-only systems, expect:

  • 3B models → 15–40 tokens/sec (older CPUs: 10–15, newer CPUs with optimization: 30–40)
  • 7B models → 10–25 tokens/sec (depends on CPU cores and quantization; with aggressive optimization some reach 30+)

This is slower than cloud APIs (GPT-4o: 80–150 tok/sec) but sufficient for interactive use. A 3B model at 25 tok/sec generates a 500-token response in 20 seconds, which is acceptable for non-time-critical tasks like code review, summarization, and creative writing.
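
The response-time figure is simply tokens divided by throughput. A minimal sketch (the function name `response_seconds` is illustrative, and prompt-processing time is left out):

```python
def response_seconds(tokens: int, tok_per_sec: float) -> float:
    """Time to generate `tokens` at a steady rate, ignoring prompt processing."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens / tok_per_sec


print(response_seconds(500, 25))            # 20.0
print(round(response_seconds(100, 3), 1))   # 33.3
```

The same arithmetic is useful in reverse: divide the response length you typically need by the wait you can tolerate to get the minimum tok/sec worth targeting.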

How Does Quantization Affect Speed on Low-End PCs?

Q4 (4-bit): ~1% quality loss, 50% memory savings versus 8-bit. Standard choice. For details on all quantization levels and how they work, see the full guide.

Q3 (3-bit): ~3% quality loss, 62% memory savings. Acceptable for chat.

Q2 (2-bit): ~10% quality loss, 75% memory savings. Risky; use only if Q4 and Q3 run out of memory (OOM).

Speed impact: Q2 is ~30% faster than Q4 due to less memory bandwidth, not computation.
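
The savings percentages quoted for Q4, Q3, and Q2 are consistent with an 8-bit baseline, which is an inference from the figures rather than something the text states. Under that assumption the arithmetic is:

```python
def savings_vs_8bit(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to an 8-bit baseline."""
    return 1 - bits_per_weight / 8


for name, bits in (("Q4", 4), ("Q3", 3), ("Q2", 2)):
    print(f"{name}: {savings_vs_8bit(bits):.1%} smaller than 8-bit")
```

This uses nominal bit widths. Real GGUF quantization types such as Q4_K_M store scaling data alongside the weights, so actual files are somewhat larger than the nominal figure suggests.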

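Strategy: Prefer a larger model at a lower quantization level (Mistral Small Q2) over a tiny model at a higher one (TinyLlama 1.1B Q4).

Mistral Small Q2 beats TinyLlama 1.1B Q4 on output quality, although the much smaller TinyLlama still generates tokens faster.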

Faster models trade quality for speed — but tuning temperature and top-p recovers much of that quality loss. Lower temperature (0.1–0.3) on fast models produces more consistent output than default settings. See temperature and top-p explained for the exact settings.

Quantization trade-offs for local LLMs: Q4 (1% quality loss, 50% VRAM savings, 4.5 GB for Mistral Small) is the standard. Q2 is 30% faster but 10% quality drop. Avoid Q8 — 2× VRAM cost with minimal gain.

How Do You Speed Up CPU-Only Inference?

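  • Confirm AVX2/AVX-512 support: Current Ollama and llama.cpp builds typically detect the CPU's SIMD extensions and pick the matching code path on their own, so no flag is needed. On Linux, `lscpu | grep -i avx512` shows whether the CPU exposes AVX-512, which is worth roughly 20% where available.
  • Reduce context window: Shorter context = faster. With llama.cpp, use `--ctx-size 1024` instead of 4096; in an Ollama session, use `/set parameter num_ctx 1024`.
  • Use llama.cpp instead of Ollama: Slightly faster on CPU (~10% gain) due to less overhead.
  • Reduce thread count: Counter-intuitive, but on weak CPUs fewer threads can beat the default, because extra threads compete for the same memory bandwidth and add scheduling overhead. Start at the number of physical cores (`--threads` in llama.cpp, `num_thread` in Ollama) and go lower, down to a single thread, if that still helps.
  • Offload to iGPU: Even a weak integrated GPU beats the CPU. On Linux, check `lspci` for GPU availability.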
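
For Ollama users, context size, thread count, and GPU layer offload can be set per request through the `options` field of the local REST API. The sketch below builds such a request. It assumes Ollama is listening on its default port 11434, and the helper names `build_request` and `send` are illustrative rather than part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model, prompt, num_ctx=1024, num_thread=None, num_gpu=None):
    """Build a non-streaming /api/generate payload with tuning options."""
    options = {"num_ctx": num_ctx}            # context window in tokens
    if num_thread is not None:
        options["num_thread"] = num_thread    # CPU threads used for inference
    if num_gpu is not None:
        options["num_gpu"] = num_gpu          # number of layers offloaded to the GPU
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def send(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_request("phi4-mini", "Summarize AVX-512 in one sentence.", num_thread=4)
print(json.dumps(payload, indent=2))
# reply = send(payload)  # requires a running Ollama server with the model pulled
```

Trying two or three `num_thread` values with the same prompt is usually enough to see whether fewer threads help on a given CPU.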

How Fast Are These Models? Real Benchmarks (June 2026)

Real measurements by hardware tier, June 2026. All running Ollama with default settings, no tuning. All CPU-only or iGPU — no discrete GPU:

  • 4 GB RAM, CPU only (Intel N100 mini PC) + Qwen3 1.7B Q4_K_M: 25–35 tok/s. `ollama run qwen3:1.7b`
  • 8 GB RAM, CPU only (Core i5-1235U) + Phi-4-mini Q4_K_M: 15–22 tok/s. `ollama run phi4-mini`
  • 8 GB RAM + Radeon iGPU (Ryzen 5 5600G) + Qwen3 4B Q4_K_M: 18–25 tok/s with layer offload.
  • 8 GB RAM + Intel Iris Xe (12th gen i5) + Qwen3 4B Q4_K_M: 12–18 tok/s. `ollama run qwen3:4b`
  • 16 GB RAM, CPU only (Ryzen 7 7700X) + Qwen3 8B Q4_K_M: 8–13 tok/s. `ollama run qwen3:8b`
  • 16 GB RAM + Iris Xe iGPU + Llama 3.2 3B Q4_K_M: 20–30 tok/s. `ollama run llama3.2:3b`
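
To compare your own machine against these figures, generation speed can be computed from the statistics Ollama returns with each non-streaming API response: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. A minimal sketch (the sample response values are made up for illustration):

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate or /api/chat response."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    if seconds <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / seconds


# Hypothetical response fragment: 200 tokens generated in 10 seconds.
sample = {"eval_count": 200, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # 20.0
```

On the command line, `ollama run <model> --verbose` prints a comparable eval rate after each reply.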

What is Actually "Fast" for Local LLMs?

Speed feels different depending on the task. Use these thresholds as your reference:

  • Below 10 tok/sec → feels broken. Words appear one at a time with noticeable pauses. Unusable for interactive chat.
  • 15–25 tok/sec → acceptable. Readable speed for most users. Good for Q&A, summaries, and coding help.
  • 30+ tok/sec → smooth. Feels like a real assistant. Comfortable for all interactive tasks.
  • 60+ tok/sec → instant. Faster than you can read. Ideal for real-time autocomplete and rapid iteration.

If your model runs below 15 tok/sec, downgrade model size (7B → 3B) or drop one quantization level (Q5 → Q4) before buying new hardware.

Speed perception thresholds for local LLMs: below 10 tok/sec feels broken, 15–25 tok/sec is acceptable for Q&A, 30+ tok/sec is smooth for all tasks, 60+ tok/sec enables real-time autocomplete.
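
These thresholds translate directly into a small classifier for benchmark output. The published ranges leave 10–15 and 25–30 tok/sec unlabelled; the sketch below assumes that 10–15 counts as "borderline" and that 25–30 still counts as "acceptable". Both choices are assumptions made here for illustration.

```python
def perceived_speed(tok_per_sec: float) -> str:
    """Map a measured generation speed to the perception buckets above."""
    if tok_per_sec < 10:
        return "feels broken"
    if tok_per_sec < 15:
        return "borderline"   # assumption: gap between the published ranges
    if tok_per_sec < 30:
        return "acceptable"   # assumption: 25-30 is folded into this bucket
    if tok_per_sec < 60:
        return "smooth"
    return "instant"


for speed in (3, 12, 20, 35, 60):
    print(speed, "tok/sec:", perceived_speed(speed))
```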

What to Avoid on Low-End PCs

  • Do not run 13B+ models: they exceed RAM limits. A 13B model at Q4 requires 8–10 GB of memory, pushing beyond practical low-end PC capacity. Even with aggressive Q2 quantization, 13B models require 5–6 GB, leaving insufficient headroom for the OS and GPU scheduling overhead. Stick to 7B and below.
  • Avoid Q8 quantization — slower with minimal quality gain. Q8 uses nearly 2× the VRAM of Q4 (8 GB vs 5.5 GB for Mistral Small) while delivering only ~2% quality improvement. For 4 GB systems, Q8 is impractical; for 8 GB systems, Q4 remains optimal. Q3 is the only trade-off worth considering when Q4 OOMs.
  • Do not expect real-time autocomplete performance. At 3 tok/sec on CPU, generating 50 tokens takes about 17 seconds. Interactive autocomplete requires ≥20 tok/sec. Local LLMs on low-end CPUs work for batch chat, drafting, and review, not for live autocomplete or code-as-you-type scenarios.
  • Do not use CPU-only inference for production chatbots. Acceptable for internal tools, prototypes, and offline batch work. Cloud APIs (15–20 ms latency) outperform low-end CPU (300+ ms latency) for user-facing services. Use local inference for privacy-critical or offline scenarios, not speed-critical ones.

Common Mistakes

  • Mistake: Using TinyLlama on CPU for better speed. Problem: TinyLlama belongs on 4 GB VRAM, not CPU — Phi-4 Mini 3.8B is faster and far better on CPU-only hardware. Fix: Run Phi-4 Mini 3.8B on CPU; keep TinyLlama Q5 for 4 GB VRAM.
  • Mistake: Ignoring CPU SIMD acceleration. Problem: Running a build without AVX/NEON support gives up a ~20% speedup that costs nothing. Fix: Use a current Ollama or llama.cpp release, which typically selects AVX2, AVX-512, or NEON code paths automatically, and confirm the CPU exposes them (for example `lscpu | grep -i avx` on Linux).
  • Mistake: Quantizing to Q2 to force 7B into 4GB. Problem: Even at Q2, the weights plus the KV cache that grows during inference often exceed 4 GB and cause out-of-memory crashes. Fix: Use a 3B model at Q4 instead.
  • Mistake: Assuming newer hardware always means faster inference. Problem: Token generation is limited mainly by memory bandwidth, so a desktop Ryzen is not automatically faster per token than a mobile ARM chip with faster memory. Fix: Benchmark your actual hardware.
  • Mistake: Using the wrong Ollama slug for your model. Problem: `ollama run phi` loads Phi-2, not Phi-4 Mini. Fix: Use `ollama run phi4-mini` for the latest Phi model. Always check ollama.com/library for exact model tags.

Local LLMs on Low-End PCs: Regional Context

EU / GDPR: With local inference on low-end hardware, no inference data leaves the device. For many SMEs and freelancers this is a technically straightforward way to avoid Art. 44 GDPR transfer risks. The EU AI Act (in force since August 2024, with its first obligations applying from February 2025) does not impose documentation requirements on personal-use inference. For German SMEs using local LLMs for internal business tasks, BSI-Grundschutz recommends local inference for sensitive document processing. Overall data protection compliance still depends on your full operational setup, not the inference architecture alone.

Japan: METI AI Governance Guidelines encourage data minimization. CPU inference on low-end hardware, while slow, satisfies the strictest data sovereignty requirements — no API calls, no logging, no third-party data access. For Japanese users running Qwen3 on CPU for Japanese-language tasks, throughput of 1–3 tok/sec is acceptable for non-time-critical document summarization.

China: Local inference on consumer hardware is common for Qwen3 and DeepSeek-R1 deployments in China, where cloud API access to non-Chinese models is restricted. Qwen3 1.7B and 4B run on CPU-only hardware, providing a functional alternative to cloud APIs for users with constrained hardware.

Common Questions About Running Local LLMs on Low-End PCs

What qualifies as a low-end PC for running local LLMs?

A low-end PC for local LLMs is any machine with less than 8GB of dedicated VRAM, or a CPU-only system. This includes most laptops with Intel Iris or AMD Radeon integrated graphics, desktop PCs with GTX 1060 or older GPUs, and Chromebooks. The key constraint is not the CPU speed but the memory available to hold model weights.

Can I run Mistral Small on a 4GB GPU?

At Q2 quantization, yes. At Q4, no (OOM crash). Q2 has acceptable quality loss (~5-10% lower MMLU score), but speed increases by 30%. This is a practical trade-off for users with limited VRAM.

Is CPU inference usable for chatbots?

Yes, for low-throughput async scenarios. At 3 tok/sec, a 100-token response takes about 33 seconds. That is too slow for interactive conversation but acceptable for overnight batch processing or non-real-time tasks like email drafting.

Should I use Phi-4 Mini or TinyLlama 1.1B on CPU?

Phi-4 Mini 3.8B is the better choice for CPU-only systems: it hits 15–25 tok/sec on 8 GB machines and produces significantly better output quality than TinyLlama. TinyLlama 1.1B Q5 is optimized for 4 GB VRAM (20–40 tok/sec), not for CPU-only inference.

How do I check if my GPU supports CUDA?

Run `nvidia-smi` in terminal. If it prints GPU info, you have CUDA support. If it returns "command not found" or "no NVIDIA GPU", check Intel/AMD documentation for integrated GPU drivers.

How does quantization affect inference speed?

Quantization primarily reduces memory bandwidth requirements, not computation. Q2 (2-bit) is about 30% faster than Q4 (4-bit) because the model loads fewer bytes per forward pass. However, Q2 carries a ~10% quality penalty. The practical rule: use Q4 as default, drop to Q2 only if you cannot fit the model in available VRAM at Q4.

Can I use quantization below Q2?

Technically yes (Q1), but quality degrades catastrophically — up to 30% loss in accuracy. Not recommended for any practical use case.

Is CPU + GPU hybrid inference supported?

Yes, via layer offloading. With llama.cpp you can use `--n-gpu-layers 10` to offload the first 10 layers to GPU while keeping the rest on CPU. This hybrid approach gives you speed closer to GPU on limited VRAM.
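
A rough way to choose the `--n-gpu-layers` value is to divide the usable VRAM by the average size of one layer. The sketch below assumes all layers are the same size and reserves 0.5 GB for buffers. Both are simplifications, and the example model size and layer count are hypothetical.

```python
import math


def gpu_layers_that_fit(
    model_gb: float,
    n_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 0.5,  # assumption: headroom for KV cache and buffers
) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, free_vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical 4.5 GB model with 32 layers on a GPU with 2 GB of free VRAM.
print(gpu_layers_that_fit(4.5, 32, 2.0))  # 10
```

If generation crashes or slows down at the estimated value, lower it by a few layers; the estimate does not account for context length.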

What is the fastest local LLM?

The fastest models are 1B–3B parameter models like Llama 3.2 3B, which can reach 15–40 tokens/sec on optimized modern CPUs and up to 40–60 tok/sec with GPU acceleration. Speed depends more on hardware than model choice — a 7B on GPU (25–40 tok/sec) outpaces a 3B on CPU (10–25 tok/sec).

Can I run a local LLM on 4 GB RAM?

Yes — 1B models run comfortably on 4 GB systems (1–1.3 GB per model + 2–3 GB for OS and headroom). Larger models require more: 3B needs 2–3 GB, 7B needs 5.5–8 GB at Q4. For 4 GB systems, Llama 3.2 1B or TinyLlama 1.1B are practical choices, but quality is limited.

Is GPU required for speed?

No, but GPUs significantly increase speed. CPU-only systems can reach 10–25 tok/sec for 3B models with optimization; GPUs reach 25–60 tok/sec. For CPU-only users, smaller models (1B–3B) are essential. GPU is required only if you need interactive speeds on 7B+ models.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider’s official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
