Key Takeaways
- CPU only (no GPU): Phi-4 Mini 3.8B at 5–15 tok/sec. Best CPU option for chat and summaries.
- 4 GB VRAM: TinyLlama 1.1B Q5 at 20–40 tok/sec. Fast responses, simple tasks.
- 6 GB VRAM: Phi-4 Mini Q5 at 15–30 tok/sec. Lightweight coding and chat.
- 8 GB VRAM (sweet spot): Mistral 7B Q4 at 25–60 tok/sec. Smooth, full assistant experience.
- 16 GB+: 13B models Q4 at 20–50 tok/sec. Strong quality for demanding tasks.
- Speed ranking (fastest to slowest, by the ranges above): 8 GB GPU > 16 GB+ > 4 GB GPU > 6 GB GPU > CPU.
- Quality ranking: 13B > Mistral 7B = Llama 3.1 8B > Phi-4 Mini > TinyLlama 1B.
- Cost: All free (open source) vs. ChatGPT API (~$0.002 per 1K tokens).
What is the Fastest Model for Your Hardware?
Match your hardware to the right model: the wrong choice leaves 10–30× speed on the table.
| Your Hardware | Recommended Model | Expected Speed |
|---|---|---|
| CPU only (no GPU) | Phi-4 Mini Q4 | 5–15 tok/sec |
| 4 GB VRAM (quality) | TinyLlama 1B Q5 | 20–40 tok/sec |
| 4 GB VRAM (speed) | Gemma 3 2B Q5 | 30–50 tok/sec |
| 6 GB VRAM | Phi-4 Mini Q5 | 15–30 tok/sec |
| 8 GB VRAM | Mistral 7B Q4 | 25–60 tok/sec |
| 16 GB+ | 13B models Q4 | 20–50 tok/sec |
Which Model Should You Use?
Match your situation to the right model; this is the single most important decision:
- 8 GB RAM laptop (no discrete GPU): Mistral 7B Q4. Best balance of speed and quality for CPU-only inference.
- 16 GB RAM: Llama 3.1 8B Q5. Higher quality than Q4, fits comfortably with headroom.
- Very old PC (4 GB RAM or less): TinyLlama 1B Q5 or Phi-4 Mini Q4. Only viable options at this tier.
- Want max speed: 3B models (Phi-4 Mini, Llama 3.2 3B). 60–120 tok/sec on any modern GPU.
- Want quality: 7B Q5 (Mistral 7B Q5 or Llama 3.1 8B Q5). Best quality that fits under 8 GB VRAM.
Which Local LLM Should You Run on Your Hardware?
**Choose the largest model your VRAM can fit at Q4, then fall back to smaller quantization before switching to a smaller model. Quantization degrades quality less than a model size drop.**
| Hardware | Model | Quant | Speed | Experience |
|---|---|---|---|---|
| CPU only | Phi-4 Mini | Q4 | 5–15 t/s | slow but usable |
| 4 GB GPU | TinyLlama 1B | Q5 | 20–40 t/s | fast simple tasks |
| 6 GB GPU | Phi-4 Mini | Q5 | 15–30 t/s | decent |
| 8 GB GPU | Mistral 7B | Q4 | 25–60 t/s | smooth |
| 16 GB+ | 13B models | Q4 | 20–50 t/s | strong |
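As a rough illustration, the table above can be encoded as a small lookup. This is only a sketch: the names `Recommendation`, `TIERS`, and `recommend` are hypothetical, and treating VRAM amounts that fall between tiers as belonging to the lower tier is an assumption, not something the table states.

```python
from typing import NamedTuple, Optional, Tuple


class Recommendation(NamedTuple):
    model: str
    quant: str
    speed_tok_s: Tuple[int, int]  # (low, high) range copied from the table


# (minimum VRAM in GB, table row), highest tier first.
TIERS = [
    (16, Recommendation("13B models", "Q4", (20, 50))),
    (8, Recommendation("Mistral 7B", "Q4", (25, 60))),
    (6, Recommendation("Phi-4 Mini", "Q5", (15, 30))),
    (4, Recommendation("TinyLlama 1B", "Q5", (20, 40))),
]
CPU_ONLY = Recommendation("Phi-4 Mini", "Q4", (5, 15))


def recommend(vram_gb: Optional[float]) -> Recommendation:
    """Return the table row for a VRAM size; None or under 4 GB means CPU only.

    Assumption: a value between two tiers (for example 5 GB) maps to the
    lower tier, since the table only lists exact sizes.
    """
    if vram_gb is not None:
        for min_vram, row in TIERS:
            if vram_gb >= min_vram:
                return row
    return CPU_ONLY


for vram in (None, 4, 6, 8, 16):
    print(vram, recommend(vram))
```

Keeping the rows in one list makes it easy to adjust the tiers after benchmarking your own machine.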
GPU vs CPU for Local LLMs: Which Is Faster on Low-End Hardware?
GPU inference: 15-20 tok/sec on RTX 3060. Requires CUDA setup. Fast, best quality. See budget GPU guide for cost-effective options.
iGPU (integrated): 5-8 tok/sec on Intel Iris. No setup needed. Slower than discrete GPU.
CPU inference: 1-5 tok/sec on modern multi-core. Runs everywhere. Slowest.
Rule: If you have any GPU (even integrated), use it. CPU is last resort.
Why Smaller Models Are Faster on Low-End PCs
Model size directly determines speed. A 1B–3B model fits entirely in system RAM, allowing the CPU or GPU to stream data continuously. Larger models require memory swapping (moving data between RAM and disk), which slows generation by 10–100× (the bottleneck is disk I/O, not compute).
The hardware decision table above reflects this principle: TinyLlama 1.1B reaches 5–10 tok/sec even on old CPUs, while 13B+ models are impractical on low-end hardware because swapping dominates.
- 1B–3B models: Fit in 4–8 GB RAM → fastest generation → acceptable quality
- 7B models: Borderline on 8 GB systems → slower due to memory pressure → high quality
- 13B+ models: Require 16+ GB VRAM or swap heavily → too slow for interactive use
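The size argument above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes weights take roughly parameters × bits per weight ÷ 8 bytes, and it ignores the KV cache and runtime overhead, so real requirements are higher. The function names and the 2 GB reserve for the operating system are illustrative assumptions.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Lower bound only: KV cache, activations, and runtime buffers are ignored.
    """
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float, reserved_gb: float = 2.0) -> bool:
    """True if the weights fit after reserving memory for the OS.

    reserved_gb is an assumed placeholder; tune it for your system.
    """
    return weight_size_gb(params_billion, bits_per_weight) <= memory_gb - reserved_gb


for name, params, bits in [("1.1B Q5", 1.1, 5), ("7B Q4", 7, 4), ("13B Q4", 13, 4)]:
    size = weight_size_gb(params, bits)
    print(f"{name}: ~{size:.1f} GB of weights, fits in 8 GB: {fits_in_memory(params, bits, 8)}")
```

Under these assumptions a 13B model at 4 bits already needs about 6.5 GB for weights alone, which is why it spills into swap on an 8 GB machine.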
How Fast Are Local LLMs on Low-End PCs?
On CPU-only systems, expect:
- 3B models: 15–40 tokens/sec (older CPUs: 10–15, newer CPUs with optimization: 30–40)
- 7B models: 10–25 tokens/sec (depends on CPU cores and quantization; with aggressive optimization some reach 30+)

This is slower than cloud APIs (ChatGPT 4o: 80–150 tok/sec) but sufficient for interactive use. A 3B model at 25 tok/sec generates a 500-token response in 20 seconds, which is acceptable for non-time-critical tasks like code review, summarization, and creative writing.
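The 500-token example is plain division: response length over generation speed. A minimal sketch follows, with a hypothetical helper name, that ignores prompt-processing time and model load time.

```python
def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    """Time to generate `tokens` at a steady rate (prompt processing ignored)."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens / tok_per_sec


# The article's example: 500 tokens at 25 tok/sec.
print(f"{generation_seconds(500, 25):.0f} s")

# The same response at a few other speeds, for comparison.
for rate in (3, 10, 40):
    print(f"{rate} tok/sec: {generation_seconds(500, rate):.0f} s")
```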
How Does Quantization Affect Speed on Low-End PCs?
Q4 (4-bit): ~1% quality loss, 50% VRAM savings. Standard choice. For details on all quantization levels and how they work, see the full guide.
Q3 (3-bit): ~3% quality loss, 62% VRAM savings. Acceptable for chat.
Q2 (2-bit): ~10% quality loss, 75% VRAM savings. Risky; use only if OOM.
Speed impact: Q2 is ~30% faster than Q4 because it moves fewer bytes through memory, not because it does less computation.
Strategy: Quantize larger models (Mistral 7B Q2) rather than use tiny models (TinyLlama).
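Mistral 7B Q2 generally beats TinyLlama 1.1B Q4 on quality, though not on speed: its weights are still several times larger, and generation speed tracks the bytes read per token.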
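The savings percentages listed above line up with a comparison against an 8-bit baseline. That baseline is an assumption, since the text does not name it, but the arithmetic can be checked with a short sketch (the function name is hypothetical):

```python
def savings_vs_8bit(bits: int) -> float:
    """Fraction of weight memory saved relative to 8 bits per weight.

    Assumption: the article's percentages use an 8-bit baseline.
    """
    return 1 - bits / 8


for bits in (4, 3, 2):
    print(f"Q{bits}: {savings_vs_8bit(bits):.1%} smaller than 8-bit")
```

Under this assumption Q3 works out to 62.5%, which the list above gives as 62%.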
Faster models trade quality for speed, but tuning temperature and top-p recovers much of that quality loss. Lower temperature (0.1–0.3) on fast models produces more consistent output than default settings. See temperature and top-p explained for the exact settings.
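One way to apply a lower temperature is through the `options` field of an Ollama generate request. The sketch below only builds the JSON body and does not send it. The helper name is hypothetical, the `top_p` value of 0.9 is a placeholder rather than a recommendation from this article, and the `phi4-mini` tag is the one named under Common Mistakes.

```python
import json


def build_generate_request(model: str, prompt: str,
                           temperature: float = 0.2, top_p: float = 0.9) -> str:
    """Build a JSON body for Ollama's /api/generate endpoint.

    temperature defaults to a value inside the 0.1-0.3 range suggested above;
    top_p is an assumed placeholder.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }
    return json.dumps(payload)


print(build_generate_request("phi4-mini", "Summarize this paragraph in one sentence."))
```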
How Do You Speed Up CPU-Only Inference?
- Enable AVX-512: If the CPU supports it, use `LLAMACPP_AVX512=1 ollama run phi4-mini` (the bare `phi` tag loads Phi-2, not Phi-4 Mini). ~20% speedup.
- Reduce context window: Shorter context = faster. Use `--ctx-size 1024` instead of 4096.
- **Use llama.cpp instead of Ollama:** Slightly faster on CPU (~10% gain) due to less overhead.
- Disable multithreading: Counter-intuitive, but on weak CPUs a single thread can be faster because it avoids thread-scheduling overhead.
- Offload to iGPU: Even a weak integrated GPU beats the CPU. On Linux, run `lspci` to check which GPU is available.
How Fast Are These Models? Real Benchmarks (April 2026)
Real measurements by hardware tier, April 2026. All running Ollama with default settings, no tuning:
- CPU only (Ryzen 7 7700X) + Phi-4 Mini Q4: 5–15 tok/sec.
- 4 GB VRAM (GTX 1650) + TinyLlama 1B Q5: 20–40 tok/sec.
- 6 GB VRAM (RTX 2060) + Phi-4 Mini Q5: 15–30 tok/sec.
- 8 GB VRAM (RTX 3060) + Mistral 7B Q4: 25–60 tok/sec.
- 16 GB+ (RTX 3080 / 4070) + 13B models Q4: 20–50 tok/sec. For long documents, try Llama 4 Scout 8B (10M context window, released March 2026) with `ollama run llama4:8b`.
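To reproduce numbers like these on your own machine, a small script can query a local Ollama server and compute tokens per second. This is a sketch, not the method used for the figures above. It assumes Ollama is listening on its default port 11434 and that the non-streaming `/api/generate` response carries `eval_count` and `eval_duration` (in nanoseconds), as Ollama's API documentation describes. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response (eval_duration is in nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def benchmark(model: str, prompt: str, url: str = OLLAMA_URL) -> float:
    """Send one non-streaming request and return the measured tok/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

With the server running, `benchmark("phi4-mini", "Explain quantization in two sentences.")` returns a single measurement. Averaging several runs gives a steadier figure, since the first call also pays for loading the model.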
What is Actually "Fast" for Local LLMs?
Speed feels different depending on the task. Use this as your reference:
- Below 10 tok/sec: feels broken. Words appear one at a time with noticeable pauses. Unusable for interactive chat.
- 15–25 tok/sec: acceptable. Readable speed for most users. Good for Q&A, summaries, and coding help.
- 30+ tok/sec: smooth. Feels like a real assistant. Comfortable for all interactive tasks.
- 60+ tok/sec: instant. Faster than you can read. Ideal for real-time autocomplete and rapid iteration.

If your model runs below 15 tok/sec, downgrade model size (7B → 3B) or drop one quantization level (Q5 → Q4) before buying new hardware.
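The bands above can be turned into a small classifier, for example to label benchmark results. The list leaves gaps at 10–15 and 25–30 tok/sec. In this sketch the "borderline" label for 10–15 and the choice to count 25–30 as acceptable are assumptions, and the function name is hypothetical.

```python
def speed_feel(tok_per_sec: float) -> str:
    """Map a generation speed to the article's experience bands."""
    if tok_per_sec >= 60:
        return "instant"
    if tok_per_sec >= 30:
        return "smooth"
    if tok_per_sec >= 15:
        return "acceptable"  # assumption: also covers the unlisted 25-30 range
    if tok_per_sec >= 10:
        return "borderline"  # assumption: the article does not label 10-15
    return "feels broken"


for rate in (5, 12, 20, 45, 80):
    print(f"{rate} tok/sec: {speed_feel(rate)}")
```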
What to Avoid on Low-End PCs
- Do not run 13B+ models: they exceed RAM limits. A 13B model at Q4 requires 8–10 GB VRAM, pushing beyond practical low-end PC capacity. Even with aggressive Q2 quantization, 13B models require 5–6 GB, leaving insufficient headroom for OS and GPU scheduling overhead. Stick to 7B and below.
- Avoid Q8 quantization: it is slower, with minimal quality gain. Q8 uses roughly 1.5× the VRAM of Q4 (8 GB vs 5.5 GB for Mistral 7B) while delivering only ~2% quality improvement. For 4 GB systems, Q8 is impractical; for 8 GB systems, Q4 remains optimal. Q3 is the only trade-off worth considering when Q4 OOMs.
- Do not expect real-time autocomplete performance. At 3 tok/sec on CPU, generating 50 tokens takes about 17 seconds. Interactive autocomplete requires ≥20 tok/sec. Local LLMs on low-end CPUs work for batch chat, drafting, and review, not live autocomplete or code-as-you-type scenarios.
- Do not use CPU-only inference for production chatbots. Acceptable for internal tools, prototypes, and offline batch work. Cloud APIs (15–20 ms latency) outperform low-end CPU (300+ ms latency) for user-facing services. Use local inference for privacy-critical or offline scenarios, not speed-critical ones.
Common Mistakes
- Mistake: Using TinyLlama on CPU for better speed. Problem: TinyLlama belongs on 4 GB VRAM, not CPU; Phi-4 Mini 3.8B is faster and far better on CPU-only hardware. Fix: Run Phi-4 Mini 3.8B on CPU; keep TinyLlama Q5 for 4 GB VRAM.
- Mistake: Not enabling CPU acceleration flags. Problem: Leaving AVX/NEON disabled forfeits a ~20% speedup that costs nothing. Fix: Set `LLAMACPP_AVX512=1` or `LLAMACPP_NEON=1` before running Ollama.
- Mistake: Quantizing to Q2 to force 7B into 4GB. Problem: Q2 quantization often causes out-of-memory crashes due to KV cache overhead during inference. Fix: Use a 3B model at Q4 instead.
- Mistake: Assuming newer hardware always means faster inference. Problem: Desktop Ryzen is not faster per-token than mobile ARM because desktop software lacks memory optimization. Fix: Benchmark your actual hardware.
- Mistake: Using the wrong Ollama slug for your model. Problem: `ollama run phi` loads Phi-2, not Phi-4 Mini. Fix: Use `ollama run phi4-mini` for the latest Phi model. Always check ollama.com/library for exact model tags.
Local LLMs on Low-End PCs: Regional Context
EU / GDPR: Running local LLMs on low-end hardware is the most GDPR-compliant deployment pattern for individuals and small businesses, because no data leaves the device. The EU AI Act (effective February 2025) does not impose documentation requirements on personal-use inference. For German SMEs using local LLMs for internal business tasks, BSI-Grundschutz recommends local inference for sensitive document processing.
Japan: METI AI Governance Guidelines encourage data minimization. CPU inference on low-end hardware, while slow, satisfies the strictest data sovereignty requirements: no API calls, no logging, no third-party data access. For Japanese users running Qwen2.5 on CPU for Japanese-language tasks, throughput of 1–3 tok/sec is acceptable for non-time-critical document summarization.
China: Local inference on consumer hardware is common for Qwen2.5 and DeepSeek-R1 deployments in China, where cloud API access to non-Chinese models is restricted. Qwen2.5 1.5B and 3B run on CPU-only hardware, providing a functional alternative to cloud APIs for users with constrained hardware.
Common Questions About Running Local LLMs on Low-End PCs
What qualifies as a low-end PC for running local LLMs?
A low-end PC for local LLMs is any machine with less than 8GB of dedicated VRAM, or a CPU-only system. This includes most laptops with Intel Iris or AMD Radeon integrated graphics, desktop PCs with GTX 1060 or older GPUs, and Chromebooks. The key constraint is not the CPU speed but the memory available to hold model weights.
Can I run Mistral 7B on a 4GB GPU?
At Q2 quantization, yes. At Q4, no (OOM crash). Q2 has acceptable quality loss (~5-10% lower MMLU score), but speed increases by 30%. This is a practical trade-off for users with limited VRAM.
Is CPU inference usable for chatbots?
Yes, for low-throughput async scenarios. At 3 tok/sec, a 100-token response takes ~33 seconds. This is unusable for interactive conversation but acceptable for overnight batch processing or non-real-time tasks like email drafting.
Should I use Phi-4 Mini or TinyLlama 1.1B on CPU?
Phi-4 Mini 3.8B is the better choice for CPU-only systems: it hits 5–15 tok/sec and produces significantly better output quality than TinyLlama. TinyLlama 1.1B Q5 is optimized for 4 GB VRAM (20–40 tok/sec), not for CPU-only inference.
How do I check if my GPU supports CUDA?
Run `nvidia-smi` in terminal. If it prints GPU info, you have CUDA support. If it returns "command not found" or "no NVIDIA GPU", check Intel/AMD documentation for integrated GPU drivers.
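The same check can be scripted, for instance in a setup helper that decides whether to attempt GPU inference. This is a minimal sketch with a hypothetical function name. It treats a missing `nvidia-smi` binary or a non-zero exit code as "no usable NVIDIA GPU", which is a simplifying assumption.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA GPU detected" if has_nvidia_gpu() else "No NVIDIA GPU detected")
```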
How does quantization affect inference speed?
Quantization primarily reduces memory bandwidth requirements, not computation. Q2 (2-bit) is about 30% faster than Q4 (4-bit) because the model loads fewer bytes per forward pass. However, Q2 carries a ~10% quality penalty. The practical rule: use Q4 as default, drop to Q2 only if you cannot fit the model in available VRAM at Q4.
Can I use quantization below Q2?
Technically yes (Q1), but quality degrades catastrophically, with up to 30% loss in accuracy. Not recommended for any practical use case.
Is CPU + GPU hybrid inference supported?
Yes, via layer offloading. With llama.cpp you can use `--n-gpu-layers 10` to offload the first 10 layers to GPU while keeping the rest on CPU. This hybrid approach gives you speed closer to GPU on limited VRAM.
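As an illustration, a launcher script could assemble the llama.cpp command with the offload and context flags mentioned in this article. In this sketch the binary name `llama-cli` and the path `model.gguf` are assumptions (the binary name differs across llama.cpp versions), and the helper only builds the argument list without running it.

```python
import shlex
from typing import List


def llama_cpp_command(model_path: str, gpu_layers: int = 10,
                      ctx_size: int = 1024, binary: str = "llama-cli") -> List[str]:
    """Build an argument list for llama.cpp with partial GPU offload.

    gpu_layers: layers to place on the GPU; the rest stay on the CPU.
    ctx_size:   context window in tokens (smaller uses less memory).
    """
    if gpu_layers < 0 or ctx_size <= 0:
        raise ValueError("gpu_layers must be >= 0 and ctx_size must be > 0")
    return [
        binary,
        "-m", model_path,
        "--n-gpu-layers", str(gpu_layers),
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical model path; pass the list to subprocess.run to launch it.
print(shlex.join(llama_cpp_command("model.gguf")))
```

Raising `gpu_layers` step by step until VRAM runs out is a simple way to find the largest offload a given card can hold.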
What is the fastest local LLM?
The fastest models are 1B–3B parameter models like Llama 3.2 3B, which can reach 15–40 tokens/sec on optimized modern CPUs and up to 40–60 tok/sec with GPU acceleration. Speed depends more on hardware than model choice: a 7B on GPU (25–40 tok/sec) outpaces a 3B on CPU (10–25 tok/sec).
Can I run a local LLM on 4 GB RAM?
Yes. 1B models run comfortably on 4 GB systems (1–1.3 GB per model + 2–3 GB for OS and headroom). Larger models require more: 3B needs 2–3 GB, 7B needs 5.5–8 GB at Q4. For 4 GB systems, Llama 3.2 1B or TinyLlama 1.1B are practical choices, but quality is limited.
Is GPU required for speed?
No, but GPUs significantly increase speed. CPU-only systems can reach 10–25 tok/sec for 3B models with optimization; GPUs reach 25–60 tok/sec. For CPU-only users, smaller models (1B–3B) are essential. GPU is required only if you need interactive speeds on 7B+ models.
Sources
- Phi-4 Mini Model Card, Microsoft Research. 68% MMLU, 70% HumanEval. Released 2025.
- Gemma 3 Model Card, Google DeepMind. Gemma 3 2B with 128K context window. Released 2025.
- Llama 4 Scout 8B, Meta. 10M context window, released March 2026.
- TinyLlama 1.1B Repository, TinyLlama project (Zhang et al.). Training completed 2024. Model stable, no longer receiving updates. Still recommended for 4 GB VRAM tier.
- llama.cpp CPU Optimization Guide. CPU acceleration flags including AVX-512, NEON, and thread configuration.