PromptQuorum

Best Ollama Models for CPU Only?

Quick Answer

Without a GPU, Phi-4 Mini at Q4 offers the best balance of quality and speed on CPU. Llama 3 8B Q4 gives the best quality but needs 8+ GB RAM and runs slower. Gemma 2B is the fastest CPU option.

  • Phi-4 Mini Q4: best quality/speed on CPU, needs 4 GB RAM
  • Llama 3 8B Q4: best quality, needs 8 GB RAM (slower)
  • Gemma 2B: fastest CPU inference, 2 GB RAM

Updated: 2026-05

Ollama · Intermediate

Key Takeaways

  • CPU inference is 5–10× slower than GPU — expect 3–6 tok/s on a modern 8-core desktop CPU
  • Phi-4 Mini Q4 is the best CPU-only pick: 4 GB RAM, ~4–5 tok/s, strong reasoning quality
  • Gemma 2B is fastest on CPU (~6 tok/s) but has lower reasoning quality than Phi-4 Mini
  • CPU inference is practical for batch jobs and single-query lookups; too slow for interactive chat

The CPU Speed Reality

As of May 2026, CPU inference runs at 3–6 tokens per second on a modern 8-core desktop CPU, roughly 5–10× slower than a mid-range GPU. A 7B model at Q4 produces one token approximately every 200–300 milliseconds on CPU (about 3–5 tok/s).

This speed suits two use cases: overnight batch processing, such as summarizing documents or classifying data, and single-query lookups where a 30-second wait is acceptable. For interactive chat or real-time code completion, CPU inference is too slow to be practical.
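To make the speed figures above concrete, here is a minimal sketch that converts tokens per second into per-token delay and total wait time. The function names and the 150-token answer length are illustrative assumptions, and the estimate ignores the time spent processing the prompt.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average delay between generated tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def wait_seconds(response_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time for a response of the given length."""
    return response_tokens / tokens_per_second


if __name__ == "__main__":
    # The 3-6 tok/s range quoted for a modern 8-core desktop CPU.
    for speed in (3, 6):
        print(f"{speed} tok/s -> {ms_per_token(speed):.0f} ms per token")
    # 150 tokens is an assumed answer length, chosen only for illustration.
    print(f"150 tokens at 5 tok/s -> {wait_seconds(150, 5):.0f} s")
```

Under these assumptions, a 150-token answer at 5 tok/s takes about 30 seconds, which is why single-query lookups are workable while a back-and-forth chat is not.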

The root constraint is memory bandwidth, not CPU clock speed. Consumer CPUs read RAM at 40–80 GB/s. A dedicated GPU reads VRAM at 400–900 GB/s. LLM inference scales directly with memory bandwidth — which is why even a mid-range GPU produces dramatically faster inference than a high-end CPU.
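A rough way to see the bandwidth limit is to assume that generating each token requires reading approximately all model weights from memory once. The sketch below applies that rule of thumb. The 4 GB figure for a 7B model at Q4 is an assumption, and the result is a theoretical ceiling rather than a prediction, since real throughput lands well below it.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/second if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_size_gb = 4.0  # assumed size of a 7B model at Q4
    # Bandwidth ranges quoted in the text: CPU RAM vs. GPU VRAM.
    for label, bandwidth in (("CPU low", 40), ("CPU high", 80),
                             ("GPU low", 400), ("GPU high", 900)):
        ceiling = bandwidth_ceiling_tok_s(bandwidth, model_size_gb)
        print(f"{label}: {bandwidth} GB/s -> at most {ceiling:.0f} tok/s")
```

The ratio between the GPU and CPU ceilings tracks the ratio of their bandwidths, which is the point of the paragraph above: clock speed does not enter the estimate at all.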

Top 3 Models for CPU-Only Use

The right CPU-only model depends on whether you prioritize quality or speed. Phi-4 Mini Q4 is the best balance — it delivers reasoning quality close to Llama 3 8B while needing only 4 GB RAM and running noticeably faster.

Gemma 2B is the only viable option when RAM is limited to 2 GB. It reaches ~6 tok/s on CPU but produces noticeably lower-quality answers than Phi-4 Mini on multi-step reasoning tasks.

For the full breakdown of CPU-only configurations including RAM requirements and OS-level optimizations, see the best CPU-only LLM guide.

Model            RAM Required   CPU Speed
Phi-4 Mini Q4    4 GB           ~4–5 tok/s
Llama 3 8B Q4    8 GB           ~3 tok/s
Gemma 2B         2 GB           ~6 tok/s
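The table can be turned into a simple selection rule. The sketch below picks the highest-quality model that fits in the available RAM, using the quality order the article gives (Llama 3 8B, then Phi-4 Mini, then Gemma 2B). The function name and the "prefer quality" policy are illustrative assumptions, and the names are the display names from the table, not Ollama model tags.

```python
from typing import Optional

# (name, RAM required in GB, approximate CPU speed), taken from the table,
# ordered from highest to lowest answer quality.
MODELS = [
    ("Llama 3 8B Q4", 8, "~3 tok/s"),
    ("Phi-4 Mini Q4", 4, "~4–5 tok/s"),
    ("Gemma 2B", 2, "~6 tok/s"),
]


def pick_model(available_ram_gb: float) -> Optional[str]:
    """Return the highest-quality model whose RAM requirement fits, or None."""
    for name, ram_gb, _speed in MODELS:
        if available_ram_gb >= ram_gb:
            return name
    return None


if __name__ == "__main__":
    for ram in (1, 2, 6, 16):
        print(f"{ram} GB RAM -> {pick_model(ram)}")
```

If speed matters more than quality, reversing the list order gives the opposite policy: the fastest model that fits.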

Quick Answers About CPU-Only LLMs

How much RAM do I need for CPU-only Ollama?
Minimum 2 GB for Gemma 2B. 4 GB for Phi-4 Mini Q4. 8 GB for Llama 3 8B Q4. Add 1–2 GB on top of the model size for the OS and Ollama runtime overhead.

Why is CPU inference so much slower than GPU?
LLM inference is memory-bandwidth-bound. Consumer CPUs read RAM at 40–80 GB/s. A mid-range GPU reads VRAM at 400–900 GB/s. That 10× bandwidth difference translates directly into 5–10× slower token generation.

Can I use Ollama on a laptop without a dedicated GPU?
Yes. Ollama runs on CPU automatically when no GPU is detected. Expect 3–5 tok/s on a modern laptop CPU. See the best Ollama models right now for GPU-tier recommendations if you later upgrade.

Which CPUs are fastest for local LLM inference?
Apple M-series chips (M3, M4) use unified memory architecture and reach 15–30 tok/s on 7B models, far exceeding x86 CPUs on CPU-only inference. Among x86 CPUs, those with higher memory bandwidth and large L3 cache perform best.
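To check what your own machine achieves, one option is to ask a local Ollama server for a non-streamed completion and compute tokens per second from the `eval_count` and `eval_duration` (nanoseconds) fields in its response. This is a minimal sketch that assumes Ollama is running on its default local port. The function names are illustrative, and the model tag you pass must match one you have already pulled (as shown by `ollama list`).

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(result: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Run one non-streamed generation and return the measured tok/s."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return tokens_per_second(json.load(response))


# Example call (requires a running server; the tag is an assumption):
#   print(benchmark("phi4-mini", "Explain memory bandwidth in one sentence."))
```

Running the same prompt against each candidate model gives a direct comparison on your hardware, which is more reliable than any published range.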