Best Ollama Models for CPU Only?
Quick Answer
Without a GPU, Phi-4 Mini at Q4 is the best balance of quality and speed on CPU, delivering reasoning quality close to Llama 3 8B while needing only 4 GB RAM. Llama 3 8B Q4 works with 8+ GB RAM. Gemma 2B is the fastest CPU option.
- Phi-4 Mini Q4: best quality/speed on CPU, needs 4 GB RAM
- Llama 3 8B Q4: best quality, needs 8 GB RAM (slower)
- Gemma 2B: fastest CPU inference, 2 GB RAM
Updated: 2026-05
Key Takeaways
- CPU inference is 5–10× slower than GPU: expect 3–6 tok/s on a modern 8-core desktop CPU
- Phi-4 Mini Q4 is the best CPU-only pick: 4 GB RAM, ~5 tok/s, strong reasoning quality
- Gemma 2B is fastest on CPU (~6 tok/s) but has lower reasoning quality than Phi-4 Mini
- CPU inference is practical for batch jobs and single-query lookups; too slow for interactive chat
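To check these throughput figures on your own machine, one option is to read the timing fields Ollama returns with each response. The sketch below assumes a local Ollama server on its default port (11434) and a non-streaming `/api/generate` response that includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The helper names and the `phi4-mini` model tag are illustrative assumptions, not part of this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed: eval_count tokens divided by eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request to a local Ollama server, return parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example (needs a running Ollama server with the model already pulled):
# result = generate("phi4-mini", "Explain memory bandwidth in one paragraph.")
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Running the same prompt a few times and averaging gives a steadier number than a single request, since the first call also pays the cost of loading the model.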
The CPU Speed Reality
As of May 2026, CPU inference runs at 3–6 tokens per second on a modern 8-core desktop CPU, roughly 5–10× slower than a mid-range GPU. A 7B model at Q4 produces one word approximately every 200–300 milliseconds on CPU.
This speed is acceptable for two use cases: overnight batch processing such as summarizing documents or classifying data, and single-query lookups where a 30-second wait is acceptable. For interactive chat or real-time code completion, CPU inference is too slow to be practical.
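The wait time for a single answer follows from simple arithmetic on the throughput range. As a rough sketch, assuming a steady generation rate and ignoring prompt-processing time (both simplifications), the time to produce a response is its length in tokens divided by tokens per second. The 150-token answer length is a hypothetical value chosen for illustration.

```python
def response_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady rate (prompt processing ignored)."""
    return num_tokens / tokens_per_second


def ms_per_token(tokens_per_second: float) -> float:
    """Average gap between generated tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


# A 150-token answer (hypothetical length) at both ends of the 3-6 tok/s range
for rate in (3.0, 6.0):
    print(f"{rate:.0f} tok/s: {ms_per_token(rate):.0f} ms/token, "
          f"150 tokens in {response_seconds(150, rate):.0f} s")
```

At 3 tok/s a 150-token answer takes about 50 seconds; at 6 tok/s, about 25 seconds. That is tolerable for a one-off lookup and painful for a back-and-forth conversation.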
The root constraint is memory bandwidth, not CPU clock speed. Consumer CPUs read RAM at 40–80 GB/s. A dedicated GPU reads VRAM at 400–900 GB/s. LLM inference scales directly with memory bandwidth, which is why even a mid-range GPU produces dramatically faster inference than a high-end CPU.
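A back-of-the-envelope sketch of why bandwidth dominates: if generating each token requires streaming roughly the full set of model weights from memory once (a common simplification that ignores caches, KV-cache reads, and compute), then bandwidth divided by model size gives an upper bound on tokens per second. The 4.7 GB size for an 8B model at Q4 is an approximate assumption, not a figure from this guide.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.7  # assumed approximate size of an 8B model at Q4

for label, bandwidth in [("CPU, low", 40), ("CPU, high", 80),
                         ("GPU, low", 400), ("GPU, high", 900)]:
    bound = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{label:>9}: {bandwidth:>3} GB/s -> at most ~{bound:.0f} tok/s")
```

Under these assumptions the CPU ceiling is roughly 9–17 tok/s and the GPU ceiling roughly 85–191 tok/s. Real throughput lands below these ceilings, but the ratio between them tracks the bandwidth gap.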
Top 3 Models for CPU-Only Use
The right CPU-only model depends on whether you prioritize quality or speed. Phi-4 Mini Q4 is the best balance: it delivers reasoning quality close to Llama 3 8B while needing only 4 GB RAM and running noticeably faster.
Gemma 2B is the only viable option when RAM is limited to 2 GB. It reaches ~6 tok/s on CPU but produces noticeably lower-quality answers than Phi-4 Mini on multi-step reasoning tasks.
For the full breakdown of CPU-only configurations including RAM requirements and OS-level optimizations, see the best CPU-only LLM guide.
| Model | RAM Required | CPU Speed |
|---|---|---|
| Phi-4 Mini Q4 | 4 GB | ~4–5 tok/s |
| Llama 3 8B Q4 | 8 GB | ~3 tok/s |
| Gemma 2B | 2 GB | ~6 tok/s |
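If quality is the priority, the table can be read as a simple selection rule: pick the highest-quality model whose RAM requirement fits. A minimal sketch, using the RAM figures from the table and the quality ordering from the Quick Answer (Llama 3 8B, then Phi-4 Mini, then Gemma 2B). The function name and the `headroom_gb` allowance for the OS are assumptions for illustration.

```python
from typing import Optional

# (model, RAM required in GB), ordered from highest to lowest quality per the guide
MODELS = [
    ("Llama 3 8B Q4", 8),
    ("Phi-4 Mini Q4", 4),
    ("Gemma 2B", 2),
]


def pick_model(ram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality model that fits in RAM, or None if none fits.

    headroom_gb is a hypothetical allowance for the OS and other programs.
    """
    usable = ram_gb - headroom_gb
    for name, required in MODELS:
        if required <= usable:
            return name
    return None


print(pick_model(16))  # Llama 3 8B Q4
print(pick_model(6))   # Phi-4 Mini Q4
print(pick_model(1))   # None
```

If speed matters more than quality, the same rule applies with the list reversed, which puts Gemma 2B first.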
Related Guides
- Radeon 6800M for Local LLM: Full Setup Guide -- Radeon GPU guide
- Strix Halo + Ollama + Vulkan: Performance Guide -- Strix Halo guide
Quick Answers About CPU-Only LLMs
How much RAM do I need for CPU-only Ollama?
Why is CPU inference so much slower than GPU?
Can I use Ollama on a laptop without a dedicated GPU?
Which CPUs are fastest for local LLM inference?