Key Takeaways
- CPU-only inference works well for 3–13B models on modern processors with 8–32 GB RAM.
- Best CPU models: Phi-4 Mini (3.8B, 2.3 GB, 12 tokens/sec), Gemma 3 2B (1.5 GB, 15 tokens/sec), Llama 3.2 3B (2 GB, 10 tokens/sec).
- CPU inference is 10–30× slower than GPU but uses zero dedicated VRAM.
- Ollama falls back to CPU automatically on systems without a supported GPU; in llama.cpp, force CPU-only mode with a single flag (`--n-gpu-layers 0`).
- CPU inference is a good fit for cost-constrained production APIs (no GPU to buy or manage), edge devices, and offline batch workloads.
Can CPUs Run LLMs?
Yes. Modern CPUs (Intel Core i7 10th gen or newer, AMD Ryzen 5000 or newer, Apple M-series) can run 3–13B models at roughly 8–15 tokens/second. That is 10–30× slower than a GPU, but it requires no dedicated VRAM: a CPU with sufficient system RAM (8–32 GB) can run models that would otherwise call for a $300+ GPU.
CPU inference trades speed for accessibility: you get zero GPU cost, excellent stability, and no driver issues. For light workloads (low-traffic chatbots, offline document processing), CPU-only is entirely practical.
Modern CPUs have AVX-512 or NEON/SVE vector instructions that accelerate matrix math. Tools like llama.cpp and Ollama automatically use these, making CPU inference much faster than naive implementations.
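Before trusting any published numbers, it helps to confirm what your own CPU exposes. Here is a minimal diagnostic sketch, assuming a Linux machine (the macOS command is shown as a comment); llama.cpp and Ollama detect these instruction sets automatically, so no configuration is needed either way:

```bash
# List the vector-instruction flags your CPU advertises (Linux).
# AVX2 or AVX-512 on x86, and NEON/SVE on ARM, make the biggest speed difference.
grep -o -w -E 'avx2|avx512f|avx512_vnni|sve' /proc/cpuinfo | sort -u

# On macOS / Apple Silicon, NEON is always present; identify the chip with:
#   sysctl -n machdep.cpu.brand_string
```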
Best CPU-Only Models 2026
The table below ranks models by performance on an Intel i7-12700 (12-core, AVX-512) in CPU-only mode; a quick way to check these numbers on your own hardware follows the table:
| Model | Params | GGUF Size | RAM Needed | CPU Speed | Best For |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | ~2.3 GB | 4 GB | 12 tok/sec | General chat, code assist |
| Gemma 3 2B | 2B | ~1.5 GB | 3 GB | 15 tok/sec | Fast responses, low-RAM machines |
| Llama 3.2 3B | 3B | ~2 GB | 3.5 GB | 10 tok/sec | Balanced quality/speed |
| Mistral 7B Q4 | 7B | ~4.5 GB | 6 GB | 5 tok/sec | Better quality, 16+ GB RAM |
| Llama 3.1 8B Q4 | 8B | ~5 GB | 7 GB | 4 tok/sec | Coding, logic tasks |
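To reproduce these throughput figures on your own machine, Ollama can print timing statistics after each response. A minimal sketch, assuming Ollama is installed and the model tag exists in the registry (tags change; check the Ollama model library for exact names):

```bash
# Pull a small model and print timing stats after the response.
ollama pull llama3.2:3b
ollama run llama3.2:3b "Summarize the benefits of CPU-only inference in two sentences." --verbose
# The "eval rate" line in the printed stats is the generation speed in tokens/sec.
```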
Speed Comparison: CPU vs GPU
Speed varies by hardware. These benchmarks were run on representative 2026 consumer hardware via Ollama or llama.cpp; a way to benchmark your own setup follows the table:
| Hardware | Model | Speed | Notes |
|---|---|---|---|
| Intel i7-12700 (CPU) | Phi-4 Mini 3.8B | 12 tokens/sec | AVX-512 enabled |
| AMD Ryzen 7 5700X (CPU) | Phi-4 Mini 3.8B | 9 tokens/sec | Older AVX2 only |
| Apple M3 (CPU) | Phi-4 Mini 3.8B | 14 tokens/sec | Unified memory advantage |
| RTX 3060 (GPU, 12 GB) | Phi-4 Mini 3.8B | 80 tokens/sec | GPU is 6.7× faster |
| RTX 4090 (GPU, 24 GB) | Llama 3.1 8B Q4 | 120 tokens/sec | GPU is 30× faster than CPU |
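llama.cpp ships a benchmarking tool, `llama-bench`, that makes CPU-vs-GPU comparisons on your own hardware straightforward. A minimal sketch, assuming a local llama.cpp build and a GGUF file at a placeholder path:

```bash
# CPU-only run: offload zero layers to the GPU, use 8 threads (adjust to your core count).
./llama-bench -m ./models/phi-4-mini-q4.gguf -t 8 -ngl 0

# GPU run for comparison (only meaningful on a GPU-enabled build with a device present):
./llama-bench -m ./models/phi-4-mini-q4.gguf -ngl 99
```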
RAM Requirements by Model
Rule of thumb: GGUF size + ~500 MB overhead = minimum RAM needed, so a 2 GB GGUF model needs 2.5–3 GB of free system RAM (a quick way to check this on your machine follows the table):
| Model | GGUF Size | Min RAM | Comfortable | Context Length |
|---|---|---|---|---|
| Gemma 3 2B | ~1.5 GB | 2–2.5 GB | 4 GB | 8K |
| Phi-4 Mini 3.8B | ~2.3 GB | 3 GB | 6 GB | 4K |
| Llama 3.2 3B | ~2 GB | 2.5–3 GB | 6 GB | 8K |
| Mistral 7B Q4 | ~4.5 GB | 5 GB | 8 GB | 32K |
| Llama 3.1 8B Q4 | ~5 GB | 6 GB | 12 GB | 128K |
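To sanity-check the rule of thumb before loading a model, compare the GGUF file size against the memory currently available. A minimal sketch, assuming Linux and a placeholder model path:

```bash
# GGUF file size (5th column of ls -lh) vs. currently available system memory.
ls -lh ./models/phi-4-mini-q4.gguf | awk '{print "GGUF size:     " $5}'
free -h | awk '/^Mem:/ {print "Available RAM: " $7}'
# Budget roughly the file size plus ~500 MB of headroom for the KV cache and runtime overhead.
```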
How to Run CPU-Only Mode
Ollama (simplest): run `ollama run phi:mini`. On systems without an NVIDIA/AMD GPU, Ollama detects this automatically and serves the model from system RAM. LM Studio: open Settings → select "None" under GPU to force CPU mode. llama.cpp: pass `--n-gpu-layers 0` to disable GPU offloading (see the sketch below).

```bash
ollama run phi:mini
# Ollama auto-detects CPU-only systems
```
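For llama.cpp, CPU-only mode just means keeping every layer on the CPU. A minimal sketch, assuming a local build and a placeholder model path (newer builds name the binary `llama-cli`; older ones use `main`):

```bash
# Keep all layers on the CPU by offloading zero layers to the GPU.
./llama-cli -m ./models/phi-4-mini-q4.gguf \
  --n-gpu-layers 0 \
  -p "Explain what a GGUF file is in one paragraph."
```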
Optimization Tips for CPU Inference
To squeeze maximum performance from CPU inference (a combined example follows the list):
- Use Q4_K_M quantization: it cuts GGUF size by roughly 70% versus FP16 with minimal quality loss, and the smaller weights typically add a 10–20% speed boost from better cache behavior.
- Reduce the context window: longer contexts mean slower inference. Cap it at 2K tokens with `--ctx-size 2048` in llama.cpp, or `/set parameter num_ctx 2048` inside an Ollama session.
- Enable multi-threading: Ollama and llama.cpp auto-detect the CPU core count; verify with `nproc` (Linux) that the thread count matches your cores.
- Use AVX-512 or ARM NEON: modern Intel/AMD/ARM CPUs have vector instructions. Check CPU flags with `grep avx512 /proc/cpuinfo` (Linux) or Apple menu → About This Mac → System Report (macOS).
- Keep batch size at 1: CPUs handle single-sequence inference best; don't attempt multi-batch serving on CPU.
- Pin threads to cores: on Linux, launch the inference process under `numactl --cpunodebind=0` (for example `numactl --cpunodebind=0 ollama serve`, since `ollama run` is only the client) to avoid core-switching and cross-node memory overhead.
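Several of these tips combine naturally into a single llama.cpp invocation. A minimal sketch, assuming Linux, `numactl` installed, and a placeholder model path; `--n-gpu-layers 0` keeps everything on the CPU, `--ctx-size 2048` caps the context, and `-t` sets the thread count (`nproc` reports logical cores; matching physical cores is often slightly faster):

```bash
# CPU-only llama.cpp run with the optimization flags from the list above.
numactl --cpunodebind=0 ./llama-cli \
  -m ./models/phi-4-mini-q4.gguf \
  --n-gpu-layers 0 \
  --ctx-size 2048 \
  -t "$(nproc)" \
  -p "Write a haiku about system RAM."
```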
When to Use CPU vs GPU Inference
| Use Case | CPU | GPU |
|---|---|---|
| Real-time chat (sub-1-sec latency) | ❌ Too slow (12 tok/sec = 5 sec for 60 tokens) | ✅ 80+ tok/sec |
| Batch processing (documents, logs) | ✅ Fine (latency doesn't matter) | ⚠️ Overkill |
| Production API (cost-constrained) | ✅ No added hardware cost | ⚠️ $200+ GPU + electricity |
| Edge device (Raspberry Pi) | ✅ Often the only option | ❌ Limited GPU options |
| Development / local testing | ✅ Lower power, quieter | ⚠️ Overkill |
| LLM fine-tuning | ❌ Too slow (hours stretch to days) | ✅ 10–30× speedup |
FAQ
How fast is CPU-only inference compared to a GPU?
CPU: 8–15 tokens/sec on modern processors. GPU (RTX 3060): ~80 tokens/sec. GPU (RTX 4090): 120+ tokens/sec. CPU is 10–30× slower but requires $0 GPU investment.
What's the smallest model that still produces coherent output on CPU?
Gemma 3 2B (1.5 GB) produces reasonable responses. Below 2B, quality drops. For best quality on 8 GB RAM, use Phi-4 Mini (3.8B) or Llama 3.2 3B (2 GB).
Can I run a 13B model on CPU?
Yes. With Q4_K_M quantization a 13B model is ~6.5 GB and needs 8–12 GB of system RAM. Expect ~2–3 tokens/sec: too slow for comfortable interactive use, but fine for batch processing.
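If you want to try it, Ollama hosts 13B quantizations. A minimal sketch; the tag is an example, and the default quantization for a given tag may differ from Q4_K_M, so check the model page first:

```bash
# Run a 13B model on CPU; expect a noticeable delay before and during generation.
ollama run llama2:13b "List three trade-offs of running a 13B model on a CPU."
```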
Does CPU inference use the GPU at all?
No. CPU-only mode in Ollama/llama.cpp explicitly disables GPU usage and uses system RAM exclusively.
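If you prefer to be explicit rather than rely on auto-detection, Ollama's REST API accepts a `num_gpu` option (the number of layers to offload to the GPU); setting it to 0 keeps every layer on the CPU. A minimal sketch, assuming a local Ollama server on the default port and an example model tag:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello from the CPU.",
  "options": { "num_gpu": 0 },
  "stream": false
}'
```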
Is CPU-only inference stable?
Yes, and generally more stable than GPU inference: no driver crashes, no GPU out-of-memory errors. The main risk is saturating system RAM, which you control through model choice.
Do I need to adjust settings for Apple Silicon CPUs?
No. Ollama auto-detects M1/M2/M3/M4 and uses unified memory efficiently. Apple Silicon is ~10–20% faster than equivalent Intel CPUs due to memory architecture.