Key points
- Best reasoning at small scale: Phi-4 Mini 3.8B (68% MMLU, 70% HumanEval, runs on 4 GB RAM).
- Fastest on CPU: Gemma 2 2B (40–60 tok/sec on any modern laptop, 1.7 GB RAM).
- Best small coding model: Qwen2.5 3B (65% HumanEval at ~2 GB RAM).
- Best general-purpose 3B: Llama 3.2 3B (most community support, 128K context, 2.5 GB RAM).
- As of April 2026, no sub-2B model produces output quality suitable for professional tasks. Use 3B+ for real work.
What Is a "Small" Local LLM and When Should You Use One?
A small local LLM is typically defined as a model with fewer than 4 billion parameters. At Q4_K_M quantization, these models require 1.5–3 GB of RAM, well within the constraints of entry-level laptops with 4–8 GB total memory.
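As a back-of-envelope check on those figures, RAM use can be estimated from parameter count. The bits-per-weight and overhead values below are assumptions (Q4_K_M mixes 4-bit and 6-bit blocks, so the effective rate is a bit under 5 bits per weight), not exact llama.cpp numbers:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4.8,
                    overhead_gb: float = 0.5) -> float:
    """Rough RAM estimate for a quantized model.

    bits_per_weight ~4.8 approximates Q4_K_M's mixed 4/6-bit layout;
    overhead_gb covers the KV cache and runtime buffers (assumed value).
    """
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

# 3B-class models land in the 2-2.5 GB range cited in this guide
print(estimate_ram_gb(3.0))   # ~2.3
print(estimate_ram_gb(3.8))   # ~2.8
```

For the models in this guide, the estimate lands within a few hundred MB of the table figures; actual usage also depends on context length, since the KV cache grows with it.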
As of April 2026, small models are appropriate for: quick summarization, simple Q&A, code snippet explanation, translation of short texts, and classification tasks. They are not suitable for multi-step reasoning, complex code generation, or writing long-form coherent documents.
The quality gap between a 3B and 7B model is significant, roughly equivalent to the gap between GPT-3.5 Mini and GPT-3.5 Turbo. For users with 8 GB RAM, a 7B model at Q4_K_M is almost always the better choice if the machine has headroom. See Best Beginner Local LLM Models for 7B recommendations.
Phi-4 Mini 3.8B: Best Reasoning Performance in the Sub-4B Class
Microsoft Phi-4 Mini achieves 68% on MMLU and 70% on HumanEval, scores that exceed many 7B models released before 2025. This is possible because Phi-4 Mini was trained on a curated synthetic dataset focused on reasoning and problem-solving, rather than broad web text.
As of April 2026, Phi-4 Mini is the recommended choice for users who primarily need reasoning (math, logic, step-by-step explanations) or coding assistance on hardware with 4β6 GB RAM.
| Spec | Value |
|---|---|
| MMLU | 68% |
| HumanEval | 70% |
| RAM (Q4_K_M) | ~2.5 GB |
| Context | 128K tokens |
| CPU speed | 30–50 tok/sec |
| Ollama command | ollama run phi4-mini |
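Beyond the interactive `ollama run` command, Ollama exposes a local REST API on port 11434. Here is a minimal sketch of calling Phi-4 Mini from Python's standard library; the prompt is illustrative, and since sending the request requires a running Ollama server, that step is left commented out:

```python
import json
import urllib.request

# Request body for Ollama's local REST API (POST /api/generate).
payload = {
    "model": "phi4-mini",          # same name as `ollama run phi4-mini`
    "prompt": "Explain step by step: why is 17 prime?",
    "stream": False,               # return one JSON object instead of a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Requires a running Ollama server; uncomment to send:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The same payload shape works for any model in this guide; only the `model` field changes.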
Gemma 2 2B: Fastest Small Local LLM on CPU
Google Gemma 2 2B generates 40–60 tokens/sec on a modern laptop CPU, the fastest of any model at this quality tier. Its 1.7 GB RAM footprint leaves ample memory for the OS and other applications on a 4 GB machine.
Quality is lower than Phi-4 Mini or Llama 3.2 3B on reasoning tasks. The 8K context window (vs. 128K on Phi-4 Mini and Llama 3.2) is a practical limitation for longer documents. Gemma 2 2B is the right choice when response speed matters more than output depth.
| Spec | Value |
|---|---|
| MMLU | 52% |
| RAM (Q4_K_M) | ~1.7 GB |
| Context | 8K tokens |
| CPU speed | 40–60 tok/sec |
| Ollama command | ollama run gemma2:2b |
Qwen2.5 3B: Best Small Model for Coding Tasks
Qwen2.5 3B scores 65% on HumanEval, 5 percentage points above Llama 3.2 3B, making it the best choice for coding tasks at the 3B scale. It includes JSON mode and function calling support, and natively handles 29 languages.
For non-coding tasks in English, Llama 3.2 3B and Phi-4 Mini produce more natural prose. Choose Qwen2.5 3B specifically when coding or multilingual output is the primary use case.
| Spec | Value |
|---|---|
| HumanEval | 65% |
| RAM (Q4_K_M) | ~2 GB |
| Context | 128K tokens |
| CPU speed | 25–40 tok/sec |
| Ollama command | ollama run qwen2.5:3b |
Llama 3.2 3B: Best General-Purpose Small Model
Meta Llama 3.2 3B is the most widely documented and community-supported 3B model. It scores 58% on MMLU and 60% on HumanEval, slightly below Phi-4 Mini on both, but it has the widest tool support, the most fine-tunes available, and the largest collection of community guides.
The 128K context window is the same as larger Llama 3.x models, making it suitable for summarizing medium-length documents. For a first small model, Llama 3.2 3B remains the safest choice due to predictable behavior and extensive documentation.
| Spec | Value |
|---|---|
| MMLU | 58% |
| RAM (Q4_K_M) | ~2.5 GB |
| Context | 128K tokens |
| CPU speed | 25–45 tok/sec |
| Ollama command | ollama run llama3.2:3b |
Llama 3.2 1B: Absolute Minimum for Any Useful Output
Llama 3.2 1B requires only 1.3 GB of RAM and generates 60–90 tok/sec on CPU, making it the fastest locally runnable model. Output quality is marginal: it handles very simple classification and keyword extraction but struggles with coherent multi-sentence responses. As of April 2026, use Llama 3.2 1B only when RAM is genuinely the binding constraint (under 3 GB available) or for testing tool integrations.
Full Comparison: Best Small Local LLMs Under 4B Parameters
| Model | MMLU | HumanEval | RAM | Context | Best For |
|---|---|---|---|---|---|
| Phi-4 Mini 3.8B | 68% | 70% | 2.5 GB | 128K | Reasoning, coding |
| Qwen2.5 3B | 62% | 65% | 2 GB | 128K | Coding, multilingual |
| Llama 3.2 3B | 58% | 60% | 2.5 GB | 128K | General use, first model |
| Gemma 2 2B | 52% | 38% | 1.7 GB | 8K | Speed, very low RAM |
| Llama 3.2 1B | 32% | 28% | 1.3 GB | 128K | Absolute minimum RAM |
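The comparison table can be collapsed into a simple selection helper. The RAM thresholds and model choices below are illustrative, drawn from the table rather than from any official sizing guide:

```python
def pick_small_model(free_ram_gb: float, use_case: str = "general") -> str:
    """Suggest a sub-4B Ollama model tag from the comparison table.

    free_ram_gb is memory available to the model (leave ~1 GB of
    headroom for the OS); thresholds are illustrative.
    """
    if free_ram_gb < 1.5:
        return "none (even Llama 3.2 1B needs ~1.3 GB plus headroom)"
    if free_ram_gb < 2.0:
        return "llama3.2:1b"
    if free_ram_gb < 2.5:
        # 1.7-2 GB class: trade speed (Gemma) against quality (Qwen)
        return "gemma2:2b" if use_case == "speed" else "qwen2.5:3b"
    # 2.5 GB and up fits the 3B/3.8B class
    if use_case in ("reasoning", "coding"):
        return "phi4-mini"
    return "llama3.2:3b"

print(pick_small_model(4.0, "coding"))   # phi4-mini
print(pick_small_model(1.8))             # llama3.2:1b
```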
What Are the Common Mistakes When Running Small Local LLMs?
Using Q8_0 quantization instead of Q4_K_M
Q8_0 requires nearly double the RAM of Q4_K_M for minimal quality improvement at small scale. A Llama 3.2 3B model at Q8_0 needs ~3.8 GB RAM vs. ~2.5 GB for Q4_K_M. On a 4 GB machine, Q8_0 may trigger swap usage and make inference 3–5× slower. Always use Q4_K_M as the default for sub-4B models.
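The swap risk is easy to estimate: if the quantized model plus a typical OS footprint exceeds physical RAM, inference will spill into swap. A minimal check, assuming ~1.5 GB for the OS and background applications:

```python
def will_swap(model_ram_gb: float, total_ram_gb: float,
              os_overhead_gb: float = 1.5) -> bool:
    """True if the model plus an assumed OS footprint exceeds total RAM."""
    return model_ram_gb + os_overhead_gb > total_ram_gb

# Llama 3.2 3B on a 4 GB machine, using the figures quoted above:
print(will_swap(2.5, 4.0))  # Q4_K_M: False (fits, barely)
print(will_swap(3.8, 4.0))  # Q8_0:   True  (expect heavy swapping)
```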
Running a base model instead of the instruct variant
Base models (e.g., `llama3.2:3b-text`) are pre-fine-tuning checkpoints trained to predict the next token in text. They do not follow instructions. When you ask a base model "What is 2+2?", it may complete the sentence as a quiz rather than answer "4". Always use the instruct variant: `llama3.2:3b` (Ollama defaults to instruct for named models).
Expecting 7B model quality from a 3B model
A 3B model at 68% MMLU (Phi-4 Mini) performs similarly to a 2023-era GPT-3.5 Mini on general tasks. Complex reasoning chains, long-form writing, and nuanced code generation will produce noticeably lower quality than a 7B model. If output quality is insufficient, upgrade to a 7B model; the RAM difference is only ~2 GB (2.5 GB → 4.5 GB).
Common Questions About Small Local LLM Models
What is the smallest local LLM that produces useful output?
As of April 2026, the practical minimum for useful output is a 3B model at Q4_K_M quantization. Models at or below 2B parameters (Llama 3.2 1B, Gemma 2 2B) produce coherent single sentences but struggle with multi-step instructions, longer responses, and complex reasoning. For tasks like summarization and simple Q&A, Gemma 2 2B is usable. For anything more complex, start with a 3B model.
Can a 3B model run on a phone?
Yes: Llama 3.2 1B and 3B are specifically designed for on-device mobile deployment. Meta provides optimized builds for iOS (via MLC LLM) and Android. Inference on a modern phone (Snapdragon 8 Gen 3 or Apple A17 Pro) produces 15–30 tok/sec for 1B models. LM Studio and Ollama do not currently run on iOS or Android; mobile requires separate frameworks.
Are small models good for summarization?
Yes: summarization is one of the strongest use cases for small models. Gemma 2 2B and Llama 3.2 3B reliably produce accurate summaries of texts up to ~4,000 words (their practical context limit for quality output). For longer documents, use a model with a large context window like Phi-4 Mini or Llama 3.2 3B (both 128K tokens).
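A quick way to check whether a document fits a context window is to convert words to tokens with the common rule of thumb of roughly 1.3 tokens per English word (the exact ratio depends on the tokenizer), while reserving some room for the summary itself:

```python
def fits_context(word_count: int, context_tokens: int,
                 tokens_per_word: float = 1.3,
                 reserve_for_output: int = 1024) -> bool:
    """Rough check: does the input leave room for the summary itself?

    tokens_per_word ~1.3 is a rule of thumb for English text, not an
    exact figure for any particular tokenizer.
    """
    return word_count * tokens_per_word + reserve_for_output <= context_tokens

print(fits_context(4_000, 8_192))     # fits Gemma 2 2B's 8K window
print(fits_context(20_000, 8_192))    # too long for 8K
print(fits_context(20_000, 131_072))  # fine at 128K
```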
How much faster is a 2B model than a 7B model on the same hardware?
Approximately 2–3× faster on CPU. Gemma 2 2B generates 40–60 tok/sec vs. 10–20 tok/sec for Mistral 7B on the same laptop CPU. On a GPU, the speed advantage narrows because GPU throughput is less constrained by model size. The speed difference is most noticeable on CPU-only machines.
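In wall-clock terms the difference is easy to feel. Using the mid-range speeds quoted above for a 500-token answer:

```python
def gen_seconds(n_tokens: int, tok_per_sec: float) -> float:
    """Time to generate n_tokens at a steady tokens-per-second rate."""
    return n_tokens / tok_per_sec

# 500-token answer at mid-range CPU speeds from the text above:
print(round(gen_seconds(500, 50), 1))  # Gemma 2 2B at 50 tok/sec: 10.0 s
print(round(gen_seconds(500, 15), 1))  # Mistral 7B at 15 tok/sec: 33.3 s
```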
Do small models support function calling?
Some do. Qwen2.5 3B supports function calling and JSON mode. Llama 3.2 3B has basic tool use support. Gemma 2 2B does not support function calling. Check the model's documentation before building a pipeline that depends on structured output.
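As a sketch of what a structured-output request looks like, here is a request body for Ollama's /api/chat endpoint using its JSON mode with Qwen2.5 3B. The prompt is illustrative; confirm that your Ollama version supports the `format` field before building on it:

```python
import json

# Chat request asking Qwen2.5 3B for strictly valid JSON output
# via Ollama's `format: "json"` switch.
payload = {
    "model": "qwen2.5:3b",
    "messages": [
        {"role": "user",
         "content": "Extract the city and year from: 'Founded in Turin in 1899.' "
                    "Reply as JSON with keys city and year."},
    ],
    "format": "json",   # constrains the model to emit valid JSON
    "stream": False,
}

body = json.dumps(payload)  # POST this to http://localhost:11434/api/chat
```

Even with JSON mode enabled, validate the response against your expected schema; small models occasionally return valid JSON with missing keys.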
Which small model is best for languages other than English?
Qwen2.5 3B supports 29 languages natively including Chinese, Japanese, Korean, and Arabic. Gemma 2 2B and Phi-4 Mini are primarily English-optimized. For non-English tasks at the small model scale, Qwen2.5 3B is the clear choice. See Multilingual Local LLMs for a full language comparison.
Sources
- Hugging Face Open LLM Leaderboard: open-llm-leaderboard.hf.space (MMLU and HumanEval scores)
- Microsoft Phi-4 Technical Report: microsoft.com/en-us/research/publication/phi-4-technical-report/
- Meta Llama 3.2 Model Card: huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- Google Gemma 2 Technical Report: storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf