Quick Answer
4 GB VRAM is tight but usable. Best options: Phi-4 Mini at Q4 (~3.2 GB), Gemma 2 2B (~1.5 GB), and SmolLM 1.7B (~1.0 GB). Llama 3 8B will not fit.
Updated: 2026-05
Key Takeaways
As of May 2026, at 4 GB VRAM you are limited to models with 3 billion parameters or fewer at Q4 quantization. This rules out every mainstream local model β Llama 3 8B, Mistral 7B, Qwen 14B. Three modern small models perform surprisingly well: Phi-4 Mini matches GPT-3.5 on instruction following, Gemma 2 2B handles fast chat, and SmolLM 1.7B runs on integrated graphics.
Phi-4 Mini is the top pick at this tier. Despite its small size, it handles general Q&A, light coding, and document summarization at ~25 tokens per second. Gemma 2 2B is faster for single-turn chat. SmolLM 1.7B is the fallback if even Phi-4 Mini pushes your VRAM too close to the limit.
| Model | VRAM | Best For |
|---|---|---|
| Phi-4 Mini Q4 | 3.2 GB | Best quality at 4 GB |
| Gemma 2 2B Q4 | 1.5 GB | Fast single-turn chat |
| SmolLM 1.7B Q4 | 1.0 GB | Minimal VRAM footprint |
These models are commonly requested but require more than 4 GB VRAM at every quantization level:
Upgrading to 6 GB unlocks Llama 3 8B and Mistral 7B β the two most popular local models. See the best local LLMs for 6 GB VRAM. For a full hardware comparison, see fastest local LLMs for low-end PCs.