Which Ollama Models Support Vision?
Quick Answer
Ollama supports several vision models, including LLaVA, Llama 3.2 Vision, Gemma 3, and Qwen-VL. For the easiest start, run `ollama run llava`. All of them accept images through the Ollama API.
- llava: the original vision model, best compatibility
- gemma3: Google's multimodal model, good quality
- qwen-vl: strong for document understanding
Updated: 2026-05
Key Takeaways
- Four Ollama vision models are production-ready: LLaVA, Llama 3.2 Vision, Qwen-VL, and Gemma 3
- Vision models need 1–3 GB more VRAM than their text-only equivalents, because the image encoder runs alongside the LLM
- LLaVA 7B is the safest starting point (~7 GB VRAM, broad client compatibility)
- Use Qwen-VL for chart and diagram analysis; use Llama 3.2 Vision 11B for OCR and multi-step reasoning
The Top Vision Models on Ollama
As of May 2026, Ollama supports four production-ready vision models: LLaVA, Llama 3.2 Vision, Qwen-VL, and Gemma 3. Each has a distinct strength and VRAM profile.
LLaVA is the safest starting point: it has the broadest client compatibility and works with any image format Ollama accepts. Llama 3.2 Vision 11B is the best choice for OCR and multi-step visual reasoning. Qwen-VL leads on charts, diagrams, and structured documents. Gemma 3's vision variant handles 35+ languages, which is useful when images contain non-English text such as signage, foreign-language documents, or charts with localized labels. By contrast, LLaVA and Qwen-VL are strongest on English text.
All vision models load an image encoder alongside the LLM weights. This encoder adds 1–3 GB of VRAM above what the base text-only model needs — plan for that overhead when checking your VRAM budget.
VRAM Requirements for Vision
Every vision model needs more VRAM than its text-only equivalent. A 7B vision model typically requires 7–9 GB VRAM, not the ~6 GB you would budget for a 7B text model.
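The arithmetic behind that range can be sketched in a few lines. This is only an illustration of the rule of thumb stated in this guide (text-only budget plus 1–3 GB for the image encoder); the function name `vision_vram_range` is hypothetical, and the numbers are the guide's estimates, not measurements.

```python
# Encoder overhead range quoted in this guide (an estimate, not a measurement).
ENCODER_OVERHEAD_GB = (1.0, 3.0)


def vision_vram_range(text_only_gb: float) -> tuple[float, float]:
    """Return the (low, high) VRAM estimate for the vision variant of a model."""
    low, high = ENCODER_OVERHEAD_GB
    return (text_only_gb + low, text_only_gb + high)


# A 7B text model budgeted at ~6 GB lands in the 7-9 GB range as a vision model.
print(vision_vram_range(6.0))  # (7.0, 9.0)
```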
For chart and document analysis, Qwen-VL 7B and Gemma 3 offer the most VRAM-efficient options with strong diagram understanding. For OCR and complex reasoning on images, Llama 3.2 Vision 11B justifies the extra VRAM. For the full guide on multimodal local models and use-case matching, see the multimodal local LLMs guide.
| Model | VRAM at Q4 | Image Capability |
|---|---|---|
| LLaVA 7B | ~7 GB | General image Q&A, broad compatibility |
| Llama 3.2 Vision 11B | ~10 GB | OCR, multi-step visual reasoning |
| Qwen-VL 7B | ~7 GB | Charts, diagrams, document analysis |
| Gemma 3 (vision) | ~6 GB | Multilingual image understanding |
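One way to use the table is to filter it by the VRAM you actually have. The sketch below copies the approximate Q4 figures from the table; the `VisionModel` class, the `models_that_fit` helper, and the 1 GB default headroom are illustrative assumptions, not part of Ollama.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionModel:
    name: str
    vram_gb: float  # approximate VRAM at Q4, copied from the table above
    strength: str


# Values are the table's approximations, not benchmark results.
MODELS = [
    VisionModel("LLaVA 7B", 7, "General image Q&A, broad compatibility"),
    VisionModel("Llama 3.2 Vision 11B", 10, "OCR, multi-step visual reasoning"),
    VisionModel("Qwen-VL 7B", 7, "Charts, diagrams, document analysis"),
    VisionModel("Gemma 3 (vision)", 6, "Multilingual image understanding"),
]


def models_that_fit(available_vram_gb: float, headroom_gb: float = 1.0) -> list[VisionModel]:
    """Return the models whose table VRAM fits after reserving some headroom.

    headroom_gb is an assumed safety margin for context and other processes.
    """
    usable = available_vram_gb - headroom_gb
    return [m for m in MODELS if m.vram_gb <= usable]


# Example: list what fits on an 8 GB card with the default headroom.
for model in models_that_fit(8):
    print(f"{model.name}: ~{model.vram_gb} GB, {model.strength}")
```

With an 8 GB budget and the assumed headroom, the 11B model is filtered out and the three smaller models remain. Adjust `headroom_gb` to match how much VRAM your desktop and context window already consume.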
Related Guides
- Ollama 128K Context Models: long-context models
Quick Answers About Ollama Vision Models
How do I send an image to Ollama via the API?
Send a POST request to the /api/chat endpoint with the image as a base64 string in the images array. Minimum working JSON body: {"model":"llava","messages":[{"role":"user","content":"What is in this image?","images":["<base64>"]}]} See Qwen 3 on Ollama for a multimodal-capable option with strong tool calling support.
Can vision models do OCR (read text from images)?
Yes. Of the models covered here, Llama 3.2 Vision 11B is the best choice for OCR.
Which Ollama vision model is best for charts and diagrams?
Qwen-VL leads on charts, diagrams, and structured documents.
Do vision models support multiple images in one prompt?
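The /api/chat request body described in the answers above can be built and sent with the Python standard library alone. This is a minimal sketch, assuming Ollama is listening on its default local address (http://localhost:11434) and that the llava model has already been pulled. The helper names `build_chat_payload` and `ask_about_image` are hypothetical. Because the images field is a JSON array, the payload helper takes a list of images, and "stream" is set to false so the server returns one JSON object instead of a stream of chunks.

```python
import base64
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(image_bytes_list, prompt, model="llava"):
    """Build the /api/chat JSON body with base64-encoded images."""
    images = [base64.b64encode(data).decode("ascii") for data in image_bytes_list]
    return {
        "model": model,
        "stream": False,  # ask for a single JSON response
        "messages": [{"role": "user", "content": prompt, "images": images}],
    }


def ask_about_image(path, prompt, model="llava"):
    """Read an image file, POST it to Ollama, and return the reply text."""
    with open(path, "rb") as f:
        payload = build_chat_payload([f.read()], prompt, model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]
```

A call such as `ask_about_image("photo.jpg", "What is in this image?")` would return the model's answer as a string, where photo.jpg is a placeholder path. Keeping payload construction separate from the network call makes the request body easy to inspect before anything is sent.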