Quick Answer
Ollama supports several vision models, including LLaVA, Llama 3.2 Vision, Qwen-VL, and Gemma 3. For the easiest start, run ollama run llava. All of them accept images through the Ollama API.
Updated: 2026-05
Key Takeaways
As of May 2026, Ollama supports four production-ready vision models: LLaVA, Llama 3.2 Vision, Qwen-VL, and Gemma 3. Each has a distinct strength and VRAM profile.
LLaVA is the safest starting point: it has the broadest client compatibility and works with any image format Ollama accepts. Llama 3.2 Vision 11B is the best choice for OCR and multi-step visual reasoning. Qwen-VL leads on charts, diagrams, and structured documents. Gemma 3's vision variant handles 35+ languages, which is useful when images contain non-English text such as signage, foreign-language documents, or charts with localized labels. LLaVA and Qwen-VL are strongest on English text.
All vision models load an image encoder alongside the LLM weights, so every vision model needs more VRAM than its text-only equivalent. The encoder adds 1–3 GB of VRAM on top of what the base text-only model needs, so plan for that overhead when checking your VRAM budget. A 7B vision model typically requires 7–9 GB of VRAM, not the ~6 GB you would budget for a 7B text model.
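The budgeting arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the function name estimate_vision_vram_gb is hypothetical, and the 1–3 GB encoder overhead and ~6 GB text-model baseline are simply the figures quoted in this section, not measured values.

```python
def estimate_vision_vram_gb(
    text_model_gb: float,
    encoder_overhead_gb: tuple[float, float] = (1.0, 3.0),
) -> tuple[float, float]:
    """Return a (low, high) VRAM estimate for a vision model.

    text_model_gb: VRAM the text-only model of the same size would need.
    encoder_overhead_gb: assumed extra VRAM for the image encoder (1-3 GB here).
    """
    low, high = encoder_overhead_gb
    return (text_model_gb + low, text_model_gb + high)


# A 7B text model budgeted at ~6 GB becomes a 7-9 GB vision model.
low, high = estimate_vision_vram_gb(6.0)
print(f"7B vision model: roughly {low:.0f}-{high:.0f} GB VRAM")
```

Actual usage also depends on context length and quantization, so treat the result as a planning range rather than a guarantee.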
For chart and document analysis, Qwen-VL 7B and Gemma 3 offer the most VRAM-efficient options with strong diagram understanding. For OCR and complex reasoning on images, Llama 3.2 Vision 11B justifies the extra VRAM. For the full guide on multimodal local models and use-case matching, see the multimodal local LLMs guide.
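As a compact summary of the use-case matching above, here is a minimal lookup sketch. The task keywords and the recommend_model helper are hypothetical names chosen for illustration; the model names are the display names used in this article, not necessarily the exact tags you would pass to ollama run.

```python
# Task keyword -> model recommended in this article (display names, not Ollama tags).
RECOMMENDED = {
    "general": "LLaVA 7B",
    "ocr": "Llama 3.2 Vision 11B",
    "reasoning": "Llama 3.2 Vision 11B",
    "charts": "Qwen-VL 7B",
    "documents": "Qwen-VL 7B",
    "multilingual": "Gemma 3 (vision)",
}


def recommend_model(task: str) -> str:
    """Return the article's suggested model for a task keyword.

    Unknown tasks fall back to LLaVA, described above as the safest starting point.
    """
    return RECOMMENDED.get(task.strip().lower(), "LLaVA 7B")


print(recommend_model("OCR"))
print(recommend_model("charts"))
```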
| Model | VRAM at Q4 | Image Capability |
|---|---|---|
| LLaVA 7B | ~7 GB | General image Q&A, broad compatibility |
| Llama 3.2 Vision 11B | ~10 GB | OCR, multi-step visual reasoning |
| Qwen-VL 7B | ~7 GB | Charts, diagrams, document analysis |
| Gemma 3 (vision) | ~6 GB | Multilingual image understanding |
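To check which of these models fit a given GPU, the table can be treated as data. The sketch below uses the approximate Q4 figures from the table; the models_that_fit helper and its 1 GB default headroom (left for context and other processes) are assumptions for illustration, not recommendations from Ollama.

```python
# Approximate VRAM at Q4 in GB, copied from the table above.
VRAM_AT_Q4_GB = {
    "LLaVA 7B": 7,
    "Llama 3.2 Vision 11B": 10,
    "Qwen-VL 7B": 7,
    "Gemma 3 (vision)": 6,
}


def models_that_fit(budget_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Return model names whose Q4 footprint plus headroom fits the budget."""
    return sorted(
        name
        for name, need_gb in VRAM_AT_Q4_GB.items()
        if need_gb + headroom_gb <= budget_gb
    )


# On an 8 GB card with 1 GB of headroom, the 11B model is excluded.
print(models_that_fit(8))
```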
To send an image, POST to the /api/chat endpoint with the image as a base64 string in the images array. Minimum working JSON body:
{"model":"llava","messages":[{"role":"user","content":"What is in this image?","images":["<base64>"]}]}
See Qwen 3 on Ollama for a multimodal-capable option with strong tool calling support.
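The request above can be built and sent from Python using only the standard library. This is a minimal sketch under a few assumptions: Ollama is listening on its default local address http://localhost:11434, streaming is turned off so the reply arrives as a single JSON object, and the helper names build_chat_body and send_chat are hypothetical.

```python
import base64
import json
import urllib.request


def build_chat_body(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the /api/chat JSON body with one base64-encoded image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # assumption: request one complete reply, not a stream
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
    }


def send_chat(body: dict, host: str = "http://localhost:11434") -> str:
    """POST the body to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

A typical call would read the image as bytes, for example from a hypothetical photo.jpg opened in "rb" mode, pass them to build_chat_body("llava", "What is in this image?", image_bytes), and hand the result to send_chat. The model must already be pulled and the server running for the request to succeed.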