Key Points
- Multimodal = text + images (+ audio). Process images natively without OCR preprocessing.
- Best models (2026): Llama 3.2 Vision 11B, Qwen2-VL 7B, Gemma 3 Vision 9B.
- Use cases: Document OCR, image analysis, visual Q&A, table extraction.
- Speed: 2–5 seconds per image (11B model). Slower than text-only, but practical.
- As of April 2026, multimodal is mature for specific use cases, not yet general-purpose.
Multimodal Models Available (April 2026)
| Model | Image Support | VRAM | Speed per Image | Best For |
|---|---|---|---|---|
| Llama 3.2 Vision 11B | Yes | 8 GB | 2–5 s | General vision |
| Qwen2-VL 7B | Yes | 5 GB | — | Fast vision |
| Gemma 3 Vision 9B | Yes | 6 GB | — | Balanced |
| Llama 3.2 Vision 90B | Yes | 55 GB | — | High quality |
Vision Capabilities
Multimodal models can:
- Image description: Explain what is in an image.
- OCR (Optical Character Recognition): Extract text from images (business card, document scan).
- Visual Q&A: Answer questions about images ("What is the brand of the car?").
- Table extraction: Parse tables from images into structured data.
- Chart analysis: Interpret data visualizations.
- Object detection: Identify and locate objects in images.
Setup and Usage
Using Llama 3.2 Vision with Ollama:
```python
# Pull the model first (shell): ollama pull llama3.2-vision:11b
from ollama import Client

client = Client()

# Read the image as raw bytes
with open("image.jpg", "rb") as f:
    image_data = f.read()

response = client.generate(
    model="llama3.2-vision:11b",
    prompt="Describe this image",
    images=[image_data],  # pass image data alongside the prompt
)
print(response["response"])
```
Real-World Use Cases
- Document processing: Extract text from scanned PDFs without external OCR service.
- Content moderation: Flag inappropriate images without sending to cloud.
- Accessibility: Describe images for visually impaired users.
- Product analysis: Analyze product images in e-commerce (category, condition, defects).
- Research: Analyze scientific charts and diagrams.
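The document-processing case above can be sketched with the same Ollama client used earlier. This is a minimal sketch: the `ocr_image` helper and its prompt wording are illustrative assumptions, not part of the Ollama API; the import is done lazily so the helper can be exercised with a stub client.

```python
# Hypothetical OCR helper built on the Ollama Python client.
OCR_PROMPT = (
    "Extract all text from this image. "
    "Return the text exactly as it appears, preserving line breaks."
)

def ocr_image(path, model="llama3.2-vision:11b", client=None):
    """Read an image file and ask a local vision model to transcribe it."""
    if client is None:
        # Lazy import so a stub client can be injected without ollama installed
        from ollama import Client
        client = Client()
    with open(path, "rb") as f:
        image_data = f.read()
    response = client.generate(model=model, prompt=OCR_PROMPT, images=[image_data])
    return response["response"]
```

Because the model does the transcription directly, there is no separate OCR engine to install or call out to; the trade-off is the 2–5 s per-image latency noted below.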
Performance and Limitations
Accuracy: Good for document OCR and description, but not perfect for detailed analysis or small objects.
Speed: 2–5 seconds per image. Cloud models (GPT-4 Vision) are 10–50× faster.
Image size: Supports up to ~1000×1000 pixels. Larger images are downsampled.
Limitations: Cannot match GPT-4 Vision accuracy on complex scenes. Trade-off: privacy vs. quality.
Common Mistakes
- Expecting GPT-4 Vision-level accuracy. Local models are roughly 20–30% less accurate. Use them for specific domains, not general-purpose vision.
- Not preparing images. Crop images to focus area. Remove noise. Better input = better output.
- Using 7B models for complex vision. Small models struggle with subtle details. Use 11B+ for reliable vision.
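The "prepare your images" advice above can be sketched with Pillow: since images larger than roughly 1000×1000 get downsampled anyway, resizing client-side keeps control over the result. The `prepare_image` helper and the 1000 px threshold are assumptions for illustration.

```python
import io

from PIL import Image

MAX_SIDE = 1000  # models downsample beyond roughly 1000x1000 anyway

def prepare_image(path, max_side=MAX_SIDE):
    """Scale an image so its longest side fits max_side; return JPEG bytes."""
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1:
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)  # high-quality downscale
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return buf.getvalue()
```

The returned bytes can be passed straight into the `images=[...]` parameter shown in the setup section; cropping to the region of interest before resizing helps further.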
Sources
- Llama 3.2 Vision Model Card — huggingface.co/meta-llama/Llama-3.2-11B-Vision
- Qwen2-VL — github.com/QwenLM/Qwen2-VL