
How to Run Qwen2-VL Locally in 2026: Document OCR & Vision Guide

11 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Run `ollama pull qwen2-vl:7b` on any machine with 8 GB of VRAM to read Chinese, Japanese, and mixed-language documents locally. Qwen2-VL is the strongest open vision model for multilingual OCR, and every image is processed on your machine, with no cloud upload.

Qwen2-VL is Alibaba's open vision-language model, and its 7B variant runs locally in about 6 GB of VRAM via Ollama or LM Studio. It reads documents, screenshots, charts, and photos, and it matches or beats every other local vision model on Chinese, Japanese, and Korean OCR. This guide covers model selection, hardware, Ollama and LM Studio setup, multilingual document extraction, and how Qwen2-VL compares to LLaVA and Llama 3.2 Vision.

Key Takeaways

  • Qwen2-VL 7B runs locally in ~6 GB of VRAM (Q4) via Ollama: a single `ollama pull qwen2-vl:7b` command, no model conversion required.
  • Best local model for multilingual OCR: Qwen2-VL ties MiniCPM-V 2.6 and beats LLaVA 1.6 and Llama 3.2 Vision 11B on Chinese, Japanese, and Korean text.
  • Native resolution up to 4096×4096: reads high-DPI scans without downsampling, unlike LLaVA 1.6 (672×672) or Llama 3.2 Vision (1120×1120).
  • Three sizes: 2B (~3 GB VRAM, fast and basic), 7B (~6 GB, recommended for most users), 72B (~48 GB, tops open-source benchmarks).
  • Accepts up to 8 images per request, the highest multi-image capacity among local vision models.
  • No direct PDF input: convert PDF pages to PNG or JPEG first, then send each page as a separate image.
  • 100% offline once downloaded: no API key and no cloud upload. Every document stays on your machine, so the AI step itself sends no data to a third party, which simplifies GDPR data-transfer questions.

Why Qwen2-VL Leads Local Vision Models for Multilingual OCR

Qwen2-VL is the strongest local vision model for multilingual document OCR: it matches or beats every other model that runs on consumer hardware at reading Chinese, Japanese, Korean, and English text. Alibaba trained it on large-scale multilingual document corpora, which is why it outperforms LLaVA 1.6 and Llama 3.2 Vision 11B on non-English text extraction.

Qwen2-VL supports dynamic input resolution up to 4096×4096 pixels. LLaVA 1.6 caps at 672×672 and Llama 3.2 Vision at 1120×1120, so both downsample high-DPI scans before reading them. Qwen2-VL reads a 300-DPI A4 scan at native resolution, which is the main reason its OCR accuracy is higher on dense documents and small CJK characters.
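To see why a 300-DPI A4 scan fits, the arithmetic can be worked out directly. The sketch below is illustrative only: it assumes the 4096-pixel per-side limit quoted above and standard A4 dimensions (210 × 297 mm), and the function names are made up for this example.

```python
import math

MM_PER_INCH = 25.4
MAX_SIDE_PX = 4096  # per-side limit quoted in this guide (assumption)


def scan_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def fits_native(width_mm: float, height_mm: float, dpi: int,
                max_side: int = MAX_SIDE_PX) -> bool:
    """True if both sides of the scan stay within the per-side limit."""
    width_px, height_px = scan_pixels(width_mm, height_mm, dpi)
    return width_px <= max_side and height_px <= max_side


def max_native_dpi(width_mm: float, height_mm: float,
                   max_side: int = MAX_SIDE_PX) -> int:
    """Highest whole-number DPI at which the longest side still fits."""
    longest_inches = max(width_mm, height_mm) / MM_PER_INCH
    return math.floor(max_side / longest_inches)


A4 = (210, 297)  # millimetres
print(scan_pixels(*A4, 300))  # (2480, 3508)
print(fits_native(*A4, 300))  # True
print(max_native_dpi(*A4))    # 350
```

Under these assumptions an A4 page stays inside the limit up to about 350 DPI; a 400-DPI scan would need to be downscaled first.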

Running Qwen2-VL locally adds no per-image fee once you own the hardware. A cloud vision API bills roughly $0.01–0.03 per image; at 10,000 images per month that is $100–300 saved, and no document ever leaves your machine.
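The savings figure is a simple multiplication, sketched here so you can plug in your own volume. The per-image prices are the rough range quoted above, not a quote from any specific provider, and the function name is illustrative.

```python
def monthly_cloud_cost(images_per_month: int, price_per_image: float) -> float:
    """Monthly bill in dollars for a per-image cloud vision API."""
    return round(images_per_month * price_per_image, 2)


# Rough price range from this guide (assumption, varies by provider)
low = monthly_cloud_cost(10_000, 0.01)
high = monthly_cloud_cost(10_000, 0.03)
print(f"${low:.0f} to ${high:.0f} per month")  # $100 to $300 per month
```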

Use Qwen2-VL if your documents contain CJK text, small fonts, or high-DPI scans. If your work is English-only photo Q&A, Llama 3.2 Vision 11B is an equally good choice.

In One Sentence

Qwen2-VL is the most accurate local vision model for Chinese, Japanese, and Korean document OCR, running in ~6 GB of VRAM via Ollama.

💬 In Plain Terms

A vision-language model reads images instead of generating them. You give Qwen2-VL a photo or a scanned page, and it returns text β€” a description, an answer, or the extracted contents.

Choosing Your Qwen2-VL Model Size

Qwen2-VL comes in three sizes. Choose based on your VRAM and the accuracy you need. All sizes are on Hugging Face (Qwen) and in the Ollama model library with explicit tags.

| Model | VRAM (Q4) | Ollama tag | Best For |
| --- | --- | --- | --- |
| Qwen2-VL 2B Q4 | ~3 GB | qwen2-vl:2b | Fast captions, simple OCR, low-VRAM laptops |
| Qwen2-VL 7B Q4 | ~6 GB | qwen2-vl:7b | Recommended: document OCR, image Q&A, charts |
| Qwen2-VL 72B Q4 | ~48 GB | qwen2-vl:72b | Maximum quality, Apple Silicon 64 GB+ or multi-GPU |

Q4_K_M is the recommended quantization because it offers the best quality-to-size ratio. Most users should start with Qwen2-VL 7B: it fits an 8 GB GPU and handles every use case in this guide. Drop to the 2B model only when your GPU has less than 8 GB of VRAM. See LLM quantization explained for how Q4 affects quality.

Hardware Requirements for Qwen2-VL

  • Minimum (Qwen2-VL 7B Q4): GPU with 8 GB VRAM, such as an NVIDIA RTX 4060, RTX 3060 12 GB, or RTX 2080.
  • Low-VRAM option (Qwen2-VL 2B Q4): 4 GB VRAM, which covers most laptop GPUs and the integrated GPU on Apple Silicon.
  • Maximum quality (Qwen2-VL 72B Q4): ~48 GB, meaning Apple Silicon with 64 GB+ unified memory, or two 24 GB GPUs.
  • Apple Silicon: an M-series chip with 16 GB+ unified memory runs the 7B model comfortably; 64 GB+ is needed for the 72B.
  • System RAM: 16 GB minimum alongside GPU inference; 32 GB recommended with a full dev environment open.
  • Storage: ~6 GB free disk space for Qwen2-VL 7B Q4 (GGUF), and roughly 45 GB or more for the 72B, since Q4 weights on disk are close to their ~48 GB VRAM footprint.

📌 Note: Vision models run roughly 30–60% slower than a text-only model of the same parameter count. The vision encoder processes the entire image before the first token is generated; text then generates at near-normal speed. Budget VRAM for the encoder as well as the language model.
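The sizing guidance above reduces to a small lookup. This is a minimal sketch that uses the minimum-VRAM figures from the hardware list (4, 8, and 48 GB); the helper name is hypothetical, and real headroom depends on image size and context length.

```python
from typing import Optional

# (minimum VRAM in GB, Ollama tag), largest first.
# Thresholds come from this guide's hardware list, not from a benchmark.
MODEL_TIERS = [
    (48, "qwen2-vl:72b"),
    (8, "qwen2-vl:7b"),
    (4, "qwen2-vl:2b"),
]


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest Qwen2-VL tag that fits, or None if nothing does."""
    for min_vram, tag in MODEL_TIERS:
        if vram_gb >= min_vram:
            return tag
    return None


print(pick_model(8))   # qwen2-vl:7b
print(pick_model(6))   # qwen2-vl:2b
print(pick_model(2))   # None
```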

Setting Up Qwen2-VL with Ollama

Ollama is the fastest way to run Qwen2-VL locally. It downloads the model, manages quantization, and exposes an API at localhost:11434. Install it from ollama.com or, if you are new to it, start with how to install Ollama.

  1. Install Ollama
    Why it matters: Ollama handles model download, GGUF format, and the local API. It is available for macOS, Linux, and Windows.
  2. Pull Qwen2-VL with an explicit size tag
    Why it matters: Use qwen2-vl:7b. The bare qwen2-vl tag can resolve to a different size, so always specify 2b, 7b, or 72b to get the model this guide targets.
  3. Run the model and attach an image
    Why it matters: In interactive mode, type the image file path inside your prompt. Ollama detects the path and loads the image into the vision encoder.
  4. Send images through the API
    Why it matters: The /api/generate endpoint accepts a base64-encoded images array. This is how applications, including PromptQuorum, send images programmatically.
  5. Verify multilingual OCR
    Why it matters: Send a Chinese or Japanese document scan and confirm the extracted text matches. This proves the vision encoder and tokenizer handle CJK script correctly before you build on it.
bash
# Step 1: Install Ollama
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download from https://ollama.com/download

# Step 2: Pull Qwen2-VL 7B (explicit size tag)
ollama pull qwen2-vl:7b
# Downloads Qwen2-VL 7B Q4_K_M (~6 GB)

# Step 3: Run and attach an image (interactive)
ollama run qwen2-vl:7b
>>> Extract every line of text from ./invoice-jp.png

# Step 4: Send an image through the API
# Encode the image first:  base64 -i scan.png (macOS)  or  base64 -w0 scan.png (Linux)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2-vl:7b",
  "prompt": "Extract every line of text from this document. Preserve line breaks.",
  "images": ["<base64-encoded-image>"],
  "stream": false
}'

# Step 5: Verify multilingual OCR
ollama run qwen2-vl:7b
>>> Extract all text from this image: ./contract-zh.png

⚠️ Warning: Send document images at 150 DPI or higher. Qwen2-VL reads up to 4096×4096 natively, so high-resolution scans directly improve accuracy. Image quality is the single biggest factor in OCR results: a blurry scan produces wrong characters no matter how good the model is.
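The same API call can be made from a script. Here is a minimal sketch using only the Python standard library; it assumes Ollama is running on its default port with `qwen2-vl:7b` already pulled, and the helper names `build_payload` and `extract_text` are illustrative.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(image_bytes: bytes, prompt: str,
                  model: str = "qwen2-vl:7b") -> dict:
    """Build the /api/generate request body with one raw-base64 image."""
    return {
        "model": model,
        "prompt": prompt,
        # Raw base64 only: no "data:image/png;base64," prefix for Ollama.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def extract_text(image_path: str, prompt: str) -> str:
    """Send one image to the local Ollama server and return the model's reply."""
    with open(image_path, "rb") as image_file:
        payload = build_payload(image_file.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `extract_text("scan.png", "Extract every line of text from this document.")` should return the extracted text as a string.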

Setting Up Qwen2-VL with LM Studio

LM Studio runs Qwen2-VL through a graphical interface with no CLI commands. It is the recommended path for Windows users and anyone who prefers a GUI. Download it from lmstudio.ai, or see how to install LM Studio.

  1. Download and install LM Studio
    Why it matters: A free, cross-platform GUI for local model inference. No terminal required.
  2. Search for Qwen2-VL in the model browser
    Why it matters: Search "Qwen2-VL 7B" and select a Q4_K_M GGUF build. LM Studio marks vision-capable models with an image icon.
  3. Load the model and attach an image
    Why it matters: Click the image icon in the chat input to upload a photo or scan. LM Studio passes it to the vision encoder.
  4. Start the local server
    Why it matters: The "Start Server" button exposes an OpenAI-compatible API at localhost:1234. Vision requests use the standard image_url content format.
OpenAI-compatible vision request body for LM Studio (POST to http://localhost:1234/v1/chat/completions):
json
{
  "model": "qwen2-vl-7b",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Extract all text from this document." },
        {
          "type": "image_url",
          "image_url": { "url": "data:image/png;base64,<base64-encoded-image>" }
        }
      ]
    }
  ]
}
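A request like this can be built and sent from Python with the standard library alone. The sketch below assumes LM Studio's server is running on port 1234 with a vision-capable Qwen2-VL build loaded; the model name `qwen2-vl-7b` is copied from the request above and may differ from the identifier LM Studio shows for your download.

```python
import base64
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"


def build_chat_request(image_bytes: bytes, prompt: str,
                       model: str = "qwen2-vl-7b",
                       mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat body with one image as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    },
                ],
            }
        ],
    }


def ask(image_path: str, prompt: str) -> str:
    """Send one image to the LM Studio server and return the reply text."""
    with open(image_path, "rb") as image_file:
        body = build_chat_request(image_file.read(), prompt)
    request = urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["choices"][0]["message"]["content"]
```

Note the difference from Ollama's native API: here the image travels as a `data:` URL inside the message content.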

Document OCR for Chinese, Japanese, and Mixed-Language Files

Qwen2-VL extracts text from Chinese, Japanese, Korean, and mixed-language documents at least as accurately as any other local vision model. Its training data included large multilingual document corpora, and its 4096×4096 native resolution reads small CJK characters that LLaVA 1.6 and Llama 3.2 Vision downsample and miss.

The most reliable pattern is a specific extraction prompt. Ask for structure ("preserve the table layout", "return each field as key: value") instead of a vague "read this". Qwen2-VL follows formatting instructions closely, which keeps the output usable without post-processing.

In One Sentence

To extract text from a CJK document with Qwen2-VL, send the image at 150+ DPI with a specific prompt that asks for structure, such as "return each field as key: value".

💬 In Plain Terms

OCR means turning a picture of text into editable text. Qwen2-VL looks at a scanned page and types out what it sees, and it handles Chinese and Japanese characters as well as it handles English.

  • Plain text extraction: "Extract every line of text from this image. Preserve line breaks and reading order."
  • Structured fields: "This is a Japanese invoice. Return vendor, date, subtotal, tax, and total as key: value pairs."
  • Table extraction: "Extract this table as CSV. Treat the first row as the header."
  • Extract and translate in one pass: "Extract the Chinese text from this image, then translate it to English. Show both."
bash
# Japanese invoice -> structured fields
ollama run qwen2-vl:7b
>>> This is a Japanese invoice. Extract vendor name, invoice date,
    subtotal, consumption tax, and total. Return as key: value pairs.
    ./invoice-jp.png

# Example output:
# vendor: Sample Trading Co., Ltd.
# date: 2026-04-30
# subtotal: 84,000 JPY
# tax: 8,400 JPY
# total: 92,400 JPY

Important: Always verify extracted numbers against the source document. Local vision models, Qwen2-VL included, can misread a digit on a low-quality scan. Treat OCR output as a draft to confirm, not a final value, especially for invoices and financial documents.
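Part of that verification can be automated. The sketch below parses `key: value` output like the invoice example above and checks that subtotal plus tax equals the total. It assumes whole-number amounts (as with yen) and the field names from that example; a passing check does not replace comparing against the source document.

```python
import re


def parse_fields(text: str) -> dict[str, str]:
    """Parse 'key: value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def to_amount(value: str) -> int:
    """Keep only digits: '84,000 JPY' -> 84000. Assumes no decimal part."""
    digits = re.sub(r"\D", "", value)
    if not digits:
        raise ValueError(f"no digits found in {value!r}")
    return int(digits)


def totals_consistent(fields: dict[str, str]) -> bool:
    """True if subtotal + tax == total; a mismatch hints at a misread digit."""
    return (
        to_amount(fields["subtotal"]) + to_amount(fields["tax"])
        == to_amount(fields["total"])
    )


sample = """vendor: Sample Trading Co., Ltd.
date: 2026-04-30
subtotal: 84,000 JPY
tax: 8,400 JPY
total: 92,400 JPY"""

fields = parse_fields(sample)
print(fields["vendor"])           # Sample Trading Co., Ltd.
print(totals_consistent(fields))  # True
```

A `False` result is a cheap signal to re-scan or check the page by hand before the numbers go anywhere else.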

Image Q&A, Screenshot Analysis, and Chart Reading

Beyond OCR, Qwen2-VL handles general image understanding: describing photos, answering questions about screenshots, and reading charts.

  • Image Q&A: ask open questions about a photo, such as "What is in this image?" or "How many people are wearing red?". Qwen2-VL 7B is accurate on clear photos, weaker on cluttered or ambiguous scenes.
  • Screenshot and UI analysis: Qwen2-VL reads UI screenshots, error dialogs, and app states. For dense code screenshots specifically, InternVL 2.5 is trained harder on that data, so use it if UI and code is your main workload.
  • Chart and graph reading: Qwen2-VL describes chart structure and trends well, but precise numeric values pulled from charts are unreliable on every local vision model. Confirm exact figures against the underlying data.
  • Video frames: Qwen2-VL accepts multiple frames as a sequence. Sample roughly one frame per second and send up to 8 to summarize a short clip.
  • Multi-image comparison: send up to 8 images in one request to compare versions, spot differences, or batch-describe a set.
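For clips or image sets that exceed the 8-image limit, one approach is to sample frames and split them into requests of at most 8. The sketch below only computes indices and batches; it does not decode video, and the function names are illustrative.

```python
def one_per_second(duration_s: float, fps: float) -> list[int]:
    """Frame indices sampled once per whole second of the clip."""
    return [round(second * fps) for second in range(int(duration_s))]


def batches(items: list, size: int = 8) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    return [items[start:start + size] for start in range(0, len(items), size)]


frames = one_per_second(duration_s=20, fps=30)
print(frames[:3])                                # [0, 30, 60]
print([len(group) for group in batches(frames)]) # [8, 8, 4]
```

Each batch then becomes one request, and the per-batch summaries can be combined in a final text-only prompt.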

💡 Tip: Use Qwen2-VL for OCR, multilingual documents, and general image Q&A. Switch to InternVL 2.5 when your main workload is code or UI screenshots, or to Moondream 2 when you have under 4 GB of VRAM.

Qwen2-VL vs LLaVA vs Llama 3.2 Vision

For multilingual OCR, Qwen2-VL beats LLaVA 1.6 and matches or beats Llama 3.2 Vision 11B at lower VRAM. For English-only photo Q&A, Llama 3.2 Vision 11B is an equally strong pick. LLaVA 1.6 remains the most documented model if you need community troubleshooting resources.

| Model | VRAM (Q4) | OCR / CJK | Max Resolution | Best For |
| --- | --- | --- | --- | --- |
| Qwen2-VL 7B | ~6 GB | Excellent | 4096×4096 | Multilingual OCR, high-DPI scans |
| Llama 3.2 Vision 11B | ~8 GB | Good | 1120×1120 | English photo Q&A, general docs |
| LLaVA 1.6 7B | ~6 GB | Fair | 672×672 | General Q&A, community support |
| MiniCPM-V 2.6 8B | ~6 GB | Excellent | 1792×1792 | Document OCR (English-leaning) |
| InternVL 2.5 8B | ~8 GB | Good | High | Code and UI screenshots |

All five run via Ollama (InternVL 2.5 through community builds). For the full local vision model survey, including Moondream 2 and an invoice-extraction benchmark, see the local vision models comparison. If unsure, start with Qwen2-VL 7B: it covers OCR, documents, and general Q&A in 6 GB of VRAM.

Connecting Local Qwen2-VL to PromptQuorum

PromptQuorum routes prompts across multiple models. To use local Qwen2-VL as a vision dispatch target, point PromptQuorum's local LLM endpoint at your Ollama server. Image processing then stays on your hardware, while cloud models remain available for text tasks.

This is the Ollama (OpenAI-compatible) endpoint, separate from the Anthropic API configuration used for Claude. Both can be active at once, with PromptQuorum routing by task type and data sensitivity.

In One Sentence

Connect PromptQuorum to local Qwen2-VL by setting OLLAMA_BASE_URL to http://localhost:11434/v1 and pointing the local vision model to qwen2-vl:7b.

bash
# PromptQuorum dispatch config: local Qwen2-VL via Ollama
# Set in your .env or the PromptQuorum settings panel

OLLAMA_BASE_URL=http://localhost:11434/v1
LOCAL_VISION_MODEL=qwen2-vl:7b

# Example routing rules:
# - task_type: ocr / image  -> qwen2-vl:7b        (local Ollama, no cloud upload)
# - task_type: text         -> claude-sonnet-4-6  (Anthropic API, separate config)
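
The routing rules in those comments can be pictured as a small dispatch function. This is a hypothetical sketch, not PromptQuorum's actual implementation: the `route` function and its return shape are invented for illustration, and only the two environment variable names and the model identifiers come from the config above.

```python
import os

# Values fall back to the defaults shown in the config above.
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1")
LOCAL_VISION_MODEL = os.environ.get("LOCAL_VISION_MODEL", "qwen2-vl:7b")
CLOUD_TEXT_MODEL = "claude-sonnet-4-6"  # configured separately (Anthropic API)

LOCAL_TASKS = {"ocr", "image"}  # task types that must stay on local hardware


def route(task_type: str) -> dict:
    """Pick a dispatch target for a task type (illustrative only)."""
    if task_type in LOCAL_TASKS:
        return {
            "target": "local",
            "base_url": OLLAMA_BASE_URL,
            "model": LOCAL_VISION_MODEL,
        }
    if task_type == "text":
        return {"target": "cloud", "model": CLOUD_TEXT_MODEL}
    raise ValueError(f"unknown task type: {task_type!r}")


print(route("ocr")["target"])   # local
print(route("text")["target"])  # cloud
```

The point of the design is that image tasks never reach a branch that returns a cloud target.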

Troubleshooting Qwen2-VL

  • "unknown model" or the pull fails: use an explicit size tag, `ollama pull qwen2-vl:7b`, not `qwen2-vl`. Run `ollama list` to confirm the installed name.
  • The image is ignored and the model answers as if no image was sent: confirm the file path is correct and readable. In the Ollama API, the `images` array must contain raw base64 *without* the `data:` prefix. The `data:` prefix is LM Studio and OpenAI format only.
  • Garbled or missing CJK characters: the scan is too low-resolution. Re-scan at 150–300 DPI. Qwen2-VL reads up to 4096×4096, so higher input resolution directly improves Chinese and Japanese accuracy.
  • CUDA out of memory: the model does not fit your VRAM. Drop to Qwen2-VL 2B (~3 GB) or run on Apple Silicon, which shares unified memory between CPU and GPU.
  • Slow first response, then fast: this is normal. The vision encoder processes the whole image before the first token is generated; text then generates at near-normal speed.
  • Wrong numbers extracted from an invoice or chart: local vision models misread digits on noisy input. Increase scan quality and always verify numeric output against the source.
  • A PDF will not load: no local vision model accepts PDF directly. Convert pages to PNG or JPEG first (with pdf2image or pypdfium2), then send each page as a separate image.
  • LM Studio shows "failed to load model": either insufficient VRAM, or you downloaded a non-vision GGUF. Confirm the model card lists vision support and pick the Q4_K_M build.

💡 Tip: Run `ollama ps` to see which models are loaded in VRAM and how much memory each uses. Use `ollama stop qwen2-vl:7b` to unload the model before switching to the 72B.

FAQ

What is the minimum hardware to run Qwen2-VL locally?

Qwen2-VL 7B at Q4_K_M quantization needs 8 GB of VRAM (RTX 4060, RTX 3060 12 GB, or RTX 2080). The smaller Qwen2-VL 2B runs in 4 GB. The 72B model needs ~48 GB: Apple Silicon with 64 GB+ unified memory or two 24 GB GPUs. Apple Silicon with 16 GB+ unified memory runs the 7B model comfortably.

Is Qwen2-VL better than LLaVA for OCR?

Yes, especially for non-English text. Qwen2-VL ties MiniCPM-V 2.6 and beats LLaVA 1.6 and Llama 3.2 Vision 11B on Chinese, Japanese, and Korean OCR. Its native 4096×4096 resolution reads high-DPI scans without downsampling, while LLaVA 1.6 caps at 672×672. LLaVA still has the largest community and the most tutorials.

Can Qwen2-VL read PDFs directly?

No. No local vision model accepts PDF input directly. Convert each PDF page to a PNG or JPEG image first (using pdf2image or pypdfium2), then send each page as a separate image request. For a 10-page PDF you send 10 image queries and combine the results.
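
The conversion step might look like the sketch below, which uses pypdfium2 (one of the two libraries named above) and needs `pip install pypdfium2 pillow`. The helper names and the output filename pattern are illustrative; the third-party import sits inside the function so the pure helpers work without it.

```python
from pathlib import Path


def dpi_to_scale(dpi: int) -> float:
    """PDF user space is 72 units per inch, so a scale of 1.0 renders at 72 DPI."""
    return dpi / 72


def page_path(pdf_path: str, index: int, out_dir: str) -> Path:
    """Output file for a zero-based page index, e.g. contract-page-001.png."""
    return Path(out_dir) / f"{Path(pdf_path).stem}-page-{index + 1:03d}.png"


def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render every page of a PDF to a PNG file and return the written paths."""
    import pypdfium2 as pdfium  # third-party dependency, imported lazily

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)
    written = []
    for index in range(len(pdf)):
        image = pdf[index].render(scale=dpi_to_scale(dpi)).to_pil()
        target = page_path(pdf_path, index, out_dir)
        image.save(target)
        written.append(target)
    return written
```

Calling `pdf_to_pngs("contract.pdf", "pages")` on a 10-page file should leave ten PNGs in `pages/`, each ready to send as its own image request.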

How do I send an image to Qwen2-VL through Ollama?

Two ways. In interactive mode (`ollama run qwen2-vl:7b`), type the image file path inside your prompt; Ollama detects it and loads the image. Through the API, POST to /api/generate with a base64-encoded `images` array. The base64 string must not include the `data:` prefix.
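
If your code already holds images as `data:` URLs (the LM Studio and OpenAI format), a small helper can strip the prefix before calling Ollama. This is a minimal sketch with illustrative function names.

```python
def to_raw_base64(value: str) -> str:
    """Strip a 'data:<mime>;base64,' prefix if present; Ollama wants raw base64."""
    if value.startswith("data:"):
        _header, separator, payload = value.partition(",")
        if not separator:
            raise ValueError("malformed data URL: missing comma")
        return payload
    return value


def to_data_url(raw_base64: str, mime: str = "image/png") -> str:
    """Wrap raw base64 in a data URL for OpenAI-style image_url fields."""
    return f"data:{mime};base64,{raw_base64}"


print(to_raw_base64("data:image/png;base64,iVBORw0KGgo="))  # iVBORw0KGgo=
print(to_data_url("iVBORw0KGgo="))  # data:image/png;base64,iVBORw0KGgo=
```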

Does Qwen2-VL run fully offline?

Yes. After the one-time model download, Qwen2-VL runs entirely on your machine, with no API key and no cloud account. No image is uploaded anywhere, which keeps document processing inside your infrastructure. See the Qwen local GDPR setup guide for the compliance implications.

How many images can Qwen2-VL process at once?

Up to 8 images per request, the highest multi-image capacity among local vision models. This makes it well suited to comparing document versions, spotting differences, or summarizing a short video sampled at one frame per second.

Qwen2-VL or Llama 3.2 Vision β€” which should I choose?

Choose Qwen2-VL for Chinese, Japanese, or Korean documents, high-DPI scans, or small fonts. It also needs less memory: the 7B fits in 6 GB of VRAM versus 8 GB for Llama 3.2 Vision 11B. Choose Llama 3.2 Vision 11B for English-only general photo Q&A, where the two are comparable.

Why are the characters garbled in my OCR output?

Almost always a low-resolution scan. Qwen2-VL reads up to 4096×4096 natively, so re-scanning the document at 150–300 DPI usually fixes garbled or missing characters. Low-quality input is the single biggest cause of OCR errors on every local vision model.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
