Key Takeaways
- Qwen2.5 7B runs in 5.5 GB of VRAM: one `ollama pull qwen2.5:7b` command and you're running at 57 tokens/sec on an RTX 3060.
- Three distinct sub-families: Qwen2.5 (general), Qwen2.5-Coder (coding, 92.7% HumanEval at 32B), Qwen2-VL (vision, best CJK OCR locally).
- Dense architecture = consumer-friendly: unlike DeepSeek's 236B MoE model (needs ~130 GB RAM), Qwen2.5 72B fits in 46 GB VRAM on two RTX 3090s.
- Native multilingual: pretrained on Chinese, Japanese, Korean, Arabic, German, French, and 23 more languages. Qwen2.5 consistently beats Llama 3.3 on CJK tasks.
- Q4_K_M is the right quantization for most users: ~55% VRAM reduction, less than 1% quality loss on benchmarks.
- Hardware decision: 12 GB VRAM → 14B model; 24 GB VRAM → 32B; 48 GB+ (two GPUs or Apple Silicon 64 GB) → 72B.
In One Sentence
Qwen2.5 covers three local-deployment sub-families: general (7B–72B), coding (Coder 7B–32B), and vision (VL 7B–72B), all runnable via Ollama or LM Studio.
In Plain Terms
Running a model locally means the AI runs on your own computer instead of a cloud server. No data leaves your machine, and there is no per-token cost after hardware.
Qwen2.5 Model Family Overview
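The Qwen2.5 family covers three distinct tasks (general reasoning, coding, and vision), each with multiple size options from 7B to 72B parameters. All are open-weight models published by Alibaba's Qwen team on Hugging Face, most of them under the Apache 2.0 licence. Check the model card for the size you plan to use, because the 72B releases ship under a separate Qwen licence.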
Choose the sub-family first, then the size that fits your VRAM. Mixing sub-families is common: run Qwen2.5-Coder 14B for code completion and Qwen2.5 7B for document summarisation.
| Sub-family | Sizes available | Primary use | Ollama tag prefix |
|---|---|---|---|
| Qwen2.5 | 7B, 14B, 32B, 72B | General reasoning, Chinese/multilingual tasks, RAG | qwen2.5: |
| Qwen2.5-Coder | 7B, 14B, 32B | Code generation, debugging, HumanEval, SWE-bench | qwen2.5-coder: |
| Qwen2-VL | 2B, 7B, 72B | Document OCR, image Q&A, CJK text extraction | qwen2-vl: |
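The mixing pattern described above can be captured in a small shell helper. This is a minimal sketch: the function name `model_for_task` and the task labels are hypothetical, and the tags simply follow the prefixes in the table.

```shell
# Map a task label to an Ollama tag.
# The task labels are illustrative; they are not an Ollama feature.
model_for_task() {
  case "$1" in
    code|debug)     echo "qwen2.5-coder:14b" ;;  # coding sub-family
    summarise|chat) echo "qwen2.5:7b" ;;         # general sub-family
    ocr|image)      echo "qwen2-vl:7b" ;;        # vision sub-family
    *)              echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# Example (needs Ollama installed and the model pulled):
#   ollama run "$(model_for_task code)"
```

Keeping the mapping in one place makes it easy to swap sizes later, for example moving the coding entry to `qwen2.5-coder:32b` after a GPU upgrade.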
Qwen3 (released in April 2025) adds thinking-mode models but has fewer GGUF builds and smaller Ollama coverage than Qwen2.5 as of May 2026. This guide focuses on Qwen2.5, which has the widest hardware support and the most tested quantisations. See best local LLMs 2026 for a broader model comparison.
Hardware Requirements by Model Size
Pick your VRAM tier first, then select the largest Qwen2.5 model that fits. Q4_K_M is the standard quantisation used in all figures below; it gives the best size-to-quality ratio for Ollama and LM Studio.
| Model | VRAM | Minimum GPU | Apple Silicon | Speed (RTX 3060) |
|---|---|---|---|---|
| Qwen2.5 7B Q4_K_M | 5.5 GB | RTX 3060 6 GB, RTX 4060 | M1/M2 8 GB | ~57 tok/s |
| Qwen2.5-Coder 7B Q4_K_M | 5.5 GB | RTX 3060 6 GB, RTX 4060 | M1/M2 8 GB | ~55 tok/s |
| Qwen2-VL 7B Q4_K_M | 6.2 GB | RTX 3060 8 GB, RTX 4060 | M1/M2 16 GB | n/a |
| Qwen2.5 14B Q4_K_M | 9.5 GB | RTX 4070 12 GB | M2 Pro 16 GB | n/a |
| Qwen2.5-Coder 14B Q4_K_M | 9.5 GB | RTX 4070 12 GB | M2 Pro 16 GB | n/a |
| Qwen2.5 32B Q4_K_M | 20.5 GB | RTX 3090 24 GB | M3 Max 48 GB | n/a |
| Qwen2.5-Coder 32B Q4_K_M | 20.5 GB | RTX 3090 24 GB | M3 Max 48 GB | n/a |
| Qwen2.5 72B Q4_K_M | 46 GB | 2× RTX 3090 (48 GB) | M2 Ultra 64 GB | n/a |
VRAM figures are for Q4_K_M GGUF files from the Ollama library. Add 1–2 GB for the KV cache at 4K context. If your GPU has less VRAM than the model needs, Ollama automatically offloads layers to system RAM. This works, but it reduces speed significantly.
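As a rough sketch of the "pick your VRAM tier first" rule, the helper below walks the general-purpose rows of the table from largest to smallest and prints the first tag that fits. The function name `pick_qwen_tag` and the 1 GB default headroom are assumptions for illustration; the VRAM figures are copied from the table.

```shell
# Print the largest general Qwen2.5 tag whose Q4_K_M footprint plus
# KV-cache headroom fits in the given VRAM (both in GB).
# Usage: pick_qwen_tag <vram_gb> [headroom_gb]   (headroom defaults to 1)
pick_qwen_tag() {
  vram_gb="$1"
  headroom_gb="${2:-1}"
  awk -v vram="$vram_gb" -v head="$headroom_gb" 'BEGIN {
    # tag=VRAM pairs from the hardware table, largest first
    n = split("qwen2.5:72b=46 qwen2.5:32b=20.5 qwen2.5:14b=9.5 qwen2.5:7b=5.5", models, " ")
    for (i = 1; i <= n; i++) {
      split(models[i], kv, "=")
      if (kv[2] + head <= vram) { print kv[1]; exit }
    }
    print "none"
  }'
}
```

For example, `pick_qwen_tag 12` prints `qwen2.5:14b`, and `pick_qwen_tag 24 2` prints `qwen2.5:32b`. Raise the second argument if you plan to use long contexts.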
Setting Up with Ollama
Ollama is the fastest path to running any Qwen2.5 model locally: it handles the model download, serves pre-quantised GGUF builds, and exposes a local API at `localhost:11434` without any configuration. Install from ollama.com. If you have not used Ollama before, read how to install Ollama first.
1. Install Ollama
   Why it matters: Available for macOS, Linux (one-line install), and Windows. No GPU drivers to configure: Ollama detects CUDA, ROCm, and Metal automatically.
2. Pull the model with an explicit size tag
   Why it matters: Always specify the size: `qwen2.5:7b`, `qwen2.5:14b`, `qwen2.5:32b`. The untagged `qwen2.5` resolves to the 7B model but may change between Ollama releases.
3. Run the model
   Why it matters: `ollama run qwen2.5:7b` opens an interactive chat. Type your prompt and press Enter. Close with `/bye`.
4. Set context window if needed
   Why it matters: Ollama only allocates the context set by `num_ctx`, which defaults to far less than Qwen2.5 supports. `ollama run` has no `--num-ctx` flag; to raise the limit, start the session with `ollama run qwen2.5:7b` and enter `/set parameter num_ctx 32768` (or a higher value, up to the model's limit). This requires more VRAM: add 2–4 GB for long contexts.
5. Test the API endpoint
   Why it matters: Ollama exposes an OpenAI-compatible API. Applications like PromptQuorum, Continue.dev, and Open WebUI connect directly to `http://localhost:11434/v1`.
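Context size can also be set per request. The sketch below builds a request body for Ollama's native `/api/chat` endpoint using `options.num_ctx`. The helper name `build_chat_payload` is hypothetical, and the prompt is inserted without JSON escaping, so it only suits simple test strings.

```shell
# Build a JSON body for Ollama's native /api/chat endpoint with a custom
# context size. The prompt is NOT JSON-escaped: avoid quotes and backslashes.
# Usage: build_chat_payload <model> <num_ctx> <prompt>
build_chat_payload() {
  model="$1"
  num_ctx="$2"
  prompt="$3"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"options":{"num_ctx":%s},"stream":false}\n' \
    "$model" "$prompt" "$num_ctx"
}

# Send it (requires a running Ollama server):
#   build_chat_payload qwen2.5:7b 8192 "Hello" | curl -s http://localhost:11434/api/chat -d @-
```

Setting the value per request keeps the default small for short chats and only pays the extra VRAM cost when a long document is actually sent.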
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# macOS: download the .dmg from ollama.com or:
brew install ollama
# Pull models: use explicit tags
ollama pull qwen2.5:7b # general 7B (~5.5 GB)
ollama pull qwen2.5:14b # general 14B (~9.5 GB)
ollama pull qwen2.5:32b # general 32B (~20.5 GB)
ollama pull qwen2.5-coder:32b # coding 32B (~20.5 GB)
ollama pull qwen2-vl:7b # vision 7B (~6.2 GB)
# Run interactively
ollama run qwen2.5:7b
# Test the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'
Setting Up with LM Studio
LM Studio provides a graphical interface for Qwen2.5, so no terminal commands are needed. Download from lmstudio.ai, or see how to install LM Studio. It runs on macOS, Windows, and Linux.
1. Open the model browser
   Why it matters: Search "Qwen2.5" or "Qwen Coder" to browse all available GGUF builds. Filter by Q4_K_M for the recommended quality/size ratio.
2. Download a GGUF build
   Why it matters: Select the Q4_K_M variant. LM Studio shows the file size before download. Confirm it matches the VRAM you have available.
3. Load the model and start chatting
   Why it matters: Click the model in the left sidebar to load it into memory. GPU layer allocation is automatic based on detected VRAM.
4. Start the local server
   Why it matters: "Start Server" exposes an OpenAI-compatible endpoint at `localhost:1234`. Your apps and scripts connect to it as if it were the OpenAI API.
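Because both tools speak the same OpenAI-style chat protocol, one script can target either by switching the base URL. This is a minimal sketch assuming the default ports named in this guide; the helper names `chat_url` and `ask` are hypothetical, and the model identifier you pass for LM Studio is whatever name the app shows for the loaded model.

```shell
# Map a backend name to its OpenAI-compatible chat endpoint.
# Ports are the defaults from this guide; adjust them if you changed server settings.
chat_url() {
  case "$1" in
    ollama)   echo "http://localhost:11434/v1/chat/completions" ;;
    lmstudio) echo "http://localhost:1234/v1/chat/completions" ;;
    *)        echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Send one user message to the chosen backend.
# The prompt is inserted as-is, so it must not contain quotes or backslashes.
# Usage: ask <backend> <model> <prompt>
ask() {
  backend="$1"
  model="$2"
  prompt="$3"
  curl -s "$(chat_url "$backend")" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"$prompt\"}]}"
}

# Example (requires the corresponding server to be running):
#   ask ollama qwen2.5:7b "Hello"
```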
Quantization: Which Format to Choose
Q4_K_M is the right default for Qwen2.5 on consumer hardware. It reduces VRAM by ~55–60% versus FP16 with less than 1% benchmark degradation on MMLU and HumanEval. Other formats have specific use cases:
In One Sentence
Q4_K_M is the best Qwen2.5 quantization for most users: it cuts VRAM by 55% with less than 1% quality loss versus FP16.
In Plain Terms
Quantization compresses the model's numbers from 16-bit to roughly 4-bit, cutting the file size and VRAM needed by more than half. Think of it as reducing image quality from TIFF to a high-quality JPEG: a smaller file with a nearly identical result for most uses.
- Q4_K_M (recommended): ~5.5 GB for 7B. Best quality-per-GB ratio. Use this first.
- Q8_0: ~8.5 GB for 7B. Near-FP16 quality; use if you have spare VRAM and want maximum accuracy.
- Q5_K_M: ~6.5 GB for 7B. Marginal improvement over Q4_K_M; only choose it if Q4_K_M output quality is visibly poor for your task.
- Q2_K: ~3 GB for 7B. Smallest file, but Chinese-language output quality degrades noticeably. Avoid it for Qwen2.5 if Chinese text is part of your use case.
- IQ4_XS: ~4.8 GB for 7B. A newer imatrix quantisation that comes close to Q4_K_M quality at a slightly smaller size. It is available in recent llama.cpp releases and LM Studio 0.3+.
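To try one of the other formats in Ollama, you need a quantisation-specific tag rather than the plain size tag. The sketch below assumes the Ollama library publishes such tags in the form `<size>-instruct-<quant>`; treat that pattern as an assumption and confirm the exact tag on the model's library page before pulling. The helper name `quant_tag` is hypothetical, and IQ4_XS is left out because this guide only lists it for llama.cpp and LM Studio.

```shell
# Build a quantisation-specific Ollama tag for the general Qwen2.5 models.
# ASSUMPTION: tags follow "<size>-instruct-<quant>"; verify before pulling.
# Usage: quant_tag <size> <quant>    e.g. quant_tag 7b q8_0
quant_tag() {
  size="$1"
  quant="$2"
  case "$quant" in
    q4_K_M|q5_K_M|q8_0|q2_K) echo "qwen2.5:${size}-instruct-${quant}" ;;
    *) echo "unsupported quantisation: $quant" >&2; return 1 ;;
  esac
}

# Example (needs Ollama installed):
#   ollama pull "$(quant_tag 7b q8_0)"
```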
Benchmark Performance on Consumer Hardware
Qwen2.5 32B Q4_K_M on an RTX 4090 delivers 28 tokens/sec, fast enough for real-time coding assistance. Scores below are for Q4_K_M GGUF builds tested on Ollama. Full-precision FP16 scores are 1–2% higher.
| Model (Q4_K_M) | MMLU | Math | HumanEval | Speed (RTX 3060 12 GB) |
|---|---|---|---|---|
| Qwen2.5 7B | 74.2% | 58.8% | 57.3% | 57 tok/s |
| Qwen2.5 14B | 79.9% | 69.8% | 64.6% | n/a |
| Qwen2.5 32B | 83.3% | 79.5% | 71.3% | n/a |
| Qwen2.5 72B | 86.1% | 83.1% | 73.2% | n/a |
| Qwen2.5-Coder 7B | n/a | n/a | 75.6% | 55 tok/s |
| Qwen2.5-Coder 14B | n/a | n/a | 85.2% | n/a |
| Qwen2.5-Coder 32B | n/a | n/a | 92.7% | n/a |
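To check the speed column on your own hardware, Ollama's native `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in its final object. A minimal sketch of the conversion follows; the helper name `tokens_per_sec` is hypothetical.

```shell
# Convert Ollama's eval_count (tokens) and eval_duration (nanoseconds)
# into generation speed in tokens per second, rounded to one decimal.
# Usage: tokens_per_sec <eval_count> <eval_duration_ns>
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example: 570 tokens generated in 10 seconds
#   tokens_per_sec 570 10000000000
```

For a quick check without the API, `ollama run qwen2.5:7b --verbose` prints a comparable eval rate after each reply.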
Qwen vs DeepSeek vs Llama: Which to Run Locally
Qwen2.5 wins on Chinese-language tasks and VRAM efficiency; DeepSeek-V2.5 wins on reasoning at large scale but is impractical on consumer hardware; Llama 3.3 70B is the best single-GPU option if you prefer Meta's open model. The table below compares the practical options at each VRAM tier.
| VRAM Tier | Best Qwen | Best Competitor | Verdict |
|---|---|---|---|
| 6 GB | Qwen2.5 7B | Llama 3.2 3B (fits, but 3B) | Qwen2.5 7B wins: same VRAM tier, much larger model |
| 12 GB | Qwen2.5-Coder 14B | Llama 3.1 8B Instruct | Qwen2.5-Coder 14B for coding; Llama 3.1 8B for general chat |
| 24 GB | Qwen2.5-Coder 32B | Llama 3.3 70B (offloaded) | Qwen2.5-Coder 32B for code; Llama 3.3 70B if quality > speed |
| 48 GB+ | Qwen2.5 72B | DeepSeek-V2.5 236B MoE | DeepSeek needs ~130 GB RAM; Qwen2.5 72B is the practical 48 GB choice |
Chinese Users: Data Sovereignty and Local Deployment
Running Qwen2.5 locally means zero data transfer outside your machine, so there is no compliance exposure under China's Data Security Law (DSL) or the Cybersecurity Law. Cloud-based LLM APIs require sending prompts to foreign servers, which creates cross-border data transfer risk under DSL Article 31.
Qwen2.5 is trained by Alibaba's Qwen team on a predominantly Chinese and multilingual corpus. This makes it the strongest locally-deployable model for Simplified Chinese, Traditional Chinese, Classical Chinese, and mixed-language (Chinese/English) documents.
For enterprise deployments in China: air-gapped Qwen2.5 setups (no internet at inference time) are fully compliant with CAC regulations on generative AI. The model runs entirely on local compute; the regulator's concern is training data and output moderation, not inference on offline hardware. See running AI fully offline for a complete air-gapped setup guide.
In One Sentence
Qwen2.5 runs completely offline after download: no data leaves your machine, eliminating cross-border data transfer risk under China's Data Security Law.
In Plain Terms
When you run Qwen2.5 locally, your prompts and documents never leave your computer. There is no cloud API call, no foreign server, and no data that regulators can intercept or audit.
Hardware Picks by Budget
RTX 3060 12 GB is the best entry point for Qwen2.5 7B and Qwen2.5-Coder 7B at under €300. For 14B models, the RTX 4070 12 GB adds 35% speed at ~€400 new. Below are the hardware options used and tested for this guide.
- Budget (Qwen2.5 7B): NVIDIA RTX 4060 8 GB or RTX 3060 12 GB. Both handle 7B models at 50–57 tokens/sec. The RTX 3060 12 GB is often cheaper second-hand and has more VRAM headroom.
- Mid-range (Qwen2.5 14B): RTX 4070 12 GB or RTX 4070 Super 12 GB. The 4070 Super runs Qwen2.5-Coder 14B at 38–42 tokens/sec and fits 14B models with 2–3 GB of VRAM to spare for context.
- High-end (Qwen2.5 32B): RTX 4090 24 GB or RTX 3090 24 GB. The 4090 delivers 27–28 tok/s on Qwen2.5-Coder 32B, which is real-time coding speed. The 3090 is significantly cheaper used and performs within 15% of the 4090 on inference.
- Apple Silicon (all sizes): Mac mini M4 Pro 48 GB is the best value for running Qwen2.5 32B (~22 tok/s) with low noise and power consumption. M2 Ultra 192 GB handles Qwen2.5 72B.
- Mini PC for always-on use: MINISFORUM UM890 Pro or similar AMD Ryzen AI PC. Runs Qwen2.5 7B on CPU+iGPU at ~8–12 tok/s, which is slow but 24/7 capable with under 35W power draw.
Common Mistakes Running Qwen2.5 Locally
- Using an untagged `ollama pull qwen2.5` command. Without an explicit size tag (`:7b`, `:14b`, etc.), Ollama may resolve to a default size that changes between library updates. Always use explicit tags: `ollama pull qwen2.5:14b`.
- Ignoring the context window size. Qwen2.5 supports 128K context, but Ollama's default `num_ctx` is only a few thousand tokens (2K in older releases). If you're processing long documents, raise `num_ctx` to 8192 or higher, for example with `/set parameter num_ctx 8192` inside an `ollama run` session. Otherwise the model silently truncates input.
- Choosing Q2_K quantization for Chinese-language use. At 2-bit precision, Qwen2.5's Chinese output becomes noticeably degraded and character substitutions increase. Use Q4_K_M as the minimum for any Chinese-language work.
- Running the 32B model with too little VRAM. If your GPU has 16 GB and the model needs 20.5 GB, Ollama offloads layers to system RAM. The model runs, but at 3–5 tok/s, which is unusable for interactive use. Check the hardware table above and pick a model that fits your VRAM.
- Using the wrong sub-family for coding. Qwen2.5 7B (general) scores 57.3% on HumanEval. Qwen2.5-Coder 7B scores 75.6% on the same benchmark, a 32% relative improvement. If your use case is code, always use the Coder variant of the same size.
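For the context-window mistake above, one way to make a larger window persist across sessions is a Modelfile that layers `PARAMETER num_ctx` on top of an existing tag. This is a minimal sketch: the helper name `write_modelfile` and the derived model name `qwen2.5-7b-8k` are hypothetical.

```shell
# Write a Modelfile that pins a larger context window onto an existing tag.
# Usage: write_modelfile <base_tag> <num_ctx> [output_path]
write_modelfile() {
  base="$1"
  num_ctx="$2"
  out="${3:-Modelfile}"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$num_ctx" > "$out"
}

# Example (needs Ollama installed; the new model name is arbitrary):
#   write_modelfile qwen2.5:7b 8192 Modelfile
#   ollama create qwen2.5-7b-8k -f Modelfile
#   ollama run qwen2.5-7b-8k
```

The derived model reuses the weights already on disk, so creating it costs almost no extra storage.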
Frequently Asked Questions
How much VRAM do I need to run Qwen2.5 7B locally?
Qwen2.5 7B Q4_K_M requires 5.5 GB of VRAM. An RTX 3060 6 GB, RTX 4060, or Apple M-series chip with 8 GB of unified memory all run it. At 8 GB VRAM you also have headroom for a longer context.
What is the best Qwen model for coding locally?
Qwen2.5-Coder 32B is the best locally runnable coding model: it scores 92.7% on HumanEval and needs a 24 GB GPU (RTX 3090 or RTX 4090). If your VRAM is 12 GB or less, use Qwen2.5-Coder 14B (HumanEval 85.2%, 9.5 GB VRAM).
How does Qwen compare to DeepSeek for local deployment?
Qwen2.5 72B and DeepSeek-V2.5 are competitive on general tasks, but Qwen uses a dense architecture that fits on consumer hardware. DeepSeek-V2.5 is a 236B MoE model that requires ~130 GB RAM at Q4, which is out of reach without server-grade hardware. For VRAM under 48 GB, Qwen2.5 is the practical choice.
Can I run Qwen on a Mac?
Yes. Apple Silicon uses unified memory: an M2 Pro 32 GB runs Qwen2.5 14B at ~32 tok/s. An M3 Max 64 GB handles Qwen2.5 32B at ~22 tok/s. Use the Ollama macOS app or LM Studio for the simplest setup.
What Ollama command do I use for Qwen2.5?
Use `ollama pull qwen2.5:7b` for 7B, `ollama pull qwen2.5:14b` for 14B, `ollama pull qwen2.5:32b` for 32B, or `ollama pull qwen2.5-coder:32b` for the coding variant. Always use explicit size tags.
Is Qwen good for Chinese-language tasks?
Qwen2.5 was pretrained on a large Chinese corpus and natively supports Simplified Chinese, Traditional Chinese, Japanese, Korean, Arabic, and 24 more languages. It consistently outperforms Llama 3.3 and Mistral on Chinese reading comprehension and generation.
What quantization should I use for Qwen2.5?
Q4_K_M is the recommended default: it cuts VRAM by ~55% versus FP16 with less than 1% quality loss on benchmarks. Use Q8_0 if you have spare VRAM and want near-FP16 quality. Avoid Q2_K for Chinese-language use.
Does Qwen2-VL work for Chinese document OCR?
Yes. Qwen2-VL 7B is the strongest local vision model for CJK document OCR. It runs in ~6 GB VRAM via `ollama pull qwen2-vl:7b` and reads Chinese, Japanese, and Korean text at up to 4096×4096 resolution. See the full guide at /local-llms/run-qwen-vl-locally-2026.
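As an illustration of what an OCR request might look like, the sketch below builds a body for Ollama's native `/api/chat` endpoint, which accepts base64-encoded images in a message's `images` array. It assumes the `qwen2-vl:7b` tag named above is available in your Ollama install; the helper name `build_vision_payload` is hypothetical, and the prompt is inserted without JSON escaping.

```shell
# Build an /api/chat request body that attaches one image as base64.
# The prompt is NOT JSON-escaped: avoid quotes and backslashes.
# Usage: build_vision_payload <model> <image_path> <prompt>
build_vision_payload() {
  model="$1"
  image_path="$2"
  prompt="$3"
  # Strip newlines so the encoded image is a single JSON string
  image_b64="$(base64 < "$image_path" | tr -d '\n')"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s","images":["%s"]}],"stream":false}\n' \
    "$model" "$prompt" "$image_b64"
}

# Example (requires a running Ollama server; scan.png is a placeholder file name):
#   build_vision_payload qwen2-vl:7b scan.png "Extract all text from this page" \
#     | curl -s http://localhost:11434/api/chat -d @-
```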