Key Takeaways
- Qwen3 7B runs in 5.5 GB of VRAM โ one `ollama pull qwen2.5:7b` command and you're running at 57 tokens/sec on an RTX 3060.
- Three distinct sub-families: Qwen3 (general), Qwen3-Coder (coding, 92.7% HumanEval at 32B), Qwen2-VL (vision, best CJK OCR locally).
- Dense architecture = consumer-friendly: unlike DeepSeek's 236B MoE model (needs ~130 GB RAM), Qwen3 72B fits in 46 GB VRAM on two RTX 3090s.
- Native multilingual: pretrained on Chinese, Japanese, Korean, Arabic, German, French, and 23 more โ Qwen3 consistently beats Llama 3.3 on CJK tasks.
- Q4_K_M is the right quantization for most users: ~55% VRAM reduction, less than 1% quality loss on benchmarks.
- Hardware decision: 12 GB VRAM โ 14B model; 24 GB VRAM โ 32B; 48 GB+ (two GPUs or Apple Silicon 64 GB) โ 72B.
Qwen3 covers three local-deployment sub-families โ general (7Bโ72B), coding (Coder 7Bโ32B), and vision (VL 7Bโ72B) โ all runnable via Ollama or LM Studio.
Running a model locally means the AI runs on your own computer instead of a cloud server. No data leaves your machine, and there is no per-token cost after hardware.
Qwen3 Model Family Overview
The Qwen3 family covers three distinct tasks: general reasoning, coding, and vision โ each with multiple size options from 7B to 72B parameters. All are open-weight models published by Alibaba's Qwen team on Hugging Face under the Apache 2.0 licence.
Choose the sub-family first, then the size that fits your VRAM. Mixing sub-families is common: run Qwen3-Coder 14B for code completion and Qwen3 7B for document summarisation.
| Sub-family | Sizes available | Primary use | Ollama tag prefix |
|---|---|---|---|
| Qwen3 | 7B, 14B, 32B, 72B | General reasoning, Chinese/multilingual tasks, RAG | qwen2.5: |
| Qwen3-Coder | 7B, 14B, 32B | Code generation, debugging, HumanEval, SWE-bench | qwen2.5-coder: |
| Qwen2-VL | 2B, 7B, 72B | Document OCR, image Q&A, CJK text extraction | qwen2-vl: |
Qwen3 (released Q1 2026) adds thinking-mode models but has fewer GGUF builds and smaller Ollama coverage than Qwen3 as of May 2026. This guide focuses on Qwen3, which has the widest hardware support and the most tested quantisations. See best local LLMs 2026 for a broader model comparison.
Hardware Requirements by Model Size
Pick your VRAM tier first, then select the largest Qwen3 model that fits. Q4_K_M is the standard quantisation used in all figures below โ it gives the best size-to-quality ratio for Ollama and LM Studio.
| Model | VRAM | Minimum GPU | Apple Silicon | Speed (RTX 3060) |
|---|---|---|---|---|
| Qwen3 7B Q4_K_M | 5.5 GB | RTX 3060 6 GB, RTX 4060 | M1/M2 8 GB | ~57 tok/s |
| Qwen3-Coder 7B Q4_K_M | 5.5 GB | RTX 3060 6 GB, RTX 4060 | M1/M2 8 GB | ~55 tok/s |
| Qwen2-VL 7B Q4_K_M | 6.2 GB | RTX 3060 8 GB, RTX 4060 | M1/M2 16 GB | โ |
| Qwen3 14B Q4_K_M | 9.5 GB | RTX 4070 12 GB | M2 Pro 16 GB | โ |
| Qwen3-Coder 14B Q4_K_M | 9.5 GB | RTX 4070 12 GB | M2 Pro 16 GB | โ |
| Qwen3 32B Q4_K_M | 20.5 GB | RTX 3090 24 GB | M3 Max 48 GB | โ |
| Qwen3-Coder 32B Q4_K_M | 20.5 GB | RTX 3090 24 GB | M3 Max 48 GB | โ |
| Qwen3 72B Q4_K_M | 46 GB | 2ร RTX 3090 (48 GB) | M2 Ultra 64 GB | โ |
VRAM figures are for Q4_K_M GGUF files from the Ollama library. Add 1โ2 GB for the KV cache at 4K context. If your GPU has less VRAM than the model needs, Ollama automatically offloads layers to system RAM โ this works but reduces speed significantly.
Setting Up with Ollama
Ollama is the fastest path to running any Qwen3 model locally โ it handles model download, GGUF quantisation, and the local API at `localhost:11434` without any configuration. Install from ollama.com. If you have not used Ollama before, read how to install Ollama first.
- 1Install Ollama
Why it matters: Available for macOS, Linux (one-line install), and Windows. No GPU drivers to configure โ Ollama detects CUDA, ROCm, and Metal automatically. - 2Pull the model with an explicit size tag
Why it matters: Always specify the size: `qwen2.5:7b`, `qwen2.5:14b`, `qwen2.5:32b`. The untagged `qwen2.5` resolves to the 7B model but may change between Ollama releases. - 3Run the model
Why it matters: `ollama run qwen2.5:7b` opens an interactive chat. Type your prompt and press Enter. Close with `/bye`. - 4Set context window if needed
Why it matters: Qwen3 supports 32K context by default in Ollama. To use 128K context on a 7B model, run `ollama run qwen2.5:7b --num-ctx 131072`. This requires more VRAM โ add 2โ4 GB for long contexts. - 5Test the API endpoint
Why it matters: Ollama exposes an OpenAI-compatible API. Applications like PromptQuorum, Continue.dev, and Open WebUI connect directly to `http://localhost:11434/v1`.
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# macOS: download the .dmg from ollama.com or:
brew install ollama
# Pull models โ use explicit tags
ollama pull qwen2.5:7b # general 7B (~5.5 GB)
ollama pull qwen2.5:14b # general 14B (~9.5 GB)
ollama pull qwen2.5:32b # general 32B (~20.5 GB)
ollama pull qwen2.5-coder:32b # coding 32B (~20.5 GB)
ollama pull qwen2-vl:7b # vision 7B (~6.2 GB)
# Run interactively
ollama run qwen2.5:7b
# Test the OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'Setting Up with LM Studio
LM Studio provides a GUI interface for Qwen3 with no terminal commands. Download from lmstudio.ai, or see how to install LM Studio. It runs on macOS, Windows, and Linux.
- 1Open the model browser
Why it matters: Search "Qwen3" or "Qwen Coder" to browse all available GGUF builds. Filter by Q4_K_M for the recommended quality/size ratio. - 2Download a GGUF build
Why it matters: Select the Q4_K_M variant. LM Studio shows file size before download โ confirm it matches the VRAM you have available. - 3Load the model and start chatting
Why it matters: Click the model in the left sidebar to load it into memory. GPU layer allocation is automatic based on detected VRAM. - 4Start the local server
Why it matters: "Start Server" exposes an OpenAI-compatible endpoint at `localhost:1234`. Your apps and scripts connect to it as if it were the OpenAI API.
Quantization: Which Format to Choose
Q4_K_M is the right default for Qwen3 on consumer hardware. It reduces VRAM by ~55โ60% versus FP16 with less than 1% benchmark degradation on MMLU and HumanEval. Other formats have specific use cases:
Q4_K_M is the best Qwen3 quantization for most users: it cuts VRAM by 55% with less than 1% quality loss versus FP16.
Quantization compresses the model's numbers from 16-bit to 4-bit, roughly halving the file size and VRAM needed. Think of it as reducing image quality from TIFF to a high-quality JPEG โ smaller file, nearly identical result for most uses.
- Q4_K_M (recommended): ~5.5 GB for 7B. Best quality-per-GB ratio. Use this first.
- Q8_0: ~8.5 GB for 7B. Near-FP16 quality; use if you have spare VRAM and want maximum accuracy.
- Q5_K_M: ~6.5 GB for 7B. Marginal improvement over Q4_K_M โ only choose it if Q4_K_M output quality is visibly poor for your task.
- Q2_K: ~3 GB for 7B. Smallest file, but Chinese-language output quality degrades noticeably โ avoid for Qwen3 if Chinese text is part of your use case.
- IQ4_XS: ~4.8 GB for 7B. A newer imatrix quantisation that beats Q4_K_M quality at slightly smaller size โ available in recent llama.cpp releases and LM Studio 0.3+.
Benchmark Performance on Consumer Hardware
Qwen3 32B Q4_K_M on an RTX 4090 delivers 28 tokens/sec โ fast enough for real-time coding assistance. Scores below are for Q4_K_M GGUF builds tested on Ollama. Full-precision FP16 scores are 1โ2% higher.
| Model (Q4_K_M) | MMLU | Math | HumanEval | Speed (RTX 3060 12 GB) |
|---|---|---|---|---|
| Qwen3 7B | 74.2% | 58.8% | 57.3% | 57 tok/s |
| Qwen3 14B | 79.9% | 69.8% | 64.6% | โ |
| Qwen3 32B | 83.3% | 79.5% | 71.3% | โ |
| Qwen3 72B | 86.1% | 83.1% | 73.2% | โ |
| Qwen3-Coder 7B | โ | โ | 75.6% | 55 tok/s |
| Qwen3-Coder 14B | โ | โ | 85.2% | โ |
| Qwen3-Coder 32B | โ | โ | 92.7% | โ |
Qwen vs DeepSeek vs Llama: Which to Run Locally
Qwen3 wins on Chinese-language tasks and VRAM efficiency; DeepSeek-V2.5 wins on reasoning at large scale but is impractical on consumer hardware; Llama 3.3 70B is the best single-GPU option if you prefer Meta's open model. The table below compares the practical options at each VRAM tier.
| VRAM Tier | Best Qwen | Best Competitor | Verdict |
|---|---|---|---|
| 6 GB | Qwen3 7B | Llama 3.2 3B (fits, but 3B) | Qwen3 7B wins โ same VRAM, much larger model |
| 12 GB | Qwen3-Coder 14B | Llama 3.3 8B Instruct | Qwen3-Coder 14B for coding; Llama 3.3 8B for general chat |
| 24 GB | Qwen3-Coder 32B | Llama 3.3 70B (offloaded) | Qwen3-Coder 32B for code; Llama 3.3 70B if quality > speed |
| 48 GB+ | Qwen3 72B | DeepSeek-V2.5 236B MoE | DeepSeek needs ~130 GB RAM; Qwen3 72B is the practical 48 GB choice |
Chinese Users: Data Sovereignty and Local Deployment
Running Qwen3 locally means zero data transfer outside your machine โ no compliance exposure under China's Data Security Law (DSL) or the Cybersecurity Law. Cloud-based LLM APIs require sending prompts to foreign servers, which creates cross-border data transfer risk under DSL Article 31.
Qwen3 is trained by Alibaba's Qwen team on a predominantly Chinese and multilingual corpus. This makes it the strongest locally-deployable model for Simplified Chinese, Traditional Chinese, Classical Chinese, and mixed-language (Chinese/English) documents.
For enterprise deployments in China: air-gapped Qwen3 setups (no internet at inference time) are fully compliant with CAC regulations on generative AI. The model runs entirely on local compute โ the regulator's concern is training data and output moderation, not inference on offline hardware. See running AI fully offline for a complete air-gapped setup guide.
Qwen3 runs completely offline after download โ no data leaves your machine, eliminating cross-border data transfer risk under China's Data Security Law.
When you run Qwen3 locally, your prompts and documents never leave your computer. There is no cloud API call, no foreign server, and no data that regulators can intercept or audit.
Hardware Picks by Budget
RTX 3060 12 GB is the best entry point for Qwen3 7B and Qwen3-Coder 7B at under โฌ300. For 14B models, the RTX 4070 12 GB adds 35% speed at ~โฌ400 new. Below are the hardware options used and tested for this guide.
- Budget (Qwen3 7B): NVIDIA RTX 4060 8 GB or RTX 3060 12 GB. Both handle 7B models at 50โ57 tokens/sec. The RTX 3060 12 GB is often cheaper second-hand and has more VRAM headroom.
- Mid-range (Qwen3 14B): RTX 4070 12 GB or RTX 4070 Super 12 GB. The 4070 Super runs Qwen3-Coder 14B at 38โ42 tokens/sec and fits 14B models with 2โ3 GB of VRAM to spare for context.
- High-end (Qwen3 32B): RTX 4090 24 GB or RTX 3090 24 GB. The 4090 delivers 27โ28 tok/s on Qwen3-Coder 32B โ real-time coding speed. The 3090 is significantly cheaper used and performs within 15% of the 4090 on inference.
- Apple Silicon (all sizes): Mac mini M4 Pro 48 GB is the best value for running Qwen3 32B (~22 tok/s) with low noise and power consumption. M2 Ultra 192 GB handles Qwen3 72B.
- Mini PC for always-on use: MINISFORUM UM890 Pro or similar AMD Ryzen AI PC. Runs Qwen3 7B on CPU+iGPU at ~8โ12 tok/s โ slow but 24/7 capable with under 35W power draw.
Common Mistakes Running Qwen3 Locally
- Using an untagged `ollama pull qwen2.5` command. Without an explicit size tag (`:7b`, `:14b`, etc.), Ollama may resolve to a default size that changes between library updates. Always use explicit tags: `ollama pull qwen2.5:14b`.
- Ignoring the context window size. Qwen3 supports 128K context, but Ollama defaults to 2K at `num_ctx`. If you're processing long documents, add `--num-ctx 8192` (or higher) to the run command โ otherwise the model silently truncates input.
- Choosing Q2_K quantization for Chinese-language use. At 2-bit precision, Qwen3's Chinese output becomes noticeably degraded โ character substitutions increase. Use Q4_K_M as the minimum for any Chinese-language work.
- Running the 32B model with too little VRAM. If your GPU has 16 GB and the model needs 20.5 GB, Ollama offloads layers to system RAM. The model runs but at 3โ5 tok/s โ unusable for interactive use. Check the hardware table above and pick a model that fits your VRAM.
- Using the wrong sub-family for coding. Qwen3 7B (general) scores 57.3% on HumanEval. Qwen3-Coder 7B scores 75.6% on the same benchmark โ a 32% relative improvement. If your use case is code, always use the Coder variant of the same size.
Frequently Asked Questions
How much VRAM do I need to run Qwen3 7B locally?
Qwen3 7B Q4_K_M requires 5.5 GB of VRAM. An RTX 3060 6 GB, RTX 4060, or Apple M-series chip with 8 GB of unified memory all run it. At 8 GB VRAM you have headroom for context and system RAM.
What is the best Qwen model for coding locally?
Qwen3-Coder 32B is the best locally runnable coding model โ it scores 92.7% on HumanEval and needs a 24 GB GPU (RTX 3090 or RTX 4090). If your VRAM is 12 GB or less, use Qwen3-Coder 14B (HumanEval 85.2%, 9.5 GB VRAM).
How does Qwen compare to DeepSeek for local deployment?
Qwen3 72B and DeepSeek-V2.5 are competitive on general tasks, but Qwen uses a dense architecture that fits on consumer hardware. DeepSeek-V2.5 is a 236B MoE model โ it requires ~130 GB RAM at Q4, unreachable without server-grade hardware. For VRAM under 48 GB, Qwen3 is the practical choice.
Can I run Qwen on a Mac?
Yes. Apple Silicon uses unified memory โ an M2 Pro 32 GB runs Qwen3 14B at ~32 tok/s. An M3 Max 64 GB handles Qwen3 32B at ~22 tok/s. Use the Ollama macOS app or LM Studio for the simplest setup.
What Ollama command do I use for Qwen3?
Use `ollama pull qwen2.5:7b` for 7B, `ollama pull qwen2.5:14b` for 14B, `ollama pull qwen2.5:32b` for 32B, or `ollama pull qwen2.5-coder:32b` for the coding variant. Always use explicit size tags.
Is Qwen good for Chinese-language tasks?
Qwen3 was pretrained on a large Chinese corpus and natively supports Simplified Chinese, Traditional Chinese, Japanese, Korean, Arabic, and 24 more languages. It consistently outperforms Llama 3.3 and Mistral on Chinese reading comprehension and generation.
What quantization should I use for Qwen3?
Q4_K_M is the recommended default โ it cuts VRAM by ~55% versus FP16 with less than 1% quality loss on benchmarks. Use Q8_0 if you have spare VRAM and want near-FP16 quality. Avoid Q2_K for Chinese-language use.
Does Qwen2-VL work for Chinese document OCR?
Yes โ Qwen2-VL 7B is the strongest local vision model for CJK document OCR. It runs in ~6 GB VRAM via `ollama pull qwen2-vl:7b` and reads Chinese, Japanese, and Korean text at up to 4096ร4096 resolution. See the full guide at /local-llms/run-qwen-vl-locally-2026.