Key Takeaways
- Qwen3 7B and 14B are consumer GPU targets β 8 GB and 16 GB VRAM respectively, running via Ollama in Docker
- Qwen3 32B needs an RTX 4090 24 GB; this is the largest single-card production deployment for most teams
- Qwen3 72B requires dual RTX 4090s, a high-RAM CPU build (128+ GB DDR5), or cloud rental β self-hosting costs ~$0.05β0.12/day amortized
- A Docker Compose stack with Ollama + Open WebUI + Nginx exposes an OpenAI-compatible API in under 10 minutes
- Always-on Qwen servers: Minisforum UM890 Pro ($429, Qwen3 7B on CPU) or AOOSTAR GEM12 Pro OCuLink + RTX 4060 Ti 16 GB (~$800 total)
- Cloud fallback: RunPod A40 48 GB at $0.44/hr handles Qwen3 72B β cheaper than buying dual RTX 4090s for occasional use
- This guide covers production deployment; for basic Ollama setup see the Qwen beginner guide
π In One Sentence
Deploy Qwen models in production using a Docker Compose stack that runs Ollama as the inference backend and exposes an OpenAI-compatible API endpoint.
π¬ In Plain Terms
Instead of running Qwen manually each time, Docker lets you set up a permanent server that stays on and accepts requests β just like using the ChatGPT API, but on your own hardware at no per-token cost.
Qwen Model Performance by Hardware β May 2026
Choose your hardware based on model size, not GPU brand. VRAM is the hard constraint: if the model does not fit, it will not run at GPU speed. The table below shows measured inference speeds at Q4_K_M quantization (the best quality-to-size ratio for Ollama deployments).
| Model | VRAM (Q4_K_M) | Min GPU | Speed (tok/s) | CPU fallback | Production-ready? |
|---|---|---|---|---|---|
| β | β | β | β | β | β |
| β | β | β | β | β | β |
| β | β | β | β | β | β |
| β | β | β | β | β | β |
| β | β | β | β | β | β |
Speeds measured on PCIe Gen 4 systems. NVLink improves dual-GPU throughput by ~15% on supported cards. Qwen3 72B at Q4_K_M with single A100 80 GB on RunPod: 18β22 tok/s.
Docker API Server Setup β Ollama + Open WebUI + Nginx
The fastest production Qwen stack is three containers: Ollama (inference), Open WebUI (UI), and Nginx (reverse proxy + auth). This setup takes under 10 minutes and exposes a permanent OpenAI-compatible API at http://your-server:11434/v1.
- 1Install Docker and Docker Compose
Why it matters: Containers keep Qwen isolated from your OS β no Python environment conflicts, easy updates. - 2Create docker-compose.yml with Ollama + Open WebUI services
Why it matters: The compose file manages GPU passthrough, port mapping, and restart policies in one place. - 3Set OLLAMA_HOST=0.0.0.0 in the Ollama container environment
Why it matters: Without this, Ollama only listens on localhost and will not accept API requests from other containers or hosts. - 4Pull your Qwen model: docker exec ollama ollama pull qwen3:7b
Why it matters: Models are stored in a Docker volume so they persist across container restarts. - 5Add Nginx as API gateway with basic auth for public-facing deployments
Why it matters: Exposing Ollama directly to the internet without auth allows anyone to run inference on your GPU. - 6Set container restart policy to unless-stopped
Why it matters: This ensures your Qwen server survives system reboots β critical for always-on mini PC deployments.
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_KEEP_ALIVE=-1
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open_webui_data:/app/backend/data
depends_on:
- ollama
volumes:
ollama_data:
open_webui_data:Multi-GPU Configuration for Qwen3 72B
Qwen3 72B at Q4_K_M requires 43.5 GB VRAM β one RTX 4090 (24 GB) is not enough. You need dual RTX 4090s (48 GB combined) or a single professional card (A100 80 GB, H100 80 GB). Ollama handles multi-GPU splitting natively; no code changes are needed.
- Ollama automatically splits the model across all available GPUs β set CUDA_VISIBLE_DEVICES=0,1 in the compose environment to target specific cards
- For dual RTX 4090s, both must be in the same PCIe bandwidth tier β a B650 or Z790 board with two PCIe Gen 4 x8 slots is the minimum
- NVLink between two RTX 4090s is not officially supported by NVIDIA on consumer cards but works on RTX 4090 Founders Edition pairs via third-party NVLink bridges β adds ~15% throughput
- vLLM is an alternative inference engine that uses tensor parallelism for more efficient multi-GPU utilization β use vLLM instead of Ollama for sustained 70B inference loads above 100 concurrent requests
- For occasional Qwen3 72B use, RunPod A40 48 GB at $0.44/hr is cheaper than a dual-RTX-4090 build ($3,800+)
# vLLM multi-GPU alternative (better for high-traffic 72B)
docker run --gpus all \
-p 8000:8000 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-72B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--quantization awqProduction API Configuration
The Ollama API is OpenAI-compatible at /v1 β any application that calls the ChatGPT API works with your local Qwen deployment by changing one base URL. Key environment variables that affect production behavior:
- OLLAMA_KEEP_ALIVE=-1 β prevents the model from unloading after inactivity (default is 5 minutes, fatal for server deployments)
- OLLAMA_NUM_PARALLEL=4 β allows up to 4 concurrent inference requests; raise this if you have multiple VRAM GB headroom
- OLLAMA_MAX_LOADED_MODELS=1 β keep only one model in VRAM at a time on small GPU builds to prevent thrashing
- OLLAMA_FLASH_ATTENTION=1 β enables flash attention for 20β30% speed improvement on NVIDIA Ampere/Ada GPUs (RTX 3060 and newer)
- OLLAMA_GPU_OVERHEAD=512 β reserve 512 MB VRAM for OS and driver overhead; reduces OOM crashes on cards with exactly 8 or 16 GB
β οΈWarning: OLLAMA_KEEP_ALIVE=0 or omitting it causes the model to unload after each request. Your first request after a gap takes 10β30 seconds to reload. Always set OLLAMA_KEEP_ALIVE=-1 for API server deployments.
Cost Comparison: Self-Hosted vs Alibaba Cloud vs RunPod
Self-hosting beats cloud for sustained inference loads above 4 hours per day. Below 4 hours per day, cloud GPU rental is cheaper after hardware amortization. The table below uses a 3-year hardware amortization for self-hosted builds.
| Option | Qwen3 7B cost/day | Qwen3 72B cost/day | Upfront cost | Best for |
|---|---|---|---|---|
| β | β | β | β | β |
| β | β | β | β | β |
| β | β | β | β | β |
| β | β | β | β | β |
| β | β | β | β | β |
| β | β | β | β | β |
Always-On Qwen Server Hardware Recommendations
A mini PC running Qwen3 7B as a 24/7 API server costs $0.50β1.50/month in electricity β far cheaper than any cloud alternative. Two mini PC builds cover most always-on Qwen use cases:
- Budget (Qwen3 7B CPU inference): Minisforum UM890 Pro β AMD Ryzen 9 8945HS, 32 GB DDR5, 512 GB NVMe. ~$429 new. Qwen3 7B runs via Ollama CPU backend at 3β5 tok/s. Adequate for personal assistants and document summarization. 12W idle, 45W peak. Very quiet. Ships from US/EU warehouse.
- Recommended (Qwen3 14B GPU): AOOSTAR GEM12 Pro OCuLink β supports external GPU via OCuLink port. Pair with an RTX 4060 Ti 16 GB eGPU enclosure (~$340 GPU + $100 enclosure). Total ~$800. Runs Qwen3 14B at 16β18 tok/s. Significantly better than CPU fallback for interactive use.
- Power user (Qwen3 32B): Compact ATX desktop with RTX 4090 β examples: Fractal Node 804 case ($90), RTX 4090 (~$1,900 current market), Ryzen 9 7950X (~$600), 64 GB DDR5 (~$180). Total ~$2,800. Runs Qwen3 32B at 10β14 tok/s indefinitely.
Verdict: Which Deployment for Which Model Size
Choose your Qwen deployment path based on model size and daily usage hours β not on what hardware looks impressive.
Qwen Deployment Decision
Use a local LLM if:
- β’Qwen3 7B or 14B and you use it 4+ hours/day β buy a mini PC or GPU; cloud is more expensive
- β’You need < 80 ms latency for interactive coding or document workflows
- β’You are processing private data that must not leave your network
- β’You already have a desktop GPU with 12+ GB VRAM sitting idle
Use a cloud model if:
- β’Qwen3 72B for occasional use (< 4 hours/day) β RunPod A40 48 GB at $0.44/hr is far cheaper than a dual-GPU build
- β’You need to test Qwen3 72B before committing to a hardware purchase
- β’Your usage is bursty and unpredictable β cloud scales to zero when idle
- β’You are outside the US/EU and shipping costs or import duties make hardware expensive
Quick decision:
- βQwen3 7B daily: Minisforum UM890 Pro ($429)
- βQwen3 14B daily: AOOSTAR + RTX 4060 Ti (~$800)
- βQwen3 32B daily: compact ATX + RTX 4090 (~$2,800)
- βQwen3 72B occasional: RunPod A40 48 GB ($0.44/hr)
Related Guides
- Basic Qwen Ollama setup (beginner): /power-local-llm/run-qwen-locally-guide-2026
- GPU buying guide for local LLMs: /power-local-llm/best-gpu-buying-guide-local-llm-2026
- NAS storage for model files: /power-local-llm/best-nas-storage-local-ai-models-2026
- Cloud GPU comparison (Western providers): /power-local-llm/cloud-gpu-rental-guide-2026
Frequently Asked Questions
Can I run Qwen3 72B on a single RTX 4090?
No. Qwen3 72B at Q4_K_M quantization requires 43.5 GB VRAM. A single RTX 4090 has 24 GB. You need dual RTX 4090s (48 GB combined), an A100 80 GB, or cloud GPU rental. A single RTX 4090 can run Qwen3 32B at Q4_K_M (20.1 GB) with headroom.
What is the difference between Ollama and vLLM for production Qwen deployment?
Ollama is simpler to set up and handles multi-GPU splitting automatically β best for personal servers and teams with under 20 concurrent users. vLLM uses tensor parallelism and continuous batching, making it 2β4Γ more efficient under concurrent load β best for 100+ requests per hour or production APIs serving many users.
Does Ollama support multi-GPU inference for Qwen natively?
Yes, since Ollama 0.3.0 (2025). Set CUDA_VISIBLE_DEVICES=0,1 to specify which GPUs to use. Ollama splits the model automatically. For Qwen3 72B on dual RTX 4090, expect 5β8 tok/s β lower than a single A100 80 GB because the model must split across PCIe rather than NVLink in consumer configurations.
Is Alibaba Cloud cheaper than RunPod for Qwen inference?
Alibaba Cloud PAI costs $0.50β2.00/hr depending on GPU tier and region. RunPod A40 48 GB costs $0.44/hr. For Qwen specifically, Alibaba Cloud offers pre-configured Qwen inference environments with Qwen-optimized runtimes that can be 20β30% faster than generic Ollama β worth testing if you are already in the Alibaba Cloud ecosystem. For pure cost, RunPod spot instances are cheaper.
How much power does an always-on Qwen server use?
A Minisforum UM890 Pro running Qwen3 7B on CPU draws 12W idle and 45W under load. At US average electricity rates ($0.16/kWh), running it 24/7 costs ~$0.70β1.80/month. An RTX 4060 Ti 16 GB under load draws 165W β add mini PC idle (~25W) for ~190W total, or ~$7β8/month at 24/7 load.
Can I use the self-hosted Qwen API with ChatGPT-compatible applications?
Yes. Ollama exposes an OpenAI-compatible API at http://your-server:11434/v1. Set OPENAI_API_BASE=http://your-server:11434/v1 and OPENAI_API_KEY=anything in your application. Any tool that calls the OpenAI Chat Completions API β Continue.dev, Cursor (local mode), LangChain, AutoGen β works without modification.
Update Log
- 2026-05-26: Initial publication. Benchmark data from May 2026 hardware. Prices confirmed against Newegg, Amazon, and GPU market trackers.
- Next review scheduled: 2026-11-26