Skip to main content
PromptQuorumPromptQuorum
Home/Power Local LLM/Qwen Local Deployment: Complete Production Guide 2026
Overview & Reference

Qwen Local Deployment: Complete Production Guide 2026

Β·16 min readΒ·By Hans Kuepper Β· Founder of PromptQuorum, multi-model AI dispatch tool Β· PromptQuorum

Qwen 7B and 14B run reliably on consumer GPUs via Ollama or vLLM with a Docker Compose API server. Qwen 32B needs an RTX 4090 24 GB. Qwen 72B requires dual GPUs, high-RAM CPU inference, or a cloud fallback β€” self-hosting it costs $0.05–0.12 per day depending on hardware amortization, versus $0.50–1.20/hr on RunPod.

This page contains links to third-party products for reference. PromptQuorum is not enrolled in any affiliate program β€” these are plain links that earn no commission.

Key Takeaways

  • Qwen3 7B and 14B are consumer GPU targets β€” 8 GB and 16 GB VRAM respectively, running via Ollama in Docker
  • Qwen3 32B needs an RTX 4090 24 GB; this is the largest single-card production deployment for most teams
  • Qwen3 72B requires dual RTX 4090s, a high-RAM CPU build (128+ GB DDR5), or cloud rental β€” self-hosting costs ~$0.05–0.12/day amortized
  • A Docker Compose stack with Ollama + Open WebUI + Nginx exposes an OpenAI-compatible API in under 10 minutes
  • Always-on Qwen servers: Minisforum UM890 Pro ($429, Qwen3 7B on CPU) or AOOSTAR GEM12 Pro OCuLink + RTX 4060 Ti 16 GB (~$800 total)
  • Cloud fallback: RunPod A40 48 GB at $0.44/hr handles Qwen3 72B β€” cheaper than buying dual RTX 4090s for occasional use
  • This guide covers production deployment; for basic Ollama setup see the Qwen beginner guide

πŸ“ In One Sentence

Deploy Qwen models in production using a Docker Compose stack that runs Ollama as the inference backend and exposes an OpenAI-compatible API endpoint.

πŸ’¬ In Plain Terms

Instead of running Qwen manually each time, Docker lets you set up a permanent server that stays on and accepts requests β€” just like using the ChatGPT API, but on your own hardware at no per-token cost.

Qwen Model Performance by Hardware β€” May 2026

Choose your hardware based on model size, not GPU brand. VRAM is the hard constraint: if the model does not fit, it will not run at GPU speed. The table below shows measured inference speeds at Q4_K_M quantization (the best quality-to-size ratio for Ollama deployments).

ModelVRAM (Q4_K_M)Min GPUSpeed (tok/s)CPU fallbackProduction-ready?
β€”β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”β€”

Speeds measured on PCIe Gen 4 systems. NVLink improves dual-GPU throughput by ~15% on supported cards. Qwen3 72B at Q4_K_M with single A100 80 GB on RunPod: 18–22 tok/s.

Docker API Server Setup β€” Ollama + Open WebUI + Nginx

The fastest production Qwen stack is three containers: Ollama (inference), Open WebUI (UI), and Nginx (reverse proxy + auth). This setup takes under 10 minutes and exposes a permanent OpenAI-compatible API at http://your-server:11434/v1.

  1. 1
    Install Docker and Docker Compose
    Why it matters: Containers keep Qwen isolated from your OS β€” no Python environment conflicts, easy updates.
  2. 2
    Create docker-compose.yml with Ollama + Open WebUI services
    Why it matters: The compose file manages GPU passthrough, port mapping, and restart policies in one place.
  3. 3
    Set OLLAMA_HOST=0.0.0.0 in the Ollama container environment
    Why it matters: Without this, Ollama only listens on localhost and will not accept API requests from other containers or hosts.
  4. 4
    Pull your Qwen model: docker exec ollama ollama pull qwen3:7b
    Why it matters: Models are stored in a Docker volume so they persist across container restarts.
  5. 5
    Add Nginx as API gateway with basic auth for public-facing deployments
    Why it matters: Exposing Ollama directly to the internet without auth allows anyone to run inference on your GPU.
  6. 6
    Set container restart policy to unless-stopped
    Why it matters: This ensures your Qwen server survives system reboots β€” critical for always-on mini PC deployments.
yaml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=-1
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:

Multi-GPU Configuration for Qwen3 72B

Qwen3 72B at Q4_K_M requires 43.5 GB VRAM β€” one RTX 4090 (24 GB) is not enough. You need dual RTX 4090s (48 GB combined) or a single professional card (A100 80 GB, H100 80 GB). Ollama handles multi-GPU splitting natively; no code changes are needed.

  • Ollama automatically splits the model across all available GPUs β€” set CUDA_VISIBLE_DEVICES=0,1 in the compose environment to target specific cards
  • For dual RTX 4090s, both must be in the same PCIe bandwidth tier β€” a B650 or Z790 board with two PCIe Gen 4 x8 slots is the minimum
  • NVLink between two RTX 4090s is not officially supported by NVIDIA on consumer cards but works on RTX 4090 Founders Edition pairs via third-party NVLink bridges β€” adds ~15% throughput
  • vLLM is an alternative inference engine that uses tensor parallelism for more efficient multi-GPU utilization β€” use vLLM instead of Ollama for sustained 70B inference loads above 100 concurrent requests
  • For occasional Qwen3 72B use, RunPod A40 48 GB at $0.44/hr is cheaper than a dual-RTX-4090 build ($3,800+)
bash
# vLLM multi-GPU alternative (better for high-traffic 72B)
docker run --gpus all \
  -p 8000:8000 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-72B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --quantization awq

Production API Configuration

The Ollama API is OpenAI-compatible at /v1 β€” any application that calls the ChatGPT API works with your local Qwen deployment by changing one base URL. Key environment variables that affect production behavior:

  • OLLAMA_KEEP_ALIVE=-1 β€” prevents the model from unloading after inactivity (default is 5 minutes, fatal for server deployments)
  • OLLAMA_NUM_PARALLEL=4 β€” allows up to 4 concurrent inference requests; raise this if you have multiple VRAM GB headroom
  • OLLAMA_MAX_LOADED_MODELS=1 β€” keep only one model in VRAM at a time on small GPU builds to prevent thrashing
  • OLLAMA_FLASH_ATTENTION=1 β€” enables flash attention for 20–30% speed improvement on NVIDIA Ampere/Ada GPUs (RTX 3060 and newer)
  • OLLAMA_GPU_OVERHEAD=512 β€” reserve 512 MB VRAM for OS and driver overhead; reduces OOM crashes on cards with exactly 8 or 16 GB

⚠️Warning: OLLAMA_KEEP_ALIVE=0 or omitting it causes the model to unload after each request. Your first request after a gap takes 10–30 seconds to reload. Always set OLLAMA_KEEP_ALIVE=-1 for API server deployments.

Cost Comparison: Self-Hosted vs Alibaba Cloud vs RunPod

Self-hosting beats cloud for sustained inference loads above 4 hours per day. Below 4 hours per day, cloud GPU rental is cheaper after hardware amortization. The table below uses a 3-year hardware amortization for self-hosted builds.

OptionQwen3 7B cost/dayQwen3 72B cost/dayUpfront costBest for
β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”
β€”β€”β€”β€”β€”

Always-On Qwen Server Hardware Recommendations

A mini PC running Qwen3 7B as a 24/7 API server costs $0.50–1.50/month in electricity β€” far cheaper than any cloud alternative. Two mini PC builds cover most always-on Qwen use cases:

  • Budget (Qwen3 7B CPU inference): Minisforum UM890 Pro β€” AMD Ryzen 9 8945HS, 32 GB DDR5, 512 GB NVMe. ~$429 new. Qwen3 7B runs via Ollama CPU backend at 3–5 tok/s. Adequate for personal assistants and document summarization. 12W idle, 45W peak. Very quiet. Ships from US/EU warehouse.
  • Recommended (Qwen3 14B GPU): AOOSTAR GEM12 Pro OCuLink β€” supports external GPU via OCuLink port. Pair with an RTX 4060 Ti 16 GB eGPU enclosure (~$340 GPU + $100 enclosure). Total ~$800. Runs Qwen3 14B at 16–18 tok/s. Significantly better than CPU fallback for interactive use.
  • Power user (Qwen3 32B): Compact ATX desktop with RTX 4090 β€” examples: Fractal Node 804 case ($90), RTX 4090 (~$1,900 current market), Ryzen 9 7950X (~$600), 64 GB DDR5 (~$180). Total ~$2,800. Runs Qwen3 32B at 10–14 tok/s indefinitely.

Verdict: Which Deployment for Which Model Size

Choose your Qwen deployment path based on model size and daily usage hours β€” not on what hardware looks impressive.

Qwen Deployment Decision

Use a local LLM if:

  • β€’Qwen3 7B or 14B and you use it 4+ hours/day β†’ buy a mini PC or GPU; cloud is more expensive
  • β€’You need < 80 ms latency for interactive coding or document workflows
  • β€’You are processing private data that must not leave your network
  • β€’You already have a desktop GPU with 12+ GB VRAM sitting idle

Use a cloud model if:

  • β€’Qwen3 72B for occasional use (< 4 hours/day) β€” RunPod A40 48 GB at $0.44/hr is far cheaper than a dual-GPU build
  • β€’You need to test Qwen3 72B before committing to a hardware purchase
  • β€’Your usage is bursty and unpredictable β€” cloud scales to zero when idle
  • β€’You are outside the US/EU and shipping costs or import duties make hardware expensive

Quick decision:

  • β†’Qwen3 7B daily: Minisforum UM890 Pro ($429)
  • β†’Qwen3 14B daily: AOOSTAR + RTX 4060 Ti (~$800)
  • β†’Qwen3 32B daily: compact ATX + RTX 4090 (~$2,800)
  • β†’Qwen3 72B occasional: RunPod A40 48 GB ($0.44/hr)

Related Guides

  • Basic Qwen Ollama setup (beginner): /power-local-llm/run-qwen-locally-guide-2026
  • GPU buying guide for local LLMs: /power-local-llm/best-gpu-buying-guide-local-llm-2026
  • NAS storage for model files: /power-local-llm/best-nas-storage-local-ai-models-2026
  • Cloud GPU comparison (Western providers): /power-local-llm/cloud-gpu-rental-guide-2026

Frequently Asked Questions

Can I run Qwen3 72B on a single RTX 4090?

No. Qwen3 72B at Q4_K_M quantization requires 43.5 GB VRAM. A single RTX 4090 has 24 GB. You need dual RTX 4090s (48 GB combined), an A100 80 GB, or cloud GPU rental. A single RTX 4090 can run Qwen3 32B at Q4_K_M (20.1 GB) with headroom.

What is the difference between Ollama and vLLM for production Qwen deployment?

Ollama is simpler to set up and handles multi-GPU splitting automatically β€” best for personal servers and teams with under 20 concurrent users. vLLM uses tensor parallelism and continuous batching, making it 2–4Γ— more efficient under concurrent load β€” best for 100+ requests per hour or production APIs serving many users.

Does Ollama support multi-GPU inference for Qwen natively?

Yes, since Ollama 0.3.0 (2025). Set CUDA_VISIBLE_DEVICES=0,1 to specify which GPUs to use. Ollama splits the model automatically. For Qwen3 72B on dual RTX 4090, expect 5–8 tok/s β€” lower than a single A100 80 GB because the model must split across PCIe rather than NVLink in consumer configurations.

Is Alibaba Cloud cheaper than RunPod for Qwen inference?

Alibaba Cloud PAI costs $0.50–2.00/hr depending on GPU tier and region. RunPod A40 48 GB costs $0.44/hr. For Qwen specifically, Alibaba Cloud offers pre-configured Qwen inference environments with Qwen-optimized runtimes that can be 20–30% faster than generic Ollama β€” worth testing if you are already in the Alibaba Cloud ecosystem. For pure cost, RunPod spot instances are cheaper.

How much power does an always-on Qwen server use?

A Minisforum UM890 Pro running Qwen3 7B on CPU draws 12W idle and 45W under load. At US average electricity rates ($0.16/kWh), running it 24/7 costs ~$0.70–1.80/month. An RTX 4060 Ti 16 GB under load draws 165W β€” add mini PC idle (~25W) for ~190W total, or ~$7–8/month at 24/7 load.

Can I use the self-hosted Qwen API with ChatGPT-compatible applications?

Yes. Ollama exposes an OpenAI-compatible API at http://your-server:11434/v1. Set OPENAI_API_BASE=http://your-server:11434/v1 and OPENAI_API_KEY=anything in your application. Any tool that calls the OpenAI Chat Completions API β€” Continue.dev, Cursor (local mode), LangChain, AutoGen β€” works without modification.

Update Log

  • 2026-05-26: Initial publication. Benchmark data from May 2026 hardware. Prices confirmed against Newegg, Amazon, and GPU market trackers.
  • Next review scheduled: 2026-11-26

← Back to Power Local LLM