Key Takeaways
- Quantization: FP16 → Q8 → Q5 → Q4. Each step cuts file size and VRAM by 25–50%. Quality impact: negligible at Q5, minor at Q4.
- Offloading: Move model layers to system RAM when VRAM is full. Speed penalty: 5–10× slower (system RAM moves on the order of 100–200 GB/s vs ~1,000 GB/s for high-end GPU VRAM).
- Layer splitting: Distribute the model across 2+ GPUs. Example: 70B model on 2× RTX 4090 = ~100 tokens/sec.
- As of April 2026, these techniques work together: e.g., Q4 quantization + offloading = 70B on 12 GB VRAM (very slow), or Q5 + layer splitting = 70B on 2× 24 GB GPUs (usable).
Quantization: The Primary VRAM Reducer
Quantization reduces the precision of model weights from 16-bit floating point (FP16) to lower bit widths (Q8, Q5, Q4, Q3).
| Method | File Size (7B) | VRAM | Quality | Speed |
|---|---|---|---|---|
| No quantization (FP16) | 14 GB | 14 GB | 100% | Baseline |
| Dynamic Q8 | 7 GB | 7 GB | 99% | Baseline |
| Static Q5 | 4.4 GB | 4.4 GB | 95% | Baseline |
| AWQ (activation-aware weight quantization) | 3.5 GB | 3.5 GB | 98% | Baseline |
| GPTQ (post-training quantization) | 3.5 GB | 3.5 GB | 97% | ~95% of baseline |
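The sizes in the table follow directly from bits per weight: size in bytes is roughly parameter count times bit width divided by 8. A quick back-of-envelope calculator (a sketch; real quant formats mix bit widths per tensor, so treat the Q values as nominal):

```python
# Nominal bits per weight for each quantization level.
# Real GGUF/AWQ/GPTQ files add small per-block metadata, so actual
# sizes run slightly higher than this estimate.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4}

def weight_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"7B at {q}: {weight_size_gb(7, q):.1f} GB")
# FP16: 14.0 GB, Q8: 7.0 GB, Q5: 4.4 GB, Q4: 3.5 GB
```

The same formula gives the 70B numbers used later: 70B at Q4 is ~35 GB, at Q5 ~44 GB.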
Offloading: CPU RAM as Spillover
When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.
Scenario: Running a 70B Q4 model (~35 GB) on an RTX 4090 (24 GB). Roughly 11 GB of weights spill to system RAM, and inference runs at ~5–10 tokens/sec.
Offloading is a last resort: it makes interactive inference impractically slow. Use it only for offline batch processing or experimentation.
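A back-of-envelope way to see why offloading is so slow: token generation is memory-bandwidth bound, since each new token reads every weight once. Time per token is then bytes-per-tier divided by that tier's bandwidth (the bandwidth defaults below are illustrative assumptions, not benchmarks):

```python
def tokens_per_sec(model_gb: float, frac_in_ram: float,
                   gpu_bw: float = 1000.0, ram_bw: float = 100.0) -> float:
    """Bandwidth-bound estimate of generation speed.

    Assumes each token reads all weights exactly once; gpu_bw and
    ram_bw are in GB/s (illustrative: ~1,000 GB/s for GPU VRAM,
    ~100 GB/s for system RAM).
    """
    seconds_per_token = (model_gb * (1 - frac_in_ram) / gpu_bw
                         + model_gb * frac_in_ram / ram_bw)
    return 1.0 / seconds_per_token

print(f"{tokens_per_sec(35, 0.0):.0f} tok/s")  # all on GPU: ~29 tok/s
print(f"{tokens_per_sec(35, 0.3):.1f} tok/s")  # ~1/3 in RAM: ~7.7 tok/s
```

Even with only a third of the weights in RAM, the slow tier dominates, which is why measured speeds land in the single digits.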
```shell
# Ollama: offloading happens automatically when a model exceeds VRAM;
# the num_gpu parameter caps how many layers stay on the GPU
ollama run llama3.1:70b
# then, inside the interactive session: /set parameter num_gpu 20

# vLLM: enable partial CPU offload
vllm serve meta-llama/Llama-2-70b-hf \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10   # Offload 10 GB of weights to RAM
```
Layer Splitting: Distribute Across Multiple GPUs
Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically.
Example: 70B model with 2× RTX 4090:
- Without splitting: Impossible (needs 40+ GB VRAM in one GPU).
- With splitting: Half the model weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).
Layer splitting is practical for production deployments and is transparent to the user.
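Under the hood, layer splitting assigns each transformer block to a GPU. A simplified sketch of proportional assignment (the 80-layer count is Llama 70B's actual depth; the function itself is illustrative, not vLLM or llama.cpp code):

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign transformer layers to GPUs in proportion to their VRAM,
    in the spirit of llama.cpp's tensor-split option (simplified)."""
    total = sum(vram_gb)
    counts = [round(n_layers * v / total) for v in vram_gb]
    counts[-1] = n_layers - sum(counts[:-1])  # absorb rounding error
    return counts

print(split_layers(80, [24, 24]))  # matched GPUs:   [40, 40]
print(split_layers(80, [24, 16]))  # mismatched:     [48, 32]
```

During inference, activations are handed from one GPU to the next at the split boundary, which is why the overhead stays small relative to the per-layer compute.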
```shell
# vLLM: tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 2   # Split across 2 GPUs

# Ollama (built on llama.cpp): auto-detects multiple GPUs
# and splits layers across them
ollama run llama3.1:70b
```
Hybrid Approach: Combining Techniques
Best results come from combining all three:
Scenario 1: 70B on a single RTX 4090 (24 GB)
- Quantize to Q4 (140 GB at FP16 → ~35 GB)
- Offload the ~11 GB that no longer fits to system RAM
- Result: ~8–10 tokens/sec (slow, but it works)
Scenario 2: 70B on 2× RTX 4090
- Quantize to Q5 (140 GB at FP16 → ~44 GB)
- Use layer splitting across both GPUs (~22 GB each)
- Result: ~100 tokens/sec (practical)
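Both scenarios reduce to simple arithmetic: pick a quant level, compare the resulting model size to total VRAM, and offload the remainder. A small planner sketch (it ignores KV cache and runtime overhead, which is why real deployments need some headroom):

```python
def plan(params_billion: float, gpus_gb: list[float],
         quant_bits: int = 4) -> tuple[float, float]:
    """Return (model size in GB, GB that must be offloaded to RAM).

    Simplified: counts weights only, ignoring KV cache and
    framework overhead.
    """
    model_gb = params_billion * quant_bits / 8   # e.g. 70B @ Q4 = 35 GB
    offload_gb = max(0.0, model_gb - sum(gpus_gb))
    return model_gb, offload_gb

print(plan(70, [24]))         # Scenario 1: (35.0, 11.0) -> offload 11 GB
print(plan(70, [24, 24], 5))  # Scenario 2: (43.75, 0.0) -> fits in VRAM
```

The second call shows why Scenario 2 avoids offloading entirely: 44 GB of Q5 weights fits inside 48 GB of combined VRAM.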
Performance Trade-offs
Each technique comes with speed penalties:
| Technique | VRAM Saved | Speed Impact | Quality Impact |
|---|---|---|---|
| Quantization (Q4) | 75% (vs FP16) | Minimal (±5%) | Minor |
| Offloading (CPU RAM) | 60–80% | 5–10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5–10% slower | None |
| Quantization + Offloading | 75–90% | 3–5× slower | Minor |
Common Mistakes With Advanced Techniques
- Expecting offloading to be fast. System RAM has roughly 10× less bandwidth than GPU VRAM, so heavily offloaded inference is impractical for interactive use.
- Assuming layer splitting doubles speed. It does not. Two GPUs running one model = ~90% of one GPU speed (overhead from GPU communication).
- Quantizing below Q4 for chat. Q3 and Q2 cause noticeable quality loss. Acceptable only for lightweight tasks.
- Not measuring actual VRAM usage. Use `nvidia-smi` to verify real VRAM consumption before committing to quantization levels.
Sources
- vLLM Documentation β docs.vllm.ai
- llama.cpp Multi-GPU β github.com/ggerganov/llama.cpp#multi-gpu-inference
- GPTQ Quantization Paper β arxiv.org/abs/2210.17323