Key Points
- Quantization: FP16 → Q8 → Q5 → Q4. Q8 halves file size and VRAM relative to FP16; Q4 cuts both to roughly a quarter. Quality impact: negligible at Q8, minor at Q5, small but noticeable at Q4.
- Offloading: Move model layers to system RAM when VRAM is full. Speed penalty: 5–10× slower (typical system RAM bandwidth is ~50–100 GB/s versus ~1000 GB/s for modern GPU VRAM).
- Layer splitting: Distribute model across 2+ GPUs. Example: 70B model on 2× RTX 4090 = ~100 tokens/sec.
- As of April 2026, these techniques work together: e.g., Q4 quantization + offloading = 70B on 12 GB VRAM (very slow), or Q5 + layer split = 70B on 2× 16GB GPUs (usable).
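These VRAM figures follow directly from parameter count and bits per weight. A minimal back-of-the-envelope estimator (the 20% overhead factor for KV cache and activations is an assumption, not a measured constant):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% for KV cache and activations."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(7, 16))   # 7B at FP16  -> ~16.8 GB
print(estimate_vram_gb(70, 4))   # 70B at Q4   -> ~42.0 GB
```

Real usage varies with context length and engine, so treat the output as a lower bound and verify with `nvidia-smi`.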
Quantization: The Primary VRAM Reducer
Quantization reduces the precision of model weights from floating-point 16-bit (FP16) to lower bits (Q8, Q5, Q4, Q3).
| Method | VRAM (7B model) | Quality | Speed |
|---|---|---|---|
| No quantization (FP16) | 14 GB | 100% | Baseline |
| Q8 | 7 GB | 99% | ~Baseline |
| Q5 | 4.4 GB | 95% | ~Baseline |
| AWQ (activation-aware weight quantization) | 3.5 GB | 98% | ~Baseline |
| GPTQ (post-training quantization) | 3.5 GB | 97% | ~95% of baseline |
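The quality percentages above come from accumulated rounding error. A toy symmetric round-to-nearest quantizer (real schemes such as GPTQ and AWQ are far more sophisticated, using grouping and activation-aware scaling; this sketch, assuming numpy is available, only shows why fewer bits mean larger error):

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantization: map weights to signed ints, then back."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (8, 5, 4, 3):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error roughly doubles with every bit removed, which is why Q3 and below start to visibly hurt chat quality.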
Offloading: CPU RAM as Spillover
When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.
Scenario: Running a 70B Q4 model (≈35 GB of weights) on an RTX 4090 (24 GB). Roughly 40% of the layers spill to system RAM, and throughput drops to ~5–10 tokens/sec.
Offloading is a last resort — it makes inference impractical. Use only for offline batch processing or experimentation.
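Before reaching for offloading, it helps to estimate the cost: during token generation every weight is read once per token, so throughput is roughly bandwidth-bound. A sketch using assumed bandwidths (~1000 GB/s VRAM, ~80 GB/s system RAM — illustrative round numbers, not measurements):

```python
def tokens_per_sec(model_gb: float, gpu_frac: float,
                   gpu_bw: float = 1000.0, ram_bw: float = 80.0) -> float:
    """Bandwidth-bound throughput estimate.

    Each generated token reads the whole model once:
    time = GPU-resident GB / GPU bandwidth + RAM-resident GB / RAM bandwidth.
    """
    t = model_gb * gpu_frac / gpu_bw + model_gb * (1 - gpu_frac) / ram_bw
    return round(1 / t, 1)

print(tokens_per_sec(35, 1.0))   # 70B Q4 fully in VRAM
print(tokens_per_sec(35, 0.6))   # 40% of weights offloaded to system RAM
```

Even a modest 40% spillover dominates the per-token time, which is why offloading feels so much slower than the fraction offloaded would suggest.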
```bash
# Ollama: layers that don't fit in VRAM spill to system RAM automatically.
# To force full CPU inference, set num_gpu to 0 inside the session:
ollama run llama3.1:70b
>>> /set parameter num_gpu 0   # 0 GPU layers = pure CPU inference
```

```bash
# vLLM: enable partial CPU offload
vllm serve meta-llama/Llama-2-70b-hf \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10   # offload 10 GB of weights to system RAM
```

Layer Splitting: Distribute Across Multiple GPUs
Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically.
Example: 70B model with 2× RTX 4090:
- Without splitting: Impossible (needs 40+ GB VRAM in one GPU).
- With splitting: Half the model weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).
Layer splitting is practical for production deployments and is transparent to the user.
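The per-GPU memory requirement is easy to estimate: tensor parallelism divides the weights evenly, plus some per-GPU overhead. In this sketch the flat 2 GB allowance per GPU for KV cache and activations is an assumed placeholder:

```python
def per_gpu_gb(model_gb: float, num_gpus: int, overhead_gb: float = 2.0) -> float:
    """Per-GPU VRAM need under tensor parallelism: an even weight shard
    plus an assumed flat overhead for KV cache and activations."""
    return round(model_gb / num_gpus + overhead_gb, 1)

print(per_gpu_gb(40, 1))  # 70B Q4 on one GPU: ~42 GB, won't fit in 24 GB
print(per_gpu_gb(40, 2))  # split across two GPUs: ~22 GB each, fits a 24 GB card
```

This is why two 24 GB cards can serve a model that a single 24 GB card cannot, even though total compute barely changes.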
```bash
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 2   # split across 2 GPUs
```

```bash
# Ollama (llama.cpp backend): multi-GPU support
ollama run llama3.1:70b   # auto-detects and splits across GPUs
```

Hybrid Approach: Combining Techniques
Best results come from combining all three:
Scenario 1: 70B on single RTX 4090 (24 GB)
- Quantize to Q4 (140 GB at FP16 → ~35 GB)
- Offload the ~15 GB that does not fit in VRAM to system RAM
- Result: ~5–10 tokens/sec (slow but works)
Scenario 2: 70B on 2× RTX 4090
- Quantize to Q5 (~44 GB)
- Use layer splitting across 2 GPUs (~22 GB each, fitting 24 GB cards with a little headroom for the KV cache)
- Result: ~100 tokens/sec (practical)
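The decision process behind these scenarios can be written as a small heuristic. This is an illustrative sketch, not a tuning tool — the 64 GB system-RAM assumption and the thresholds are placeholders:

```python
def plan(model_gb_q4: float, vram_gb: float, num_gpus: int = 1,
         system_ram_gb: float = 64.0) -> str:
    """Pick a serving strategy for a Q4-quantized model, mirroring
    the scenarios above. All thresholds are illustrative assumptions."""
    total_vram = vram_gb * num_gpus
    if model_gb_q4 <= vram_gb:
        return "single GPU, no offloading"
    if model_gb_q4 <= total_vram:
        return "layer splitting across GPUs"
    if model_gb_q4 <= total_vram + system_ram_gb:
        return "quantization + CPU offloading (slow)"
    return "model too large for this machine"

print(plan(35, 24))               # 70B Q4, one RTX 4090
print(plan(35, 24, num_gpus=2))   # 70B Q4, two RTX 4090s
```

In words: use one GPU if the model fits, split if the fleet fits, and offload only as the last resort — exactly the ordering of the trade-off table below.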
Performance Trade-offs
Each technique comes with speed penalties:
| Technique | VRAM Saved | Speed Impact | Quality Impact |
|---|---|---|---|
| Quantization (Q4) | 75% (vs FP16) | None (±5%) | Minor |
| Offloading (CPU RAM) | 60–80% | 5–10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5–10% slower | None |
| Quantization + Offloading | 75–90% | 3–5× slower | Minor |
Common Mistakes With Advanced Techniques
- Expecting offloading to be fast. System RAM has roughly 10× lower bandwidth than GPU VRAM, and every offloaded layer is read across that link for each generated token. Heavy offloading makes interactive inference impractical.
- Assuming layer splitting doubles speed. It does not. Two GPUs running one model = ~90% of one GPU speed (overhead from GPU communication).
- Quantizing below Q4 for chat. Q3 and Q2 cause noticeable quality loss. Acceptable only for lightweight tasks.
- Not measuring actual VRAM usage. Use `nvidia-smi` to verify real VRAM consumption before committing to quantization levels.
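The last point is scriptable. A sketch that parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv`, run here against a captured sample string so it works without a GPU:

```python
import csv
import io

SAMPLE = """memory.used [MiB], memory.total [MiB]
18432 MiB, 24564 MiB
"""  # captured example output; in practice, pipe in:
     # nvidia-smi --query-gpu=memory.used,memory.total --format=csv

def vram_usage(csv_text: str) -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) per GPU from nvidia-smi CSV output."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [(int(u.split()[0]), int(t.split()[0])) for u, t in rows[1:]]

for used, total in vram_usage(SAMPLE):
    print(f"{used}/{total} MiB ({100 * used / total:.0f}%)")
```

Run this after loading a model at each candidate quantization level; committing to a quant based on file size alone ignores KV cache and engine overhead.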
Sources
- vLLM Documentation — docs.vllm.ai
- llama.cpp Multi-GPU — github.com/ggerganov/llama.cpp#multi-gpu-inference
- GPTQ Quantization Paper — arxiv.org/abs/2210.17323