Hardware & Performance

Quantization, Offloading, and Layer Splitting: Advanced VRAM Reduction

12 min read · Hans Kuepper · Founder of PromptQuorum, a multi-model AI orchestration tool

When your GPU VRAM is insufficient for a model, three advanced techniques can help: quantization (reducing bit precision), offloading (spilling to system RAM), and layer splitting (distributing across multiple GPUs). As of April 2026, these techniques are mature and can fit 70B models into 24 GB VRAM.

Key Takeaways

  • Quantization: FP16 → Q8 → Q5 → Q4. FP16 → Q8 halves the file size; each further step saves roughly another 20–40%, cutting VRAM proportionally. Quality impact: negligible at Q5, minor at Q4.
  • Offloading: move model layers to system RAM when VRAM is full. Speed penalty: 5–10× slower (system RAM bandwidth is on the order of 100–200 GB/s versus ~1,000–2,000 GB/s for GPU VRAM).
  • Layer splitting: distribute the model across 2+ GPUs. Example: a 70B model on 2× RTX 4090 runs at ~100 tokens/sec.
  • As of April 2026, these techniques compose: Q4 quantization + offloading fits 70B on 12 GB of VRAM (very slow), and Q5 + layer splitting fits 70B on 2× 24 GB GPUs (usable).

Quantization: The Primary VRAM Reducer

Quantization reduces the precision of model weights from floating-point 16-bit (FP16) to lower bits (Q8, Q5, Q4, Q3).

| Method | File Size (7B, ≈ weight VRAM) | Quality | Speed |
| --- | --- | --- | --- |
| No quantization (FP16) | 14 GB | 100% | Baseline |
| Dynamic Q8 | 7 GB | 99% | Baseline |
| Static Q5 | 4.4 GB | 95% | Baseline |
| AWQ (activation-aware weight quantization) | 3.5 GB | 98% | Baseline |
| GPTQ (post-training quantization) | 3.5 GB | 97% | 95% of baseline |
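The sizes in the table above follow almost directly from bits per weight. A back-of-the-envelope sketch (the bits-per-weight values are nominal; real quantized files carry some extra metadata overhead):

```python
# Nominal bits per weight for each scheme -- illustrative, not exact.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4}

def file_size_gb(params_billions: float, scheme: str) -> float:
    """Approximate on-disk size in GB: parameters * bits-per-weight / 8."""
    bits = BITS_PER_WEIGHT[scheme]
    return params_billions * 1e9 * bits / 8 / 1e9

print(round(file_size_gb(7, "FP16"), 1))  # 14.0 GB, matching the table
print(round(file_size_gb(7, "Q5"), 1))    # ~4.4 GB
```

VRAM needed for the weights is roughly the file size; the runtime adds KV cache and activation buffers on top.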

Offloading: CPU RAM as Spillover

When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.

Scenario: running a 70B Q4 model (~35 GB) on an RTX 4090 (24 GB). With roughly a third of the weights offloaded to system RAM, expect ~5–10 tokens/sec.

Offloading is a last resort: it makes interactive inference impractically slow. Use it only for offline batch processing or experimentation.

```bash
# Ollama: num_gpu=0 forces all layers onto the CPU (full offload);
# Ollama also offloads automatically when a model does not fit in VRAM
export OLLAMA_NUM_GPU=0
ollama run llama3.1:70b

# vLLM: partial CPU offload
vllm serve meta-llama/Llama-2-70b-hf \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10  # offload 10 GB of weights to system RAM
```
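The 70B-on-a-4090 scenario above can be estimated with simple arithmetic. A minimal sketch, assuming a hypothetical ~2 GB VRAM reserve for KV cache and runtime overhead:

```python
def offload_plan(model_gb: float, vram_gb: float, overhead_gb: float = 2.0):
    """Estimate how much of a model spills to system RAM.

    overhead_gb is an assumed reserve for KV cache and runtime buffers.
    Returns (GB on GPU, GB in RAM, fraction offloaded).
    """
    on_gpu = max(min(model_gb, vram_gb - overhead_gb), 0.0)
    to_ram = model_gb - on_gpu
    return on_gpu, to_ram, to_ram / model_gb

on_gpu, to_ram, frac = offload_plan(model_gb=35, vram_gb=24)
print(f"{on_gpu} GB on GPU, {to_ram} GB in RAM ({frac:.0%} offloaded)")
```

For the 35 GB / 24 GB case this yields about 13 GB (roughly a third of the weights) in system RAM, which is why throughput drops to single-digit tokens per second.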

Layer Splitting: Distribute Across Multiple GPUs

Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically.

Example: 70B model with 2× RTX 4090:

- Without splitting: not possible; a Q4 build needs ~40 GB of VRAM on a single GPU.

- With splitting: half of the weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).

Layer splitting is practical for production deployments and is transparent to the user.

```bash
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 2  # split across 2 GPUs

# Ollama (built on llama.cpp): detects multiple GPUs and splits automatically
ollama run llama3.1:70b
```
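A quick way to sanity-check whether a model fits under splitting: shard the weights evenly and leave per-GPU headroom for KV cache and activations. The 2 GB headroom below is an illustrative assumption, not a measured figure:

```python
def fits_when_split(model_gb: float, n_gpus: int, gpu_vram_gb: float,
                    headroom_gb: float = 2.0):
    """Check whether evenly sharded weights fit, with assumed per-GPU headroom.

    Returns (fits, required GB per GPU).
    """
    per_gpu = model_gb / n_gpus + headroom_gb
    return per_gpu <= gpu_vram_gb, per_gpu

# 70B Q4 (~40 GB) on 2x RTX 4090 (24 GB each)
print(fits_when_split(40, 2, 24))  # fits: 20 GB of weights + 2 GB headroom
# The same model on a single 4090 does not fit
print(fits_when_split(40, 1, 24))
```

This also shows why splitting enables larger models rather than speeding up small ones: the benefit is the pooled VRAM, while communication between GPUs adds a modest cost.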

Hybrid Approach: Combining Techniques

Best results come from combining all three:

Scenario 1: 70B on a single RTX 4090 (24 GB)

- Quantize to Q4 (≈140 GB at FP16 → ~35 GB)

- Offload the ~13 GB that does not fit to system RAM

- Result: ~5–10 tokens/sec (slow, but it runs)

Scenario 2: 70B on 2× RTX 4090

- Quantize to Q5 (43.75 GB)

- Use layer splitting across 2 GPUs (22 GB each)

- Result: ~100 tokens/sec (practical)
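The decision logic behind these two scenarios can be sketched as a simple strategy picker. The thresholds (2 GB of headroom per GPU) are illustrative assumptions:

```python
def choose_strategy(model_gb: float, gpus: list[float]) -> str:
    """Pick the least-compromised setup for a quantized model.

    model_gb: quantized model size; gpus: per-GPU VRAM in GB.
    The 2 GB per-GPU headroom is an assumed allowance for KV cache.
    """
    if model_gb <= max(gpus) - 2:
        return "single GPU, no offload"
    if model_gb <= sum(gpus) - 2 * len(gpus):
        return "layer split across GPUs"
    return "quantize further + offload to system RAM (slow)"

print(choose_strategy(35, [24]))      # Scenario 1: must offload
print(choose_strategy(40, [24, 24]))  # Scenario 2: layer splitting suffices
```

The ordering matters: splitting is preferred over offloading because its ~5–10% overhead is far cheaper than the 5–10× penalty of going through system RAM.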

Performance Trade-offs

Each technique comes with speed penalties:

| Technique | VRAM Saved | Speed Impact | Quality Impact |
| --- | --- | --- | --- |
| Quantization (Q4) | 50% | Negligible (±5%) | Minor |
| Offloading (CPU RAM) | 60–80% | 5–10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5–10% slower | None |
| Quantization + offloading | 75–90% | 3–5× slower | Minor |

Common Mistakes With Advanced Techniques

  • Expecting offloading to be fast. System RAM has roughly a tenth of the bandwidth of GPU VRAM, and weights must also cross the PCIe bus. Offloading makes interactive inference impractically slow.
  • Assuming layer splitting doubles speed. It does not: inter-GPU communication costs roughly 10%, so splitting is about fitting larger models, not multiplying throughput.
  • Quantizing below Q4 for chat. Q3 and Q2 cause noticeable quality loss. Acceptable only for lightweight tasks.
  • Not measuring actual VRAM usage. Use `nvidia-smi` to verify real VRAM consumption before committing to quantization levels.
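On the last point, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` prints VRAM figures in a machine-readable form. A small parser sketch (the sample line below is made up for illustration):

```python
def parse_vram_csv(line: str) -> tuple[int, int]:
    """Parse one line of nvidia-smi CSV output, e.g. '18432 MiB, 24564 MiB'.

    Returns (used MiB, total MiB).
    """
    used, total = (int(field.strip().split()[0]) for field in line.split(","))
    return used, total

used, total = parse_vram_csv("18432 MiB, 24564 MiB")  # sample line, not real output
print(f"{used}/{total} MiB ({used / total:.0%} of VRAM in use)")
```

Run the check while the model is actually serving requests: KV cache grows with context length, so idle measurements understate real consumption.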

Sources

  • vLLM Documentation — docs.vllm.ai
  • llama.cpp Multi-GPU — github.com/ggerganov/llama.cpp#multi-gpu-inference
  • GPTQ Quantization Paper — arxiv.org/abs/2210.17323
