
Quantization, Offloading, and Layer Splitting: Advanced VRAM Reduction

12 min read · By Hans Kuepper · Founder of PromptQuorum, a multi-model AI dispatch tool

When your GPU VRAM is insufficient for a model, three advanced techniques can help: quantization (reducing bit precision), offloading (spilling to system RAM), and layer splitting (distributing across multiple GPUs). As of April 2026, these techniques are mature and can fit 70B models into 24 GB VRAM.

Key Takeaways

  • Quantization: FP16 → Q8 → Q5 → Q4. File size and VRAM scale with bit width: FP16 → Q8 halves them, and each further step saves another 20–40%. Quality impact: negligible at Q5, minor at Q4.
  • Offloading: move model layers to system RAM when VRAM is full. Speed penalty: 5–10× slower (system RAM delivers ~200 GB/s vs ~2000 GB/s for GPU VRAM).
  • Layer splitting: distribute the model across 2+ GPUs. Example: a 70B model on 2× RTX 4090 runs at ~100 tokens/sec.
  • As of April 2026, these techniques combine well: Q4 quantization + offloading fits a 70B model into 12 GB VRAM (very slow), and Q5 + layer splitting fits it onto 2× 24 GB GPUs (usable).

Quantization: The Primary VRAM Reducer

Quantization reduces the precision of model weights from floating-point 16-bit (FP16) to lower bits (Q8, Q5, Q4, Q3).

| Method | File Size (7B) | VRAM | Quality | Speed |
|---|---|---|---|---|
| No quantization (FP16) | — | 14 GB | 100% | Baseline |
| Dynamic Q8 | — | 7 GB | 99% | Baseline |
| Static Q5 | — | 4.4 GB | 95% | Baseline |
| AWQ (activation-aware weight quantization) | — | 3.5 GB | 98% | Baseline |
| GPTQ (GPU quantization) | — | 3.5 GB | 97% | 95% of baseline |
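The VRAM figures in the table follow directly from bit width. Here is a quick sanity check, as a sketch: it uses nominal bits per weight, while real quantization formats (Q8_0, Q5_K, AWQ, GPTQ) store extra per-block scales and so land slightly higher.

```python
# Approximate weight size from parameter count and nominal bit width.
# Real quantized files run slightly larger because they also store
# per-block scale factors alongside the weights.

def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone in decimal GB (no KV cache, no activations)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    print(f"7B @ {label:4}: {weight_size_gb(7, bits):6.2f} GB")
```

This prints 14.00, 7.00, 4.38, and 3.50 GB, matching the table's VRAM column; the same formula gives the 35 GB (Q4) and 43.75 GB (Q5) figures for 70B models used later in this article.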

Offloading: CPU RAM as Spillover

When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.

Scenario: running a 70B Q4 model on an RTX 4090 (24 GB). The model needs 35 GB, so roughly a third of its layers spill to system RAM; expect ~5–10 tokens/sec.

Offloading is a last resort: it makes interactive inference impractically slow. Use it only for offline batch processing or experimentation.

bash
# Ollama: offloading is automatic -- layers that do not fit in VRAM run on the CPU.
# To control the split, set num_gpu (the number of layers kept on the GPU):
ollama run llama3.1:70b
# then inside the session: /set parameter num_gpu 20

# vLLM: enable CPU offload (partial)
vllm serve meta-llama/Llama-2-70b-hf \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10  # Offload 10GB to RAM
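The speed penalty can be roughed out with a bandwidth model. The sketch below assumes decoding is memory-bandwidth-bound (each generated token streams the full weights once) and uses the round numbers from the takeaways (~2000 GB/s VRAM, ~200 GB/s system RAM); the 20 GB usable-VRAM figure is an assumption accounting for KV cache and runtime overhead.

```python
# Back-of-envelope decode speed with partial CPU offload.
# Assumes generation is memory-bandwidth-bound: each token reads the
# full weights once, partly from VRAM, partly from system RAM.

def tokens_per_sec(model_gb: float, usable_vram_gb: float,
                   gpu_bw_gbs: float = 2000.0, ram_bw_gbs: float = 200.0) -> float:
    on_gpu = min(model_gb, usable_vram_gb)
    on_ram = model_gb - on_gpu
    return 1.0 / (on_gpu / gpu_bw_gbs + on_ram / ram_bw_gbs)

# 70B Q4 (~35 GB) on an RTX 4090 with ~20 GB usable for weights:
print(f"~{tokens_per_sec(35, 20):.0f} tokens/sec (optimistic upper bound)")
```

This yields roughly 12 tokens/sec as an upper bound; PCIe transfer overhead and CPU compute push real-world numbers toward the ~5–10 tokens/sec cited above.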

Layer Splitting: Distribute Across Multiple GPUs

Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically.

Example: 70B model with 2× RTX 4090:

- Without splitting: Impossible (needs 40+ GB VRAM in one GPU).

- With splitting: Half the model weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).

Layer splitting is practical for production deployments and is transparent to the user.

bash
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 2  # Split across 2 GPUs

# Ollama (built on llama.cpp): multi-GPU support
ollama run llama3.1:70b  # Auto-detects and splits across GPUs
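A quick way to check whether a split fits: divide the quantized weights evenly across GPUs and add per-GPU overhead. This is a rough sketch; the 2 GB overhead default (activations, KV-cache share, CUDA context) is an illustrative assumption, not a vLLM or llama.cpp constant.

```python
# Per-GPU memory under an even tensor-parallel split (sketch).
# overhead_gb approximates non-sharded per-GPU costs such as
# activations and CUDA context; the 2.0 default is illustrative.

def per_gpu_gb(model_gb: float, num_gpus: int, overhead_gb: float = 2.0) -> float:
    return model_gb / num_gpus + overhead_gb

# 70B Q5 (43.75 GB) across two 24 GB cards:
print(per_gpu_gb(43.75, 2))  # -> 23.875, a tight but workable fit per GPU
```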

Hybrid Approach: Combining Techniques

Best results come from combining all three:

Scenario 1: 70B on single RTX 4090 (24 GB)

- Quantize to Q4 (140 GB FP16 → 35 GB)

- Offload the ~11 GB that does not fit into VRAM to system RAM

- Result: ~8–10 tokens/sec (slow but works)

Scenario 2: 70B on 2× RTX 4090

- Quantize to Q5 (140 GB FP16 → 43.75 GB)

- Use layer splitting across 2 GPUs (22 GB each)

- Result: ~100 tokens/sec (practical)
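The two scenarios reduce to a simple decision rule: quantize first, split if total VRAM suffices, and fall back to offloading only when it does not. A hypothetical helper sketching that rule (the function and its thresholds are illustrative, not from any inference engine):

```python
# Decision sketch for fitting a quantized model (hypothetical helper).

def pick_strategy(quantized_gb: float, vram_per_gpu_gb: float, num_gpus: int) -> str:
    if quantized_gb <= vram_per_gpu_gb:
        return "single GPU, no tricks needed"
    if quantized_gb <= vram_per_gpu_gb * num_gpus:
        return "layer splitting across GPUs (fast)"
    return "partial CPU offloading (slow)"

print(pick_strategy(35, 24, 1))     # Scenario 1: 70B Q4, one 4090
print(pick_strategy(43.75, 24, 2))  # Scenario 2: 70B Q5, two 4090s
```

Scenario 1 lands on offloading and Scenario 2 on layer splitting, matching the results above.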

Performance Trade-offs

Each technique comes with speed penalties:

| Technique | VRAM Saved | Speed Impact | Quality Impact |
|---|---|---|---|
| Quantization (Q4) | 50% | None (±5%) | Minor |
| Offloading (CPU RAM) | 60–80% | 5–10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5–10% slower | None |
| Quantization + Offloading | 75–90% | 3–5× slower | Minor |

Common Mistakes With Advanced Techniques

  • Expecting offloading to be fast. System RAM bandwidth is roughly 10× lower than GPU VRAM, and weights must also cross the PCIe bus. Offloading makes interactive inference impractical.
  • Assuming layer splitting doubles speed. It does not. Two GPUs running one model = ~90% of one GPU speed (overhead from GPU communication).
  • Quantizing below Q4 for chat. Q3 and Q2 cause noticeable quality loss. Acceptable only for lightweight tasks.
  • Not measuring actual VRAM usage. Use `nvidia-smi` to verify real VRAM consumption before committing to quantization levels.
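For the last point, VRAM usage can also be read programmatically via nvidia-smi's CSV query mode (`--query-gpu` and `--format` are standard nvidia-smi options). The parsing is split out here so it can be exercised without a GPU:

```python
import subprocess

def parse_vram_csv(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used MiB, total MiB' rows from nvidia-smi noheader CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.split()[0]) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_vram() -> list[tuple[int, int]]:
    """One (used_mib, total_mib) tuple per installed GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_csv(out)

# nvidia-smi emits rows like this in noheader CSV mode:
print(parse_vram_csv("18123 MiB, 24576 MiB"))  # [(18123, 24576)]
```

Comparing the `used` figure against the estimates from the quantization tables above shows whether a chosen quantization level actually fits before committing to it.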

Sources

  • vLLM Documentation β€” docs.vllm.ai
  • llama.cpp Multi-GPU β€” github.com/ggerganov/llama.cpp#multi-gpu-inference
  • GPTQ Quantization Paper β€” arxiv.org/abs/2210.17323

Compare your local LLM side by side with 25+ cloud models in PromptQuorum.
