Hardware & Performance

Quantization, Offloading, and Layer Splitting: Advanced VRAM Reduction

12 min read · Hans Kuepper · Founder of PromptQuorum, a multi-model AI orchestration tool

When your GPU VRAM is insufficient for a model, three advanced techniques can help: quantization (reducing bit precision), offloading (spilling to system RAM), and layer splitting (distributing across multiple GPUs). As of April 2026, these techniques are mature and can fit 70B models into 24 GB VRAM.

Key Takeaways

  • Quantization: FP16 → Q8 → Q5 → Q4. FP16 → Q8 halves the file size; each further step saves roughly another 20–40%, cutting VRAM proportionally. Quality impact: negligible at Q5, minor at Q4.
  • Offloading: move model layers to system RAM when VRAM is full. Speed penalty: 5–10× slower (system RAM bandwidth is on the order of 100–200 GB/s versus ~1,000–2,000 GB/s for GPU VRAM).
  • Layer splitting: distribute the model across 2+ GPUs. Example: a 70B model on 2× RTX 4090 runs at ~100 tokens/sec.
  • As of April 2026, these techniques compose: Q4 quantization + offloading fits 70B on 12 GB of VRAM (very slow), and Q5 + layer splitting fits 70B on 2× 24 GB GPUs (usable).

Quantization: The Primary VRAM Reducer

Quantization reduces the precision of model weights from floating-point 16-bit (FP16) to lower bits (Q8, Q5, Q4, Q3).

| Method | File Size (7B, ≈ weight VRAM) | Quality | Speed |
| --- | --- | --- | --- |
| No quantization (FP16) | 14 GB | 100% | Baseline |
| Dynamic Q8 | 7 GB | 99% | Baseline |
| Static Q5 | 4.4 GB | 95% | Baseline |
| AWQ (activation-aware weight quantization) | 3.5 GB | 98% | Baseline |
| GPTQ (post-training quantization) | 3.5 GB | 97% | 95% of baseline |
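The sizes in the table above follow almost directly from bits per weight. A back-of-the-envelope sketch (the bits-per-weight values are nominal; real quantized files carry some extra metadata overhead):

```python
# Nominal bits per weight for each scheme -- illustrative, not exact.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4}

def file_size_gb(params_billions: float, scheme: str) -> float:
    """Approximate on-disk size in GB: parameters * bits-per-weight / 8."""
    bits = BITS_PER_WEIGHT[scheme]
    return params_billions * 1e9 * bits / 8 / 1e9

print(round(file_size_gb(7, "FP16"), 1))  # 14.0 GB, matching the table
print(round(file_size_gb(7, "Q5"), 1))    # ~4.4 GB
```

VRAM needed for the weights is roughly the file size; the runtime adds KV cache and activation buffers on top.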

Offloading: CPU RAM as Spillover

When VRAM is full, models can offload (move) layers to system RAM. Offloading trades speed for capacity.

Scenario: running a 70B Q4 model (~35 GB) on an RTX 4090 (24 GB). With roughly a third of the weights offloaded to system RAM, expect ~5–10 tokens/sec.

Offloading is a last resort: it makes interactive inference impractically slow. Use it only for offline batch processing or experimentation.

```bash
# Ollama: num_gpu=0 forces all layers onto the CPU (full offload);
# Ollama also offloads automatically when a model does not fit in VRAM
export OLLAMA_NUM_GPU=0
ollama run llama3.1:70b

# vLLM: partial CPU offload
vllm serve meta-llama/Llama-2-70b-hf \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 10  # offload 10 GB of weights to system RAM
```
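The 70B-on-a-4090 scenario above can be estimated with simple arithmetic. A minimal sketch, assuming a hypothetical ~2 GB VRAM reserve for KV cache and runtime overhead:

```python
def offload_plan(model_gb: float, vram_gb: float, overhead_gb: float = 2.0):
    """Estimate how much of a model spills to system RAM.

    overhead_gb is an assumed reserve for KV cache and runtime buffers.
    Returns (GB on GPU, GB in RAM, fraction offloaded).
    """
    on_gpu = max(min(model_gb, vram_gb - overhead_gb), 0.0)
    to_ram = model_gb - on_gpu
    return on_gpu, to_ram, to_ram / model_gb

on_gpu, to_ram, frac = offload_plan(model_gb=35, vram_gb=24)
print(f"{on_gpu} GB on GPU, {to_ram} GB in RAM ({frac:.0%} offloaded)")
```

For the 35 GB / 24 GB case this yields about 13 GB (roughly a third of the weights) in system RAM, which is why throughput drops to single-digit tokens per second.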

Layer Splitting: Distribute Across Multiple GPUs

Modern inference engines (vLLM, llama.cpp) can split a model across multiple GPUs automatically.

Example: 70B model with 2× RTX 4090:

- Without splitting: not possible; a Q4 build needs ~40 GB of VRAM on a single GPU.

- With splitting: half of the weights on each GPU. Inference speed: ~100 tokens/sec (communication overhead is minimal).

Layer splitting is practical for production deployments and is transparent to the user.

```bash
# vLLM: automatic tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 2  # split across 2 GPUs

# Ollama (built on llama.cpp): detects multiple GPUs and splits automatically
ollama run llama3.1:70b
```
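A quick way to sanity-check whether a model fits under splitting: shard the weights evenly and leave per-GPU headroom for KV cache and activations. The 2 GB headroom below is an illustrative assumption, not a measured figure:

```python
def fits_when_split(model_gb: float, n_gpus: int, gpu_vram_gb: float,
                    headroom_gb: float = 2.0):
    """Check whether evenly sharded weights fit, with assumed per-GPU headroom.

    Returns (fits, required GB per GPU).
    """
    per_gpu = model_gb / n_gpus + headroom_gb
    return per_gpu <= gpu_vram_gb, per_gpu

# 70B Q4 (~40 GB) on 2x RTX 4090 (24 GB each)
print(fits_when_split(40, 2, 24))  # fits: 20 GB of weights + 2 GB headroom
# The same model on a single 4090 does not fit
print(fits_when_split(40, 1, 24))
```

This also shows why splitting enables larger models rather than speeding up small ones: the benefit is the pooled VRAM, while communication between GPUs adds a modest cost.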

Hybrid Approach: Combining Techniques

Best results come from combining all three:

Scenario 1: 70B on a single RTX 4090 (24 GB)

- Quantize to Q4 (≈140 GB at FP16 → ~35 GB)

- Offload the ~13 GB that does not fit to system RAM

- Result: ~5–10 tokens/sec (slow, but it runs)

Scenario 2: 70B on 2× RTX 4090

- Quantize to Q5 (43.75 GB)

- Use layer splitting across 2 GPUs (22 GB each)

- Result: ~100 tokens/sec (practical)
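The decision logic behind these two scenarios can be sketched as a simple strategy picker. The thresholds (2 GB of headroom per GPU) are illustrative assumptions:

```python
def choose_strategy(model_gb: float, gpus: list[float]) -> str:
    """Pick the least-compromised setup for a quantized model.

    model_gb: quantized model size; gpus: per-GPU VRAM in GB.
    The 2 GB per-GPU headroom is an assumed allowance for KV cache.
    """
    if model_gb <= max(gpus) - 2:
        return "single GPU, no offload"
    if model_gb <= sum(gpus) - 2 * len(gpus):
        return "layer split across GPUs"
    return "quantize further + offload to system RAM (slow)"

print(choose_strategy(35, [24]))      # Scenario 1: must offload
print(choose_strategy(40, [24, 24]))  # Scenario 2: layer splitting suffices
```

The ordering matters: splitting is preferred over offloading because its ~5–10% overhead is far cheaper than the 5–10× penalty of going through system RAM.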

Performance Trade-offs

Each technique comes with speed penalties:

| Technique | VRAM Saved | Speed Impact | Quality Impact |
| --- | --- | --- | --- |
| Quantization (Q4) | 50% | Negligible (±5%) | Minor |
| Offloading (CPU RAM) | 60–80% | 5–10× slower | None |
| Layer splitting (2 GPUs) | N/A (enables larger models) | 5–10% slower | None |
| Quantization + offloading | 75–90% | 3–5× slower | Minor |

Common Mistakes With Advanced Techniques

  • Expecting offloading to be fast. System RAM has roughly a tenth of the bandwidth of GPU VRAM, and weights must also cross the PCIe bus. Offloading makes interactive inference impractically slow.
  • Assuming layer splitting doubles speed. It does not: inter-GPU communication costs roughly 10%, so splitting is about fitting larger models, not multiplying throughput.
  • Quantizing below Q4 for chat. Q3 and Q2 cause noticeable quality loss. Acceptable only for lightweight tasks.
  • Not measuring actual VRAM usage. Use `nvidia-smi` to verify real VRAM consumption before committing to quantization levels.
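On the last point, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` prints VRAM figures in a machine-readable form. A small parser sketch (the sample line below is made up for illustration):

```python
def parse_vram_csv(line: str) -> tuple[int, int]:
    """Parse one line of nvidia-smi CSV output, e.g. '18432 MiB, 24564 MiB'.

    Returns (used MiB, total MiB).
    """
    used, total = (int(field.strip().split()[0]) for field in line.split(","))
    return used, total

used, total = parse_vram_csv("18432 MiB, 24564 MiB")  # sample line, not real output
print(f"{used}/{total} MiB ({used / total:.0%} of VRAM in use)")
```

Run the check while the model is actually serving requests: KV cache grows with context length, so idle measurements understate real consumption.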

Sources

  • vLLM Documentation — docs.vllm.ai
  • llama.cpp Multi-GPU — github.com/ggerganov/llama.cpp#multi-gpu-inference
  • GPTQ Quantization Paper — arxiv.org/abs/2210.17323
