How to Double Local LLM Speed: Optimization Techniques

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Local LLMs can run 2-3× faster with proper optimization. The main techniques are disabling logging, tuning batch size, choosing quantization deliberately, using a faster inference engine, and raising GPU memory utilization. As of April 2026, combining these techniques can achieve a 2× speed improvement with no quality loss.

Key Takeaways

  • Disable logging/debugging (easy): ~10% speed gain.
  • Use Q4 quantization (easy): Same speed, smaller VRAM.
  • Optimize batch size (medium): 2-3× speed for batch processing.
  • Use vLLM instead of Ollama (hard): 2-5× speed for concurrent requests.
  • GPU memory utilization 90%+ (medium): 15-20% speed gain.
  • Combining all techniques: ~2-3× total speedup.

How Does GPU Memory Utilization Affect Speed?

Many tools use only 70–80% of GPU VRAM by default, leaving the rest idle (vLLM's own default is 0.9). Raising utilization to 90–95% can improve speed by 15–20% because the engine can pre-allocate more KV cache:

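```bash
# vLLM: increase GPU memory utilization
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.95

# Ollama: environment variable, set before starting the server.
# NOTE: OLLAMA_GPU_THRESHOLD may not be recognized by your Ollama version;
# `ollama serve --help` lists the environment variables the server reads.
export OLLAMA_GPU_THRESHOLD=0.95  # Use 95% of GPU
ollama run llama3.2:3b

# LM Studio: Settings → GPU acceleration slider (move to 100%)
```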
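To see why a higher utilization setting leaves more room for KV cache, here is a rough back-of-the-envelope sketch. It assumes a 24 GB card (such as the RTX 4090) and about 13 GiB of FP16 weights for a 7B model; both figures and the helper name `kv_budget_gib` are illustrative assumptions, and the calculation ignores activations and framework overhead.

```bash
# Rough KV-cache budget in GiB: (VRAM * utilization) - model weights.
# Simplification: activations and framework overhead are ignored,
# so treat the result as an upper bound, not a measurement.
kv_budget_gib() {
  # usage: kv_budget_gib VRAM_GIB UTILIZATION WEIGHTS_GIB
  awk -v vram="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", vram * util - weights }'
}

kv_budget_gib 24 0.75 13   # prints 5.0
kv_budget_gib 24 0.95 13   # prints 9.8
```

Under these assumptions, moving from 75% to 95% roughly doubles the memory available for KV cache, which is what lets the engine hold more or longer sequences at once.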

What Batch Size Maximizes Throughput?

For batch processing (multiple prompts), increasing batch size from 1 to 32 yields a 2–4× throughput improvement. A single request leaves most of the GPU pipeline idle; batching fills it.

Trade-off: Higher latency per individual request, because each request waits for the batch to complete.

| Batch Size | Throughput | Latency/Request | Use Case |
|---|---|---|---|
| 1 (single) | 50 tokens/sec | Minimum | Real-time chat |
| 8 | 120 tokens/sec | Acceptable | Light concurrency |
| 32 | 200 tokens/sec | High | Batch API |
| 64+ | 250+ tokens/sec | Very high | Offline batch |
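The throughput and latency trade-off in the table can be made concrete with a little arithmetic. The sketch below assumes the throughput figures are aggregate numbers shared equally by all requests in a batch and that every response is 500 tokens long; the helper name `per_request` and the response length are illustrative assumptions.

```bash
# Per-request generation rate and response time when the aggregate
# throughput is split evenly across all requests in a batch.
per_request() {
  # usage: per_request AGGREGATE_TOK_PER_SEC BATCH_SIZE RESPONSE_TOKENS
  awk -v t="$1" -v b="$2" -v n="$3" \
    'BEGIN { r = t / b; printf "%.2f tok/sec per request, %.0f s per response\n", r, n / r }'
}

per_request 50 1 500     # prints: 50.00 tok/sec per request, 10 s per response
per_request 200 32 500   # prints: 6.25 tok/sec per request, 80 s per response
```

With these numbers, each user in a batch of 32 waits 80 seconds instead of 10, but all 32 responses finish in 80 seconds rather than the 320 seconds they would take one after another.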

Which Inference Engine Is Fastest: vLLM vs Ollama vs llama.cpp?

vLLM: 5–10× faster than Ollama for concurrent requests. Use it for production APIs serving multiple users.

llama.cpp: Fastest for single requests on consumer hardware. Use it for personal local setups.

Ollama: Best developer experience for single-user setups; comparable to llama.cpp for single requests.

Text-Generation-WebUI: Slowest, but has the most features. Use it for experimentation only, not production.

Does Quantization Actually Speed Up Inference?

On modern GPUs (RTX 40-series), Q4 and Q5 run at roughly the same speed as FP16, so quantize for VRAM reduction, not speed.

Indirect speed benefits of quantization:

- Smaller model file = faster cold-start loading from disk

- Less data moved through memory per token = slightly faster (10–15%) on older or memory-constrained hardware

Quantization is primarily for VRAM reduction, not raw token throughput.

How Much Speed Can You Realistically Gain?

Example: Optimizing a 7B model on an RTX 4090, step by step:

| Change | Speed | Cumulative Gain |
|---|---|---|
| Default Ollama (baseline) | 120 tok/sec | n/a |
| Disable debug logging | 132 tok/sec | +10% |
| GPU memory → 95% | 150 tok/sec | +25% total |
| Switch to vLLM (batch) | 300 tok/sec (batch) | 2.5× (batch) |
| All optimizations combined | 300 tok/sec | 2.5× throughput |
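Figures like these are only meaningful if you measure your own baseline before and after each change. A minimal sketch for Ollama follows: its `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tokens per second can be derived. The helper names are hypothetical, and `bench_ollama` assumes a local server on the default port plus `curl` and `jq` installed.

```bash
# Convert Ollama's eval_count (tokens) and eval_duration (nanoseconds) to tok/sec.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Send one non-streaming request and report the generation speed.
# Assumption: Ollama is listening on localhost:11434 and the model is already pulled.
bench_ollama() {
  local model=$1 prompt=$2 payload resp
  payload=$(jq -n --arg m "$model" --arg p "$prompt" '{model: $m, prompt: $p, stream: false}')
  resp=$(curl -s http://localhost:11434/api/generate -d "$payload")
  tokens_per_sec "$(jq -r '.eval_count' <<<"$resp")" "$(jq -r '.eval_duration' <<<"$resp")"
}

tokens_per_sec 240 2000000000   # prints 120.0
```

For example, `bench_ollama llama3.2:3b "Explain KV cache in one paragraph"` run a few times before and after a change gives a comparable number; run it after a warm-up request so model loading time is not included.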

Common Speed Optimization Mistakes

  • Pushing GPU memory to 100%. Risks out-of-memory crashes. Safe max is 90-95%.
  • Lowering batch size for speed. A smaller batch does not make a single request noticeably faster; batch size is a throughput lever.
  • Over-quantizing for speed. Q4 is roughly the same speed as FP16. Quantize for VRAM, not speed.
  • Changing inference engine mid-deployment. Switching Ollama → vLLM → llama.cpp introduces bugs. Pick one, optimize it.

Frequently Asked Questions

What is the single most effective way to speed up local LLM inference?

Switching from Ollama to vLLM for concurrent requests provides the largest single speedup: a 5–10× throughput improvement for batch processing. For single requests, increasing GPU memory utilization from 70% to 90–95% yields a 15–20% speed gain. Disabling debug logging adds roughly another 10%.

Does batch processing improve single-request latency?

No. Batch size affects throughput (total tokens per second across all requests), not the speed of a single request; larger batches actually increase per-request wait time. To reduce latency on one request, optimize GPU memory utilization and use a faster engine (vLLM or llama.cpp).

How much faster is vLLM than Ollama?

For single requests, vLLM and Ollama perform similarly (both reach ~120–150 tok/sec on an RTX 4090 with a 7B model). For concurrent requests, vLLM is 5–10× faster due to continuous batching and PagedAttention. Use Ollama for personal/single-user setups; switch to vLLM for APIs serving multiple users.

Does quantization speed up inference?

Quantization's primary benefit is VRAM reduction, not speed. On modern NVIDIA GPUs (RTX 40-series), Q4 and Q5 run at the same speed as FP16. The indirect speed benefit: a smaller Q4 model loads faster from disk and may allow slightly larger batch sizes within the same VRAM.

What GPU memory utilization should I set for maximum speed?

Set GPU memory utilization to 90–95% in vLLM (`--gpu-memory-utilization 0.92`). This allows the engine to pre-allocate more memory for KV cache, improving throughput. Avoid 100%: it causes OOM crashes when actual memory use during generation exceeds the engine's estimate. Keep the 5–10% safety margin.

Why is my local LLM slow on the first prompt?

The first prompt loads the model into VRAM (cold start), which can take 10–30 seconds. Subsequent prompts run at full speed. Keep the server running (do not restart between sessions). With Ollama, set OLLAMA_KEEP_ALIVE=24h to prevent model unloading after inactivity.

Can CPU-only inference be sped up meaningfully?

Limited gains are possible: use llama.cpp with -t flag to set thread count to physical core count (not logical), enable AVX2/AVX-512 instruction sets, and use Q4_K_M quantization. Realistic ceiling: 8–12 tok/sec on a modern i9. For interactive chat, GPU hardware is the only path to acceptable latency.
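One way to find the physical core count for the `-t` flag on Linux is sketched below. It assumes `lscpu` from util-linux is available; the helper name `count_physical_cores` and the model path in the usage note are hypothetical.

```bash
# Count physical cores from the output of `lscpu -p=Core,Socket`.
# Each unique "core,socket" pair is one physical core; hyperthreads
# repeat the same pair and are collapsed by `sort -u`.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Typical use (Linux only, not run here):
#   threads=$(lscpu -p=Core,Socket | count_physical_cores)
#   llama-cli -m ./model.gguf -t "$threads" -p "Hello"
```

Reading the `lscpu` output from standard input keeps the helper easy to check against a known sample before using it in a launch script.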

How does context length affect inference speed?

Longer context windows slow inference because attention cost scales quadratically with context length, while the rest of the model scales linearly. A 4K-token prompt is roughly 4× slower to process than a 1K-token prompt, and the gap widens as attention starts to dominate. Keep system prompts under 500 tokens and use context summarization for long conversations to maintain speed.
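The two scaling regimes can be compared with a short calculation. This is a simplified sketch, not a performance model: it only prints the cost ratio under purely linear and purely quadratic scaling, and real prompt-processing time falls somewhere between the two depending on the model and context length. The helper name `prefill_cost_ratio` is hypothetical.

```bash
# Relative prompt-processing cost of a long prompt versus a short one
# under two idealized assumptions: linear and quadratic scaling.
prefill_cost_ratio() {
  # usage: prefill_cost_ratio LONG_TOKENS SHORT_TOKENS
  awk -v a="$1" -v b="$2" \
    'BEGIN { r = a / b; printf "linear=%.0fx quadratic=%.0fx\n", r, r * r }'
}

prefill_cost_ratio 4096 1024   # prints: linear=4x quadratic=16x
```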

What is PagedAttention and why does it speed up vLLM?

PagedAttention is vLLM's KV cache management system. Instead of pre-allocating a fixed memory block per request, it pages memory dynamically, like virtual memory in an OS. This eliminates VRAM fragmentation, allows more concurrent requests, and improves GPU utilization from ~55% (naive) to 90%+.
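The memory saving can be illustrated with a toy calculation. This is not vLLM's implementation: it only compares how much reserved KV cache is actually used when every request gets a fixed block sized for the maximum length versus small pages allocated on demand. The 16-token page size, the 2048-token maximum, and the sequence lengths are assumptions chosen for illustration.

```bash
# Share of reserved KV-cache slots that hold real tokens under two schemes:
#   fixed: every request reserves MAX_LEN token slots up front
#   paged: every request reserves whole pages of BLOCK_SIZE slots as it grows
kv_utilization() {
  # usage: kv_utilization MAX_LEN BLOCK_SIZE SEQ_LEN...
  local max_len=$1 block=$2
  shift 2
  printf '%s\n' "$@" | awk -v max="$max_len" -v b="$block" '
    { used += $1; fixed += max; paged += int(($1 + b - 1) / b) * b }
    END { printf "fixed=%.1f%% paged=%.1f%%\n", 100 * used / fixed, 100 * used / paged }'
}

kv_utilization 2048 16 100 500 1200   # prints: fixed=29.3% paged=98.7%
```

In this toy case the paged scheme wastes only the unused tail of each request's last page, which is why the same VRAM can hold many more concurrent requests.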

Is there a speed difference between GGUF and safetensors model formats?

Yes. GGUF (used by llama.cpp and Ollama) is optimized for CPU/consumer GPU inference with built-in quantization. Safetensors (used by vLLM and HuggingFace) is faster for full-precision GPU inference. For RTX 40-series GPUs running FP16, safetensors + vLLM typically outperforms GGUF + Ollama by 10–20%.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
