Fix Local LLM Errors in 2026: 10 Common Problems in Ollama, LM Studio, and vLLM

9 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

The most common errors in local LLMs are out-of-memory crashes, GPU not detected, extremely slow CPU inference, connection refused from the API, and garbled output. As of April 2026, there are solutions for all 10 errors -- most require only one or two terminal commands. This guide covers Ollama (port 11434), LM Studio (port 1234), and vLLM with exact commands for each error.

Key Takeaways

  • Out of Memory: Switch to smaller quantization (Q4_K_M → Q3_K_S) or smaller model.
  • GPU not detected on NVIDIA: Update driver to 525+ on Linux, 452+ on Windows. Run `nvidia-smi` to confirm.
  • Extremely slow inference: You are probably running on CPU only. Confirm with `ollama ps`, fix GPU detection first, and if needed set the number of offloaded layers with the `num_gpu` parameter in a Modelfile.
  • Connection refused: Ollama is not running. Start it with `ollama serve` or restart the service.
  • Garbled output: wrong prompt template. Use the Instruct variant of the model, not the base variant.

Figure: the 10 most common local LLM errors with symptoms and fixes, a quick reference for Ollama, LM Studio, and vLLM setups (April 2026).

Error 1: "Out of Memory" / Out-of-Memory Crash

Out-of-Memory errors mean the model needs more RAM than available -- not a hardware failure. This is the most common error for first-time users. See LLM Quantization Explained for background on how quantization reduces RAM requirements.

  • Check available RAM: Run `free -h` on Linux or `top -l 1 | grep "PhysMem"` on macOS, or open Task Manager → Performance → Memory on Windows.
  • Switch to smaller quantization: Replace `Q8_0` or `Q5_K_M` with `Q4_K_M`. For Ollama: `ollama run llama3.2:3b-instruct-q4_K_M`.
  • Close background applications before loading the model -- browsers and other apps consume RAM, reducing what the model has available.
  • Switch to smaller model: If 8B fails on 8 GB RAM, try `llama3.2:3b` (requires only ~2.5 GB).

Figure: local LLM RAM requirements by model size. llama3.2 1B–3B fits in 8 GB, 7B–8B models need 16 GB, 70B models need 64 GB at Q4_K_M quantization.

Check Available RAM on Linux / macOS

bash
# Linux
free -h

# macOS
vm_stat | grep "Pages free"

# More readable on macOS
top -l 1 | grep "PhysMem"
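
The RAM figures in this section can be turned into a quick pre-flight check. The following is a minimal sketch, not part of Ollama: the function names `ram_tier_gb`, `total_ram_gb`, and `fits_in_ram` are hypothetical, and the tiers are only the ones quoted in this guide for Q4_K_M (1B–3B in 8 GB, 7B–8B in 16 GB, 70B in 64 GB). Sizes the guide gives no figure for print `unknown` rather than a guess.

```bash
#!/usr/bin/env bash

# Map a model size (whole billions of parameters) to the RAM tier quoted
# in this guide for Q4_K_M. Anything else prints "unknown".
ram_tier_gb() {
  local size_b="$1"
  if   [ "$size_b" -ge 1 ] && [ "$size_b" -le 3 ]; then echo 8
  elif [ "$size_b" -ge 7 ] && [ "$size_b" -le 8 ]; then echo 16
  elif [ "$size_b" -eq 70 ];                        then echo 64
  else echo unknown
  fi
}

# Total physical RAM in GB, rounded up on Linux (the kernel reports
# slightly less than the installed amount); macOS reads hw.memsize.
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { printf "%d\n", ($2 / 1048576) + 0.999 }' /proc/meminfo
  else
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
  fi
}

# fits_in_ram SIZE_B [RAM_GB]
# Exit status: 0 = tier fits, 1 = too little RAM, 2 = no tier known.
fits_in_ram() {
  local need have
  need="$(ram_tier_gb "$1")"
  [ "$need" = unknown ] && return 2
  have="${2:-$(total_ram_gb)}"
  [ "$have" -ge "$need" ]
}
```

For example, `fits_in_ram 8 || echo "try llama3.2:3b instead"` checks the current machine, while `fits_in_ram 8 16` checks a hypothetical 16 GB machine. The tiers refer to total system RAM, so free memory still matters once other applications are running.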

Error 2: GPU Not Being Used (Running on CPU Only)

GPU not being used means the LLM runs 5–10× slower than expected -- check driver installation before anything else. Verify that your GPU is visible to the system:

bash
# NVIDIA: should show GPU name and driver version
nvidia-smi

# AMD on Linux
rocm-smi

# macOS: check if Metal is available
system_profiler SPDisplaysDataType | grep "Metal"

Figure: CPU-only vs GPU-active. Ollama on CPU gives 2–8 tok/s; GPU mode gives 30–120 tok/s. Check with `ollama ps` or `nvidia-smi`.
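
The three visibility checks can be wrapped in one function that reports which GPU stack the system exposes. This is a sketch under stated assumptions: `detect_gpu_backend` is a hypothetical helper name, and a result of `none` only means none of the three tools responded, not that no GPU is installed.

```bash
#!/usr/bin/env bash

# Print nvidia, amd, metal, or none depending on which vendor tool
# is installed AND runs successfully.
detect_gpu_backend() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo nvidia
  elif command -v rocm-smi >/dev/null 2>&1 && rocm-smi >/dev/null 2>&1; then
    echo amd
  elif [ "$(uname -s)" = Darwin ] &&
       system_profiler SPDisplaysDataType 2>/dev/null | grep -q "Metal"; then
    echo metal
  else
    echo none
  fi
}
```

Running `detect_gpu_backend` before starting Ollama gives a single word to act on: if it prints `none`, fix the driver installation first, because no Ollama setting can use a GPU the operating system cannot see.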

How Do You Enable GPU in Ollama?

  • NVIDIA on Linux: Install NVIDIA driver 525+ and CUDA Toolkit 11.3+. Ollama auto-detects CUDA on restart.
  • NVIDIA on Windows: Ensure driver version is 452.39 or higher. Ollama automatically installs CUDA support via the Windows installer.
  • AMD on Linux: Install ROCm 5.7+. If detection fails, set `HSA_OVERRIDE_GFX_VERSION=10.3.0` for RX 6000-series cards or `11.0.0` for RX 7000-series cards.
  • Apple Silicon: Ollama uses Metal by default -- no configuration needed. Confirm with `ollama ps` after loading a model; the PROCESSOR column should read `100% GPU`.

Error 3: Inference Is Extremely Slow (Under 5 Tokens/Second)

Under 5 tokens/second means the model is running on CPU only or the model is too large for available VRAM. A 7B model on GPU generates 30–80 tok/s; the same model on CPU generates 3–10 tok/s.

  • Confirm whether GPU is active: Run `ollama ps` while a model is loaded. The PROCESSOR column shows how the model is split between CPU and GPU, for example `100% GPU`.
  • Reduce model size: a 13B model on CPU generates 3–6 tok/s. Switching to 7B roughly doubles the speed; switching to 3B roughly quadruples it.
  • Increase GPU layers in Ollama: Set `PARAMETER num_gpu 999` in a Modelfile to request that all layers go to the GPU. If the model then fails to load, it does not fit in VRAM: lower the value or pick a smaller model.
  • Use faster quantization: Q4_K_M is usually the fastest quantization that maintains acceptable quality. Q8_0 is higher quality but ~30% slower.
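
The `ollama ps` check can be scripted. The sketch below assumes the PROCESSOR column uses the formats `100% GPU`, `100% CPU`, or a split such as `48%/52% CPU/GPU`; that format is an assumption about current Ollama releases and may change. `processor_split` and `gpu_status` are hypothetical helper names.

```bash
#!/usr/bin/env bash

# Read `ollama ps` output on stdin and print the PROCESSOR value
# of every loaded model, one per line.
processor_split() {
  grep -oE '[0-9]+%(/[0-9]+%)? (CPU/GPU|CPU|GPU)'
}

# Turn one PROCESSOR value into a single word: gpu, cpu, mixed, or unknown.
gpu_status() {
  case "$1" in
    "100% GPU") echo gpu ;;
    "100% CPU") echo cpu ;;
    *CPU/GPU)   echo mixed ;;
    *)          echo unknown ;;
  esac
}
```

A typical use is `gpu_status "$(ollama ps | processor_split | head -n 1)"`. A result of `cpu` points to a detection problem (Error 2), while `mixed` usually means the model is larger than the available VRAM.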

Set GPU Layers in Ollama

bash
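# Write a Modelfile that requests GPU offload for every layer
# ("llama3.1-gpu" is an example name, pick your own)
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_gpu 999
EOF

# Build and run the customised model
ollama create llama3.1-gpu -f Modelfile
ollama run llama3.1-gpu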

Error 4: "Connection Refused" When Calling the API

Connection Refused means Ollama is not running -- the API at `localhost:11434` only responds when the service is active. Start it before making API calls.

bash
# Start Ollama manually
ollama serve

# On Linux -- restart the systemd service (needs root)
sudo systemctl restart ollama

# Verify it is running
curl http://localhost:11434
# Expected: "Ollama is running"
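
Scripts that call the API right after starting Ollama often hit "connection refused" simply because the server is not ready yet. A minimal sketch of a readiness wait follows; `ollama_is_up` and `wait_for_ollama` are hypothetical helper names, and the one-second retry interval is an arbitrary choice.

```bash
#!/usr/bin/env bash

# Exit 0 if something answers HTTP at the given URL (default: Ollama's port).
ollama_is_up() {
  local url="${1:-http://localhost:11434}"
  curl --silent --fail --max-time 2 "$url" >/dev/null 2>&1
}

# wait_for_ollama [URL] [TRIES]
# Poll once per second; exit 0 as soon as the server answers, 1 if it never does.
wait_for_ollama() {
  local url="${1:-http://localhost:11434}" tries="${2:-10}" i=0
  while [ "$i" -lt "$tries" ]; do
    ollama_is_up "$url" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Usage: `ollama serve & wait_for_ollama || echo "Ollama did not start"`. For LM Studio, pass its port instead, for example `wait_for_ollama http://localhost:1234/v1/models`, assuming its local server is enabled.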

Error 5: "Model Not Found" Error

"Model not found" means the model name in your command does not match any downloaded model. Model names in Ollama are case-sensitive and include version tags.

bash
# List all downloaded models
ollama list

# Pull a model if it is missing
ollama pull llama3.2

# Check the exact model name -- tags matter
# "llama3.2" and "llama3.2:3b" are different entries
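
The tag rule can be checked mechanically. In the sketch below, `has_model` is a hypothetical helper that reads `ollama list` output on stdin and tests for an exact name. It assumes the first column is the model name and that a name without a tag resolves to `:latest`, which is how Ollama treats untagged names.

```bash
#!/usr/bin/env bash

# has_model NAME  (reads `ollama list` output on stdin)
# Exit 0 if NAME is downloaded, 1 otherwise. "llama3.2" is treated
# as "llama3.2:latest"; "llama3.2:3b" must match exactly.
has_model() {
  local name="$1"
  case "$name" in
    *:*) ;;                      # tag given, keep as-is
    *)   name="$name:latest" ;;  # no tag, Ollama assumes :latest
  esac
  awk -v want="$name" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}
```

Usage: `ollama list | has_model llama3.2:3b || ollama pull llama3.2:3b`. This pulls only when the exact tag is missing, which avoids re-downloading a model that is present under a different tag.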

Error 6: Corrupted Model File

Corrupted model files are caused by interrupted downloads -- delete and re-pull to fix. Ollama does not always auto-detect partial downloads.

bash
# Remove the corrupted model
ollama rm llama3.2

# Re-pull it
ollama pull llama3.2

# For LM Studio: manually delete model files
# Default location: ~/.cache/lm-studio/models/
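
For manually downloaded files, a checksum comparison can confirm whether a download was truncated before you delete anything. This is a sketch under assumptions: `file_sha256` and `sha256_matches` are hypothetical helpers, and the expected SHA-256 value has to come from the place you downloaded the file from (this guide does not supply one).

```bash
#!/usr/bin/env bash

# Print the SHA-256 of a file (sha256sum on Linux, shasum on macOS).
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# sha256_matches FILE EXPECTED_HEX
# Exit 0 only if the file's hash equals the expected lowercase hex string.
sha256_matches() {
  local actual
  actual="$(file_sha256 "$1")"
  [ -n "$actual" ] && [ "$actual" = "$2" ]
}
```

Usage: `sha256_matches model.gguf <expected-hash> || echo "re-download needed"`, where `model.gguf` and the hash are placeholders for your own file and its published checksum.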

Error 6b: "Failed to Resolve Model" in LM Studio

"Failed to resolve model lmstudio-community/..." means LM Studio cannot find the model in its registry. This typically happens when a model is downloaded from `lmstudio-community` on Hugging Face but the registry reference has changed. LM Studio is using a cached registry entry that no longer matches available model files.

  • Open LM Studio → My Models tab → click the three-dot menu on the failed model → select "Delete model" (keeps the file, removes the registry entry)
  • Search for the same model in the model browser and re-download it -- LM Studio will re-register it
  • Alternative: quit LM Studio, navigate to `~/.cache/lm-studio/models/`, delete the specific model folder, then re-download

bash
# Manually clear LM Studio model cache (macOS/Linux)
rm -rf ~/.cache/lm-studio/models/lmstudio-community/<model-name>

Error 7: CUDA / ROCm Initialization Errors

CUDA and ROCm errors mean driver/library version mismatch -- update your driver to the required minimum version.

  • "CUDA driver version insufficient": Update NVIDIA driver. The minimum for llama.cpp is CUDA 11.3 / driver 450.80.
  • "No kernel image available for execution": Your GPU architecture is unsupported. GTX 900-series (Maxwell) and older are not supported by recent CUDA builds.
  • AMD ROCm "HSA_STATUS_ERROR_INVALID_ISA": Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` (for RX 6000) or `11.0.0` (for RX 7000) before starting Ollama.
  • Check CUDA version: Run `nvcc --version` or `nvidia-smi | grep CUDA`.
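
The driver minimum can be compared against the installed version in a script. The sketch below uses the 450.80 figure from this section; `version_ge` and `nvidia_driver_version` are hypothetical helper names, and the comparison relies on `sort -V` (version sort), which is available in GNU coreutils.

```bash
#!/usr/bin/env bash

# version_ge A B: exit 0 if version A is greater than or equal to version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the installed NVIDIA driver version (first GPU only).
nvidia_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Example check against the minimum quoted above; commented out so that
# sourcing this file has no side effects.
# version_ge "$(nvidia_driver_version)" 450.80 || echo "driver too old"
```

The same `version_ge` helper works for any dotted version string, so it can also gate on the 525 and 452.39 minimums mentioned earlier for Linux and Windows.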

Error 8: Garbled, Repetitive, or Nonsensical Output

Garbled output almost always means you are using a base model instead of an instruct/chat variant. Base models generate raw text completions, not answers to questions.

Base models are not fine-tuned for conversation; when prompted with a question, they continue the text instead of answering it, which can look like gibberish. In the Ollama library, base variants usually carry `text` in the tag and instruct variants carry `instruct` (for example `llama3.1:8b-instruct-q4_K_M`). Always use the instruct variant. See How to Install LM Studio for a GUI-based method to switch model variants.

In Ollama, the default tag for most models already points to the instruct variant. If you manually downloaded from Hugging Face, confirm the filename includes "Instruct" or "chat".

Error 9: "Address Already in Use" -- Port Conflict

"Address already in use" means another process is occupying port 11434 (Ollama) or 1234 (LM Studio). Find and kill the conflicting process.

bash
# Find what is using port 11434 (Ollama)
lsof -i :11434

# Stop it by PID (add -9 only if it ignores the normal signal)
kill <PID>

# Or move Ollama to another port (127.0.0.1 keeps it local-only)
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
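
`lsof -i :11434` lists clients connected to the port as well as the process that owns it, so killing the first PID in the list can hit the wrong program. The sketch below filters for the listener only; `listener_pids` is a hypothetical helper that assumes the usual `lsof` layout with the PID in the second column and `(LISTEN)` at the end of the line.

```bash
#!/usr/bin/env bash

# Read `lsof -i :PORT` output on stdin and print only the PIDs of
# processes that are LISTENING on the port (not connected clients).
listener_pids() {
  awk 'NR > 1 && /\(LISTEN\)/ { print $2 }' | sort -u
}
```

Usage: `lsof -i :11434 | listener_pids` prints the PID to stop. If the listener turns out to be an earlier Ollama instance, restarting the service is usually cleaner than killing it.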

Error 10: Model Stops Generating Mid-Response

Stopping mid-response is caused by reaching context length limits or `num_predict` set too low. The default `num_predict` in many configurations is 128 tokens -- just enough for 1–2 sentences.

  • Increase num_predict: This parameter sets the maximum number of tokens to generate. In Ollama, add `PARAMETER num_predict 2048` to the Modelfile.
  • Check context window: If your conversation is very long, the model may hit its context limit. Start a new session or use a model with a larger context window (Llama 3.2 3B supports 128K).
  • Check stop tokens: Some Modelfiles include stop sequences that terminate generation early. Check the system prompt and template for unexpected stop patterns.
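
The `num_predict` change can be scripted by generating the Modelfile. This is a minimal sketch: `make_modelfile` is a hypothetical helper, and the model name `llama3.2-long` in the usage line is an example, not an existing tag.

```bash
#!/usr/bin/env bash

# make_modelfile BASE_MODEL [MAX_TOKENS]
# Print a two-line Modelfile that raises the generation limit.
make_modelfile() {
  local base="$1" max_tokens="${2:-2048}"
  printf 'FROM %s\nPARAMETER num_predict %s\n' "$base" "$max_tokens"
}
```

Usage: `make_modelfile llama3.2:3b 2048 > Modelfile && ollama create llama3.2-long -f Modelfile`, then run `ollama run llama3.2-long`. If responses still stop early after this change, the context window or a stop sequence is the more likely cause.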

Where to Find More Help

For hardware-specific issues on laptops (thermal throttling, battery drain), see How to Run Local LLMs on a Laptop. For security and privacy configuration questions, see Local LLM Security & Privacy Checklist. The Ollama GitHub issues page (github.com/ollama/ollama/issues) and r/LocalLLaMA subreddit are the most active community resources for model-specific bugs.

Common Mistakes in Local LLM Troubleshooting

  • Confusing OOM errors with hardware failure -- the error means RAM is too small for the model, not that hardware is broken. Fix: use Q4_K_M quantization or smaller model.
  • Not checking system load -- inference speed degrades significantly when other applications consume CPU/GPU. Close your browser, video player, and background processes before benchmarking.
  • Ignoring driver version incompatibility -- NVIDIA CUDA requires specific driver versions per CUDA release. Check `nvidia-smi` output; driver version must be ≥450.80 for CUDA 11.x.
  • Using wrong model name in Ollama -- `llama3.2` and `llama3.2:3b` are different Ollama tags. Run `ollama list` to see the exact names of downloaded models.
  • Not restarting Ollama after driver update -- Ollama detects GPU at startup. After updating NVIDIA or ROCm drivers, fully restart Ollama (`ollama serve`) to re-detect GPU.

Figure: 5-step local LLM debugging process: check RAM → check GPU → check server → check model → check output quality. Stop at the first failure step.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
