
Troubleshooting Local LLM Setup: Fix the 10 Most Common Errors

9 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

The most common local LLM setup errors are: out-of-memory crashes, GPU not being detected, very slow inference on CPU, model file corruption, and connection refused errors from the API server. As of April 2026, GPU detection issues are less common due to driver improvements, but they still occur.

Key Takeaways

  • Out of memory: switch to a smaller quantization (Q4_K_M → Q3_K_S) or a smaller model.
  • GPU not detected on NVIDIA: update drivers to 525+ on Linux, 452+ on Windows. Run `nvidia-smi` to confirm.
  • Very slow inference: you are running on CPU only. Enable GPU offloading in Ollama with the `num_gpu` parameter (in a Modelfile or API request).
  • Connection refused: Ollama is not running. Start it with `ollama serve` or restart the service.
  • Garbled output: wrong prompt template. Use the instruct variant of the model, not the base variant.

Error 1: "Not Enough Memory" / Out-of-Memory Crash

The model requires more RAM than is available. This is the most common error for first-time users.

  • Check available RAM: on Linux run `free -h`, on macOS use Activity Monitor or `top -l 1 | grep PhysMem`, on Windows open Task Manager → Performance → Memory.
  • Switch to a smaller quantization: replace `Q8_0` or `Q5_K_M` with `Q4_K_M`. For Ollama: `ollama run llama3.1:8b-instruct-q4_K_M`.
  • Close background applications before loading the model: browsers and other apps consume RAM that would otherwise be available for the model.
  • Switch to a smaller model: if 8B is failing on 8 GB RAM, try `llama3.2:3b` (requires only ~2.5 GB).
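Before downloading, you can sanity-check a model against your RAM with a back-of-the-envelope estimate (an approximation, not an official formula): parameters × effective bits per weight ÷ 8 gives the weight size in GB, and you should add roughly 1–2 GB for the KV cache and runtime overhead.

```shell
#!/bin/sh
# Rough RAM estimate for a quantized model's weights in GB:
# params (in billions) x effective bits per weight / 8.
# Effective bits are approximate: Q4_K_M ~4.5, Q8_0 ~8.5.
estimate_gb() {
  params_b=$1   # parameter count in billions, e.g. 8
  bits=$2       # effective bits per weight
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 8 4.5    # 8B model at Q4_K_M -> weights ~4.5 GB
estimate_gb 8 8.5    # 8B model at Q8_0  -> weights ~8.5 GB
estimate_gb 3 4.5    # 3B model at Q4_K_M -> weights ~1.7 GB
```

On an 8 GB machine this makes the fix above concrete: an 8B Q8_0 model cannot fit once overhead is added, while the same model at Q4_K_M can.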

How Do You Check Available RAM on Linux / macOS?

bash
# Linux
free -h

# macOS
vm_stat | grep "Pages free"

# More readable on macOS
top -l 1 | grep "PhysMem"

Error 2: GPU Is Not Being Used (Running on CPU Only)

Verify your GPU is visible to the system before debugging the LLM tool:

bash
# NVIDIA: should show GPU name and driver version
nvidia-smi

# AMD on Linux
rocm-smi

# macOS: check Metal is available
system_profiler SPDisplaysDataType | grep "Metal"

How Do You Enable GPU in Ollama?

  • NVIDIA on Linux: install NVIDIA driver 525 or newer. Ollama ships its own CUDA runtime and detects the GPU automatically after a restart.
  • NVIDIA on Windows: ensure driver version 452.39 or higher. Ollama installs CUDA support automatically via the Windows installer.
  • AMD on Linux: install ROCm 5.7+. Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` for RX 6000-series cards if detection fails.
  • Apple Silicon: Ollama uses Metal by default, no configuration needed. Confirm with `ollama ps` after starting a model; the PROCESSOR column shows how much of the model is on GPU.
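The `ollama ps` check can be scripted. This sketch assumes the column format of recent Ollama versions, where the PROCESSOR column reads something like `100% GPU` or `63%/37% CPU/GPU`; verify against your own output before relying on it:

```shell
#!/bin/sh
# Report which loaded models are fully on GPU and which fall back to CPU.
# Reads `ollama ps` output on stdin and matches the PROCESSOR column.
check_offload() {
  grep -v '^NAME' | while read -r line; do
    case "$line" in
      *"100% GPU"*) echo "fully on GPU: $line" ;;
      *CPU*)        echo "WARNING, CPU involved: $line" ;;
    esac
  done
}

# Usage: ollama ps | check_offload
```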

Error 3: Inference Is Very Slow (Under 5 Tokens/sec)

If generation is under 5 tokens/sec, the model is running on CPU only, or you are running too large a model for your hardware.

  • Confirm whether GPU is active: run `ollama ps` while a model is loaded. The output shows how many layers are on GPU vs CPU.
  • Reduce model size: a 13B model on CPU generates 3–6 tok/sec. Switching to 7B doubles speed; switching to 3B quadruples it.
  • Increase GPU layers in Ollama: set the `num_gpu` parameter to a large value (e.g. 999) in a Modelfile or API request to push all layers to GPU (Ollama will cap it at what fits in VRAM).
  • Use a faster quantization: Q4_K_M is the fastest quantization that maintains acceptable quality. Q8_0 is higher quality but ~30% slower.
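Rather than eyeballing speed, you can measure it: a non-streaming `/api/generate` response from Ollama includes `eval_count` (tokens generated) and `eval_duration` (time spent generating, in nanoseconds). A small helper to turn those two numbers into tokens/sec:

```shell
#!/bin/sh
# Compute generation speed from Ollama's API metrics:
# eval_count tokens over eval_duration nanoseconds.
tokens_per_sec() {
  eval_count=$1
  eval_duration_ns=$2
  awk -v c="$eval_count" -v d="$eval_duration_ns" \
    'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Example: 128 tokens in 4.2 seconds -> 30.5 tok/sec
tokens_per_sec 128 4200000000
```

Get the two fields with `curl -s http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"hi","stream":false}'` and read them out of the JSON response.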

How Do You Set GPU Layers in Ollama?

bash
# In a Modelfile, ask for all layers on GPU
# (Ollama caps this at what fits in VRAM)
FROM llama3.1:8b
PARAMETER num_gpu 999

# Build and run the customized model
ollama create llama3.1-gpu -f Modelfile
ollama run llama3.1-gpu

Error 4: "Connection Refused" When Calling the API

The Ollama server is not running. The API at `localhost:11434` only responds when the Ollama service is active.

bash
# Start Ollama manually
ollama serve

# On Linux: restart the systemd service
sudo systemctl restart ollama

# Verify it is running
curl http://localhost:11434
# Expected: "Ollama is running"
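Scripts that call the API right after boot often hit "connection refused" simply because the server has not finished starting yet. A hypothetical polling helper (assumes `curl` is installed) that waits instead of failing immediately:

```shell
#!/bin/sh
# Poll the Ollama API until it responds, up to a retry limit.
wait_for_ollama() {
  url=${1:-http://localhost:11434}
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s -o /dev/null "$url"; then
      echo "ollama is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "ollama did not respond within ${tries}s" >&2
  return 1
}

# Usage:
#   ollama serve &
#   wait_for_ollama && curl -s http://localhost:11434/api/tags
```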

Error 5: "Model Not Found" Error

This error means the model name in your command does not match any downloaded model.

bash
# List all downloaded models
ollama list

# Pull the model if it is missing
ollama pull llama3.2

# Check the exact model name: tags matter
# "llama3.2" and "llama3.2:3b" are different entries

Error 6: Corrupted Model File

If a model download was interrupted, the cached file may be incomplete. Ollama does not always detect partial downloads.

bash
# Remove the corrupted model
ollama rm llama3.2

# Re-pull it
ollama pull llama3.2

# For LM Studio: delete the model file manually
# Default location: ~/.cache/lm-studio/models/
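For manually downloaded GGUF files, the reliable test for silent corruption is comparing the file's checksum against the one the publisher lists (Hugging Face shows a SHA-256 on each file's page). A small helper, with the expected hash copied from the model's page:

```shell
#!/bin/sh
# Verify a downloaded model file against a published SHA-256 checksum.
verify_model() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
  else
    echo "CORRUPT: expected $expected got $actual" >&2
    return 1
  fi
}

# Usage: verify_model model.gguf <sha256-from-huggingface>
```

On macOS, `shasum -a 256` replaces `sha256sum`.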

Error 7: CUDA / ROCm Initialization Errors

CUDA and ROCm errors typically mean a driver/library version mismatch.

  • "CUDA driver version is insufficient": update NVIDIA drivers. The minimum for llama.cpp is CUDA 11.3 / driver 450.80.
  • "no kernel image is available for execution": your GPU architecture is not supported. GTX 900-series (Maxwell) and older are not supported by recent CUDA builds.
  • AMD ROCm "HSA_STATUS_ERROR_INVALID_ISA": set `HSA_OVERRIDE_GFX_VERSION=10.3.0` (for RX 6000) or `11.0.0` (for RX 7000) before starting Ollama.
  • Check CUDA version: run `nvcc --version` or `nvidia-smi | grep CUDA`.
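For AMD wrapper scripts, the two override values above can be selected automatically from the card family. This mapping is a convenience sketch covering only the two families named in this article, not an official ROCm table:

```shell
#!/bin/sh
# Map an AMD GPU family to the HSA_OVERRIDE_GFX_VERSION value to try.
hsa_override_for() {
  case "$1" in
    rx6*|RX6*) echo "10.3.0" ;;   # RX 6000 series (RDNA2)
    rx7*|RX7*) echo "11.0.0" ;;   # RX 7000 series (RDNA3)
    *)         echo "" ;;         # unknown: set nothing
  esac
}

# Usage:
#   export HSA_OVERRIDE_GFX_VERSION=$(hsa_override_for rx6800)
#   ollama serve
```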

Error 8: Garbled, Repetitive, or Nonsensical Output

Garbled output almost always means you are using a base model instead of the instruct/chat variant, or the wrong prompt template is being applied.

Base models (e.g., `llama3.1:8b`) are not fine-tuned for conversation and produce raw completions that look like garbled text when prompted with a question. Always use the instruct variant: `llama3.1:8b-instruct`.

In Ollama, the default tag for most models already points to the instruct variant. If you downloaded from Hugging Face manually, confirm the filename includes "Instruct" or "chat".
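A cheap first check for manually downloaded GGUF files is to scan the filename for an instruct/chat marker. This is only a string heuristic (it also matches suffixes like Gemma's `-it`), not a substitute for reading the model card:

```shell
#!/bin/sh
# Heuristic: does a model tag or GGUF filename look like an
# instruct/chat variant rather than a base model?
is_instruct() {
  case "$(echo "$1" | tr '[:upper:]' '[:lower:]')" in
    *instruct*|*chat*|*-it*) echo "looks like an instruct/chat variant" ;;
    *) echo "WARNING: may be a base model; expect raw completions" ;;
  esac
}

is_instruct "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf"
is_instruct "Meta-Llama-3.1-8B.Q4_K_M.gguf"
```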

Error 9: "Address Already in Use" (Port Conflict)

Another process is using the port that Ollama or LM Studio needs.

bash
# Find what is using port 11434 (Ollama)
lsof -i :11434

# Kill it by PID
kill -9 <PID>

# Or move Ollama to a different port (keep it on localhost
# unless you mean to expose it to your network)
export OLLAMA_HOST=127.0.0.1:11435
ollama serve

Error 10: Model Stops Generating Mid-Response

Mid-response stops are usually caused by hitting the context length limit or by a generation cap (`num_predict`) set too low.

  • Increase num_predict: this parameter sets the maximum number of tokens to generate. Default is often 128. Increase it: in Ollama, add `PARAMETER num_predict 2048` to a Modelfile.
  • Check context window: if your conversation is very long, the model may be hitting its context limit. Start a fresh session or use a model with a larger context window (Llama 3.2 supports 128K).
  • Check for stop tokens: some Modelfiles include stop sequences that terminate generation early. Review the system prompt and template for unexpected stop patterns.
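The `num_predict` cap can also be raised per request through the API's `options` field, without editing a Modelfile. A naive sketch (the prompt must not contain double quotes, since the JSON is assembled with printf):

```shell
#!/bin/sh
# Build a /api/generate request body with a higher generation cap.
build_request() {
  model=$1
  prompt=$2
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_predict":2048}}' \
    "$model" "$prompt"
}

build_request llama3.2 "Summarize this document"

# Send it:
#   build_request llama3.2 "your prompt" | \
#     curl -s http://localhost:11434/api/generate -d @-
```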

Where Can You Find More Help?

For hardware-specific issues on laptops (thermal throttling, battery drain), see How to Run Local LLMs on a Laptop. For security and privacy configuration questions, see the Local LLM Security & Privacy Checklist. The Ollama GitHub issues page (github.com/ollama/ollama/issues) and the r/LocalLLaMA subreddit are the most active community resources for model-specific bugs.

Sources

  • NVIDIA CUDA Toolkit Compatibility: official version mapping for GPU support
  • llama.cpp Issues: community discussion of common inference errors
  • Ollama Troubleshooting Guide: official documentation for error resolution

Common Mistakes When Troubleshooting

  • Assuming OOM (out-of-memory) errors mean hardware failure; usually you just need a smaller model or quantization.
  • Not checking system load; inference speed degrades significantly if other applications are consuming CPU/GPU.
  • Ignoring GPU driver version mismatches; NVIDIA CUDA requires specific driver versions for each CUDA version.

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum free →
