Key Takeaways
- Out of Memory: Switch to a smaller quantization (Q4_K_M → Q3_K_S) or a smaller model.
- GPU not detected on NVIDIA: Update driver to 525+ on Linux, 452+ on Windows. Run `nvidia-smi` to confirm.
- Extremely slow inference: You are probably running on CPU only. Check with `ollama ps`, then raise GPU offloading in Ollama with the `num_gpu` parameter.
- Connection refused: Ollama is not running. Start it with `ollama serve` or restart the service.
- Garbled output: Wrong model variant or prompt template. Use the Instruct variant of the model, not the base variant.
Error 1: "Out of Memory" / Out-of-Memory Crash
Out-of-Memory errors mean the model needs more RAM than available -- not a hardware failure. This is the most common error for first-time users. See LLM Quantization Explained for background on how quantization reduces RAM requirements.
- Check available RAM: Run `free -h` on Linux or `top -l 1 | grep PhysMem` on macOS, or open Task Manager → Performance → Memory on Windows.
- Switch to smaller quantization: Replace `Q8_0` or `Q5_K_M` with `Q4_K_M`. For Ollama, the quantization is part of the tag after the colon: `ollama run llama3.2:3b-instruct-q4_K_M`.
- Close background applications before loading the model -- browsers and other apps consume RAM, reducing what the model has available.
- Switch to smaller model: if 8B fails on 8 GB RAM, try `llama3.2:3b` (requires only ~2.5 GB).
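As a rough sanity check before switching models, the arithmetic behind these steps can be sketched in shell. The estimate below assumes weights take about parameters × bits-per-weight / 8 bytes, plus a 20% allowance for the KV cache and runtime buffers. Both the overhead factor and the 4.5 bits-per-weight example are illustrative assumptions, not figures from Ollama or llama.cpp, and the helper names are hypothetical.

```shell
#!/bin/sh
# Rough estimate in GB: params (billions) * bits-per-weight / 8,
# times an ASSUMED 1.2 overhead factor for KV cache and buffers.
estimate_model_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

# Exit status 0 when the estimate ($1, GB) fits in free RAM ($2, GB).
model_fits() {
    awk -v need="$1" -v free="$2" 'BEGIN { exit !(need <= free) }'
}

need=$(estimate_model_gb 8 4.5)   # hypothetical 8B model at ~4.5 bits/weight
if model_fits "$need" 8; then
    echo "Estimated ${need} GB: should fit in 8 GB of free RAM"
else
    echo "Estimated ${need} GB: too large for 8 GB of free RAM"
fi
```

Actual usage varies with context length and runtime, so treat the result as a guide to compare against the free memory your OS reports, not as a guarantee.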
Check Available RAM on Linux / macOS
# Linux
free -h
# macOS
vm_stat | grep "Pages free"
# More readable on macOS
top -l 1 | grep "PhysMem"
Error 2: GPU Not Being Used (Running on CPU Only)
GPU not being used means the LLM runs 5–10× slower than expected -- check driver installation before anything else. Verify that your GPU is visible to the system:
# NVIDIA -- should show GPU name and driver version
nvidia-smi
# AMD on Linux
rocm-smi
# macOS -- check if Metal is available
system_profiler SPDisplaysDataType | grep "Metal"
How Do You Enable GPU in Ollama?
- NVIDIA on Linux: Install NVIDIA driver 525+ and CUDA Toolkit 11.3+. Ollama auto-detects CUDA on restart.
- NVIDIA on Windows: Ensure driver version is 452.39 or higher. Ollama automatically installs CUDA support via the Windows installer.
- AMD on Linux: Install ROCm 5.7+. If detection fails, set `HSA_OVERRIDE_GFX_VERSION=10.3.0` for RX 6000-series cards or `11.0.0` for RX 7000-series cards.
- Apple Silicon: Ollama uses Metal by default -- no configuration needed. Confirm with `ollama ps` after loading a model; the PROCESSOR column should report GPU.
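Before working through the per-platform steps, it can help to see which of the vendor query tools mentioned above is installed at all. The following is a minimal sketch; `first_available` is a hypothetical helper name, and finding a tool on `PATH` only shows that it is installed, not that the driver is working.

```shell
#!/bin/sh
# Print the first command from the argument list that exists on PATH.
# Returns non-zero when none of them is found.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

if tool=$(first_available nvidia-smi rocm-smi system_profiler); then
    echo "GPU query tool found: $tool"
else
    echo "No GPU query tool found; check the driver installation first"
fi
```

If a tool is found, run it directly (as in the commands above) to confirm the GPU and driver version are actually reported.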
Error 3: Inference Is Extremely Slow (Under 5 Tokens/Second)
Under 5 tokens/second means the model is running on CPU only or the model is too large for available VRAM. A 7B model on GPU generates 30–80 tok/s; the same model on CPU generates 3–10 tok/s.
- Confirm whether GPU is active: Run `ollama ps` while a model is loaded. The PROCESSOR column shows how the model is split between GPU and CPU (for example, `100% GPU`).
- Reduce model size: a 13B model on CPU generates 3–6 tok/s. Switching to 7B roughly doubles the speed; switching to 3B roughly quadruples it.
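- Increase GPU layers in Ollama: Set the `num_gpu` parameter to a high value such as 999 to request that all layers go to the GPU. Ollama normally chooses the layer count automatically from free VRAM, so a forced value that does not fit may fail to load.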
- Use faster quantization: Q4_K_M is the fastest quantization that maintains acceptable quality. Q8_0 is higher quality but ~30% slower.
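To see where a setup falls against the 5 tok/s threshold, generation speed can be computed from a token count and the elapsed time. The sketch below assumes the inputs come from the `eval_count` and `eval_duration` (nanoseconds) fields in the response from Ollama's `/api/generate` endpoint; the helper names and the sample numbers are hypothetical.

```shell
#!/bin/sh
# tokens/second = generated tokens / duration in nanoseconds * 1e9
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1e9 }'
}

# Exit status 0 when the rate is under the 5 tok/s threshold used in this section.
is_slow() {
    awk -v r="$1" 'BEGIN { exit !(r < 5) }'
}

rate=$(tokens_per_sec 120 4000000000)   # sample: 120 tokens in 4 seconds
if is_slow "$rate"; then
    echo "$rate tok/s: likely CPU-only or spilling out of VRAM"
else
    echo "$rate tok/s: above the slow threshold"
fi
```

Running `ollama run <model> --verbose` also prints an eval rate after each response, which avoids the manual calculation.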
Set GPU Layers in Ollama
# In an interactive session started with `ollama run llama3.1:8b`
/set parameter num_gpu 999
# Or in a Modelfile
FROM llama3.1:8b
PARAMETER num_gpu 999
Error 4: "Connection Refused" When Calling the API
Connection Refused means Ollama is not running -- the API at `localhost:11434` only responds when the service is active. Start it before making API calls.
# Start Ollama manually
ollama serve
# On Linux -- restart the systemd service (usually needs root)
sudo systemctl restart ollama
# Verify it is running
curl http://localhost:11434
# Expected: "Ollama is running"
Error 5: "Model Not Found" Error
"Model not found" means the model name in your command does not match any downloaded model. Model names in Ollama are case-sensitive and include version tags.
# List all downloaded models
ollama list
# Pull a model if it is missing
ollama pull llama3.2
# Check the exact model name -- tags matter
# "llama3.2" and "llama3.2:3b" are different entries
Error 6: Corrupted Model File
Corrupted model files are caused by interrupted downloads -- delete and re-pull to fix. Ollama does not always auto-detect partial downloads.
# Remove the corrupted model
ollama rm llama3.2
# Re-pull it
ollama pull llama3.2
# For LM Studio: manually delete model files
# Default location: ~/.cache/lm-studio/models/
Error 6b: "Failed to Resolve Model" in LM Studio
"Failed to resolve model lmstudio-community/..." means LM Studio cannot find the model in its registry. This typically happens when a model is downloaded from `lmstudio-community` on Hugging Face but the registry reference has changed. LM Studio is using a cached registry entry that no longer matches available model files.
- Open LM Studio → My Models tab → click the three-dot menu on the failed model → select "Delete model" (keeps the file, removes the registry entry)
- Search for the same model in the model browser and re-download it -- LM Studio will re-register it
- Alternative: quit LM Studio, navigate to `~/.cache/lm-studio/models/`, delete the specific model folder, then re-download
# Manually clear LM Studio model cache (macOS/Linux)
rm -rf ~/.cache/lm-studio/models/lmstudio-community/<model-name>
Error 7: CUDA / ROCm Initialization Errors
CUDA and ROCm errors mean driver/library version mismatch -- update your driver to the required minimum version.
- "CUDA driver version insufficient": Update NVIDIA driver. The minimum for llama.cpp is CUDA 11.3 / driver 450.80.
- "No kernel image available for execution": Your GPU architecture is unsupported. GTX 900-series (Maxwell) and older are not supported by recent CUDA builds.
- AMD ROCm "HSA_STATUS_ERROR_INVALID_ISA": Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` (for RX 6000) or `11.0.0` (for RX 7000) before starting Ollama.
- Check CUDA version: Run `nvcc --version` or `nvidia-smi | grep CUDA`.
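The driver check in the list above can be automated with a small version comparison. This is a minimal sketch: `version_ge` is a hypothetical helper that compares dotted version numbers component by component, and the 450.80 threshold is the minimum quoted in this section. The `nvidia-smi --query-gpu=driver_version` call is only run if `nvidia-smi` is installed.

```shell
#!/bin/sh
# Exit status 0 when dotted version $1 >= $2, comparing each component numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$driver" 450.80; then
        echo "Driver $driver meets the 450.80 minimum"
    else
        echo "Driver '$driver' is below 450.80 or could not be read; update it"
    fi
else
    echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```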
Error 8: Garbled, Repetitive, or Nonsensical Output
Garbled output almost always means you are using a base model instead of an instruct/chat variant. Base models generate raw text completions, not answers to questions.
Base models are not fine-tuned for conversation; when prompted with a question, they continue the text instead of answering it, and the result can look like gibberish. Always use the instruct variant, for example `llama3.1:8b-instruct-q4_K_M` rather than a `text` tag such as `llama3.1:8b-text-q4_K_M`. See How to Install LM Studio for a GUI-based method to switch model variants.
In Ollama, the default tag for most models already points to the instruct variant. If you manually downloaded from Hugging Face, confirm the filename includes "Instruct" or "chat".
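The filename check described above is easy to script when several downloaded files need sorting. The sketch below is a simple name heuristic only: `looks_instruct` is a hypothetical helper, the filenames are made-up examples, and a matching name does not prove the file ships the right prompt template.

```shell
#!/bin/sh
# Exit status 0 when the name contains "instruct" or "chat" (case-insensitive).
looks_instruct() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *instruct*|*chat*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example filenames (illustrative, not taken from a specific repository).
for f in "Llama-3.1-8B-Instruct-Q4_K_M.gguf" "Llama-3.1-8B-Q4_K_M.gguf"; do
    if looks_instruct "$f"; then
        echo "$f: looks like an instruct/chat variant"
    else
        echo "$f: possibly a base model; expect raw completions"
    fi
done
```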
Error 9: "Address Already in Use" -- Port Conflict
"Address already in use" means another process is occupying port 11434 (Ollama) or 1234 (LM Studio). Find and kill the conflicting process.
# Find what is using port 11434 (Ollama)
lsof -i :11434
# Kill it by PID
kill -9 <PID>
# Or change Ollama port (127.0.0.1 keeps it local-only; 0.0.0.0 would expose it to the network)
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
Error 10: Model Stops Generating Mid-Response
Stopping mid-response is caused by reaching context length limits or `num_predict` set too low. The default `num_predict` in many configurations is 128 tokens -- just enough for 1–2 sentences.
- Increase `num_predict`: This parameter sets the maximum number of tokens to generate. In Ollama, add `PARAMETER num_predict 2048` to the Modelfile.
- Check context window: If your conversation is very long, the model may hit its context limit. Start a new session or use a model with a larger context window (Llama 3.2 3B supports 128K).
- Check stop tokens: Some Modelfiles include stop sequences that terminate generation early. Check the system prompt and template for unexpected stop patterns.
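The `num_predict` change from the list above can be sketched as a short script that writes a Modelfile. The output path, the helper name `write_modelfile`, and the model name `llama3.1-long` in the comments are made-up examples; the `ollama create` step is left as a comment so the script only produces the file.

```shell
#!/bin/sh
# Write a minimal Modelfile: $1 = base model, $2 = num_predict, $3 = output path.
write_modelfile() {
    cat > "$3" <<EOF
FROM $1
PARAMETER num_predict $2
EOF
}

modelfile_path="${TMPDIR:-/tmp}/Modelfile.long"
write_modelfile "llama3.1:8b" 2048 "$modelfile_path"
cat "$modelfile_path"

# To build and run it (llama3.1-long is a hypothetical name):
#   ollama create llama3.1-long -f "$modelfile_path"
#   ollama run llama3.1-long
```

Keeping the limit in a Modelfile makes the setting persistent for that model, so it does not need to be repeated on every request.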
Where to Find More Help
For hardware-specific issues on laptops (thermal throttling, battery drain), see How to Run Local LLMs on a Laptop. For security and privacy configuration questions, see Local LLM Security & Privacy Checklist. The Ollama GitHub issues page (github.com/ollama/ollama/issues) and r/LocalLLaMA subreddit are the most active community resources for model-specific bugs.
Common Mistakes in Local LLM Troubleshooting
- Confusing OOM errors with hardware failure -- the error means RAM is too small for the model, not that hardware is broken. Fix: use Q4_K_M quantization or smaller model.
- Not checking system load -- inference speed degrades significantly when other applications consume CPU/GPU. Close your browser, video player, and background processes before benchmarking.
- Ignoring driver version incompatibility -- NVIDIA CUDA requires specific driver versions per CUDA release. Check `nvidia-smi` output; driver version must be ≥450.80 for CUDA 11.x.
- Using wrong model name in Ollama -- `llama3.2` and `llama3.2:3b` are different Ollama tags. Run `ollama list` to see the exact names of downloaded models.
- Not restarting Ollama after driver update -- Ollama detects GPU at startup. After updating NVIDIA or ROCm drivers, fully restart Ollama (`ollama serve`) to re-detect GPU.
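The model-name mistake in the list above is straightforward to guard against in scripts. The sketch below relies on Ollama treating a name without a tag as `:latest`; `normalize_model_name` is a hypothetical helper, and the lookup against `ollama list` assumes the first column of its output is the name with its tag. The lookup is skipped when `ollama` is not installed.

```shell
#!/bin/sh
# Append ":latest" when the name carries no explicit tag.
normalize_model_name() {
    case "$1" in
        *:*) printf '%s\n' "$1" ;;
        *)   printf '%s:latest\n' "$1" ;;
    esac
}

wanted=$(normalize_model_name "llama3.2")
echo "Looking for: $wanted"

if command -v ollama >/dev/null 2>&1; then
    # Skip the header row, keep the NAME column, and match the whole line exactly.
    if ollama list | awk 'NR > 1 { print $1 }' | grep -Fxq "$wanted"; then
        echo "Downloaded"
    else
        echo "Not downloaded; try: ollama pull $wanted"
    fi
fi
```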
Sources
- NVIDIA. (2024). "CUDA Toolkit Release Notes." https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/ -- Official CUDA driver version requirements per release.
- Ollama. (2026). "Ollama Troubleshooting." https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md -- Official Ollama documentation for common errors.
- AMD. (2024). "ROCm Installation Guide." https://rocm.docs.amd.com/projects/install-on-linux/en/latest/ -- Official AMD ROCm installation and Linux GPU support.