Key Takeaways
- Out of memory: switch to a smaller quantization (Q4_K_M → Q3_K_S) or a smaller model.
- GPU not detected on NVIDIA: update drivers to 525+ on Linux, 452+ on Windows. Run `nvidia-smi` to confirm.
- Very slow inference: you are likely running on CPU only. Confirm GPU detection, and raise the `num_gpu` option (e.g. `PARAMETER num_gpu 999` in a Modelfile) to offload more layers.
- Connection refused: Ollama is not running. Start it with `ollama serve` or restart the service.
- Garbled output: wrong prompt template. Use the instruct variant of the model, not the base variant.
Error 1: "Not Enough Memory" / Out-of-Memory Crash
The model requires more RAM than is available. This is the most common error for first-time users.
- Check available RAM: on Linux run `free -h`; on macOS use Activity Monitor or `top -l 1`; on Windows open Task Manager → Performance → Memory.
- Switch to a smaller quantization: replace `Q8_0` or `Q5_K_M` with `Q4_K_M`. For Ollama: `ollama run llama3.1:8b-instruct-q4_K_M`.
- Close background applications before loading the model; browsers and other apps consume RAM that reduces what is available for the model.
- Switch to a smaller model: if 8B is failing on 8 GB RAM, try `llama3.2:3b` (requires only ~2.5 GB).
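A rough rule of thumb ties memory needs to parameter count and quantization: weights take roughly params × bits-per-weight / 8 bytes, plus overhead for the KV cache and runtime. The sketch below encodes that heuristic; the 20% overhead factor and effective bits-per-weight values are assumptions, not exact figures.

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes (params * bits / 8) plus ~20%
    for KV cache and runtime overhead. A heuristic, not an exact figure."""
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# An 8B model at Q4 (~4.5 effective bits/weight) vs Q8 (~8.5 bits/weight)
print(estimate_ram_gb(8, 4.5))  # 5.4
print(estimate_ram_gb(8, 8.5))  # 10.2
```

This is consistent with the advice above: an 8B model at Q4 wants ~5 GB free, which is tight on an 8 GB machine, while a 3B model at Q4 (~2 GB) fits comfortably.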
How Do You Check Available RAM on Linux / macOS?
# Linux
free -h
# macOS
vm_stat | grep "Pages free"
# More readable on macOS
top -l 1 | grep "PhysMem"
Error 2: GPU Is Not Being Used (Running on CPU Only)
Verify your GPU is visible to the system before debugging the LLM tool:
# NVIDIA: should show GPU name and driver version
nvidia-smi
# AMD on Linux
rocm-smi
# macOS: check that Metal is available
system_profiler SPDisplaysDataType | grep "Metal"
How Do You Enable GPU in Ollama?
- NVIDIA on Linux: install NVIDIA driver 525 or newer. A separate CUDA toolkit install is not required; Ollama ships its own CUDA runtime and detects the GPU automatically on restart.
- NVIDIA on Windows: ensure driver version 452.39 or higher. Ollama installs CUDA support automatically via the Windows installer.
- AMD on Linux: install ROCm 5.7+. Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` for RX 6000-series cards (or `11.0.0` for RX 7000-series) if detection fails.
- Apple Silicon: Ollama uses Metal by default, so no configuration is needed. Confirm with `ollama ps` after starting a model; GPU layers appear in the output.
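The PROCESSOR column of `ollama ps` reports the CPU/GPU split directly. A small parser can turn that into a number you can alert on; the column formats handled below ("100% GPU", "100% CPU", "48%/52% CPU/GPU") are an assumption based on current Ollama output and may change between versions.

```python
def gpu_fraction(processor_field: str) -> float:
    """Parse the PROCESSOR column of `ollama ps` and return the GPU
    share as 0.0-1.0. Handled formats (assumed): "100% GPU",
    "100% CPU", "48%/52% CPU/GPU"."""
    field = processor_field.strip()
    if field.endswith("GPU") and "/" not in field:
        return float(field.split("%")[0]) / 100
    if field.endswith("CPU"):
        return 0.0
    # Mixed form "48%/52% CPU/GPU": the second percentage is the GPU share
    cpu_part, gpu_part = field.split()[0].split("/")
    return float(gpu_part.rstrip("%")) / 100

print(gpu_fraction("100% GPU"))         # 1.0
print(gpu_fraction("48%/52% CPU/GPU"))  # 0.52
```

Anything below 1.0 means some layers spilled to CPU, which is usually a VRAM limit rather than a configuration error.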
Error 3: Inference Is Very Slow (Under 5 Tokens/sec)
If generation is under 5 tokens/sec, the model is running on CPU only, or you are running too large a model for your hardware.
- Confirm whether GPU is active: run `ollama ps` while a model is loaded. The output shows how many layers are on GPU vs CPU.
- Reduce model size: a 13B model on CPU typically generates 3-6 tok/sec. Switching to 7B roughly doubles speed; switching to 3B roughly quadruples it.
- Increase GPU layers in Ollama: set `PARAMETER num_gpu 999` in a Modelfile (or pass `num_gpu` in the API `options`) to push all layers to GPU; Ollama caps the count at what fits in VRAM.
- Use a faster quantization: Q4_K_M is the fastest quantization that maintains acceptable quality. Q8_0 is higher quality but ~30% slower.
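You can measure your actual speed instead of guessing: Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating). A minimal sketch of the arithmetic, using an illustrative response fragment:

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response,
    which reports eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Illustrative values: 120 tokens generated in 24 seconds
resp = {"eval_count": 120, "eval_duration": 24_000_000_000}
print(round(tokens_per_second(resp), 1))  # 5.0
```

If the measured figure sits in the single digits despite a capable GPU, check the `ollama ps` offload split before changing models.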
How Do You Set GPU Layers in Ollama?
# Pass num_gpu per request through the API options
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": { "num_gpu": 999 }
}'
# Or in a Modelfile
FROM llama3.1:8b
PARAMETER num_gpu 999
Error 4: "Connection Refused" When Calling the API
The Ollama server is not running. The API at `localhost:11434` only responds when the Ollama service is active.
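Before reaching for the commands below, a quick scripted check can confirm whether anything is listening on Ollama's default port. A minimal sketch using only Python's standard library (11434 is the default; adjust if you changed `OLLAMA_HOST`):

```python
import socket

def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not port_open():
    print("Nothing is listening on localhost:11434 - start Ollama with `ollama serve`")
```

A closed port means the service is down; an open port with failing API calls points elsewhere (firewall, proxy, or a different process squatting on the port, covered under Error 9).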
# Start Ollama manually
ollama serve
# On Linux β restart the systemd service
systemctl restart ollama
# Verify it is running
curl http://localhost:11434
# Expected: "Ollama is running"
Error 5: "Model Not Found" Error
This error means the model name in your command does not match any downloaded model.
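Tags are the usual culprit: Ollama resolves a bare name to the `:latest` tag, so `llama3.2` and `llama3.2:3b` are distinct entries. A helper that normalizes names the same way makes surprising misses easy to explain (a sketch; the model names are illustrative):

```python
def has_model(requested: str, installed: list[str]) -> bool:
    """Check whether a requested model matches an installed one.
    Ollama treats a bare name as the ":latest" tag."""
    if ":" not in requested:
        requested += ":latest"
    return requested in installed

installed = ["llama3.2:latest", "llama3.2:3b"]
print(has_model("llama3.2", installed))     # True  (resolves to :latest)
print(has_model("llama3.2:1b", installed))  # False (tag never pulled)
```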
# List all downloaded models
ollama list
# Pull the model if it is missing
ollama pull llama3.2
# Check exact model name β tags matter
# "llama3.2" and "llama3.2:3b" are different entries
Error 6: Corrupted Model File
If a model download was interrupted, the cached file may be incomplete. Ollama does not always detect partial downloads.
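For manually downloaded GGUF files, you can verify integrity against the SHA-256 checksum that Hugging Face publishes for each file, rather than waiting for a load failure. A minimal sketch that streams the file so multi-GB models do not blow up memory:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum published alongside the download, e.g.:
# sha256_of("model.gguf") == "<published sha256>"
```

A mismatch means the download was truncated or corrupted; delete and re-pull as shown below.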
# Remove the corrupted model
ollama rm llama3.2
# Re-pull it
ollama pull llama3.2
# For LM Studio: delete the model file manually
# Default location: ~/.cache/lm-studio/models/
Error 7: CUDA / ROCm Initialization Errors
CUDA and ROCm errors typically mean a driver/library version mismatch.
- "CUDA driver version is insufficient": update NVIDIA drivers. CUDA 11.x builds require driver 450.80.02 or newer on Linux (452.39+ on Windows).
- "no kernel image is available for execution": your GPU architecture is not included in the build. Recent CUDA builds require compute capability 5.0 (Maxwell) or newer, so Kepler-era cards (GTX 700-series and older) are not supported.
- AMD ROCm "HSA_STATUS_ERROR_INVALID_ISA": set `HSA_OVERRIDE_GFX_VERSION=10.3.0` (for RX 6000) or `11.0.0` (for RX 7000) before starting Ollama.
- Check CUDA version: run `nvcc --version` or `nvidia-smi | grep CUDA`.
Error 8: Garbled, Repetitive, or Nonsensical Output
Garbled output almost always means you are using a base model instead of the instruct/chat variant, or the wrong prompt template is being applied.
Base models (tagged `-text` in Ollama, e.g. `llama3.1:8b-text`, or Hugging Face checkpoints without "Instruct" in the name) are not fine-tuned for conversation and produce raw completions that look like garbled text when prompted with a question. Always use an instruct variant such as `llama3.1:8b-instruct-q4_K_M`.
In Ollama, the default tag for most models already points to the instruct variant. If you downloaded from Hugging Face manually, confirm the filename includes "Instruct" or "chat".
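The template matters because instruct models expect their special tokens around each turn; feeding a bare question to a model trained on this wrapping (or vice versa) is what produces garbled output. The sketch below approximates the Llama 3 chat template; verify the exact special tokens against the model card before relying on it, since each model family uses its own.

```python
def llama3_prompt(user_msg: str, system_msg: str = "You are a helpful assistant.") -> str:
    """Approximate Llama 3 instruct template (check the model card
    for the authoritative token sequence)."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system_msg}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_prompt("What is the capital of France?"))
```

Ollama applies the correct template automatically from the model's metadata; hand-rolling it like this is only needed when calling a raw completion endpoint (e.g. llama.cpp directly).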
Error 9: "Address Already in Use" (Port Conflict)
Another process is using the port that Ollama or LM Studio needs.
# Find what is using port 11434 (Ollama)
lsof -i :11434
# Kill it by PID
kill <PID>   # add -9 only if it refuses to exit
# Or change Ollama's port (keep it bound to localhost)
export OLLAMA_HOST=127.0.0.1:11435
ollama serve
Error 10: Model Stops Generating Mid-Response
Mid-response stops are caused by hitting the context length limit or a generation parameter set too low.
- Increase num_predict: this parameter sets the maximum number of tokens to generate. Default is often 128. Increase it: in Ollama, add `PARAMETER num_predict 2048` to a Modelfile.
- Check context window: if your conversation is very long, the model may be hitting its context limit. Note that Ollama defaults the loaded context (`num_ctx`) to a few thousand tokens even for models that support far more; raise it with `PARAMETER num_ctx 8192`, start a fresh session, or use a model with a larger window (Llama 3.2 supports up to 128K).
- Check for stop tokens: some Modelfiles include stop sequences that terminate generation early. Review the system prompt and template for unexpected stop patterns.
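A quick way to check the context-limit hypothesis is to estimate the conversation's token count before sending it. The sketch below uses the rough "~4 characters per token" heuristic for English text; it is an assumption, and a real tokenizer is needed for exact counts.

```python
def rough_token_count(text: str) -> int:
    """Heuristic: English text averages ~4 characters per token.
    Use a real tokenizer for exact counts."""
    return max(1, len(text) // 4)

def fits_in_context(conversation: str, num_ctx: int = 2048, reserve: int = 512) -> bool:
    """Leave `reserve` tokens of headroom for the model's reply."""
    return rough_token_count(conversation) + reserve <= num_ctx

print(fits_in_context("hello " * 2000))  # False: ~3000 tokens vs a 2048 window
```

If the estimate is near or over `num_ctx`, raise the context window or trim the history before blaming `num_predict` or stop tokens.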
Where Can You Find More Help?
For hardware-specific issues on laptops (thermal throttling, battery drain), see How to Run Local LLMs on a Laptop. For security and privacy configuration questions, see the Local LLM Security & Privacy Checklist. The Ollama GitHub issues page (github.com/ollama/ollama/issues) and the r/LocalLLaMA subreddit are the most active community resources for model-specific bugs.
Sources
- NVIDIA CUDA Toolkit Compatibility: official version mapping for GPU support
- llama.cpp Issues: community discussion of common inference errors
- Ollama Troubleshooting Guide: official documentation for error resolution
Common Mistakes When Troubleshooting
- Assuming OOM (out-of-memory) errors mean hardware failure: usually you just need a smaller model or quantization.
- Not checking system load: inference speed degrades significantly if other applications are consuming CPU/GPU.
- Ignoring GPU driver version mismatches: NVIDIA CUDA requires specific driver versions for each CUDA version.