Key Takeaways
- `ollama pull <model>` — Download a model (e.g., `ollama pull llama3.2:3b`).
- `ollama run <model>` — Start a chat with a model.
- `ollama list` — Show all downloaded models and their sizes.
- `ollama rm <model>` — Delete a downloaded model.
- `ollama serve` — Start the Ollama API server (runs automatically on Mac/Windows).
- `ollama create <name> -f <modelfile>` — Build a custom model from a Modelfile.
- As of April 2026, these commands are stable and cover all common use cases.
What Are the Essential Ollama Commands?
- `ollama list` — Show downloaded models, disk usage, and modification date.
- `ollama pull <model>` — Download a model by name (e.g., `ollama pull mistral`).
- `ollama run <model>` — Start a chat session with a model.
- `ollama rm <model>` — Delete a model and free up disk space.
- `ollama serve` — Start the REST API server (typically runs automatically).
- `ollama help` — Show all available commands.
How Do You Manage Models in Ollama?
Model management in Ollama is entirely command-based:
```shell
# List all downloaded models
ollama list

# Download a model from the Ollama library
ollama pull llama3.2:3b        # default 4-bit quantization (~2 GB)
ollama pull llama3.2:3b-fp16   # full precision (~6.5 GB)

# Download a specific quantization
ollama pull qwen2.5:7b-instruct-q4_K_M   # 4-bit quantization
ollama pull qwen2.5:7b-instruct-q8_0     # 8-bit quantization

# See disk usage
du -sh ~/.ollama/models

# Delete a model
ollama rm llama3.2:3b

# Pull from a custom registry (advanced)
ollama pull localhost:5000/custom-model
```

How Do You Run and Serve Models?
There are two ways to use Ollama: the interactive CLI and the REST API.

```shell
# 1. Interactive chat (CLI)
ollama run llama3.2:3b
# Type your prompts and press Enter; /bye exits

# 2. Start the API server (runs in the background)
ollama serve
# OpenAI-compatible API listens at http://localhost:11434/v1

# Call the API from another terminal
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

How Do You Create Custom Models With Modelfiles?
A Modelfile is a configuration file (like a Dockerfile) that defines a custom model by starting from a base model and adding system prompts, parameters, and weights.
Create a file named `Modelfile`:

```
FROM llama3.2:3b

# Add a system prompt
SYSTEM """
You are a helpful expert in machine learning.
Always explain complex concepts in simple terms.
"""

# Adjust sampling parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then build the custom model and use it:

```shell
ollama create ml-expert -f Modelfile
ollama run ml-expert
```

What Quantization Options Does Ollama Support?
Quantization reduces model size and VRAM usage by storing weights as lower-precision numbers. Ollama uses the GGUF format, which supports multiple quantization levels:
| Quantization | Size (7B) | VRAM | Quality | Speed |
|---|---|---|---|---|
| FP16 (full precision) | 14 GB | 16 GB | Best | Slowest |
| Q8_0 (8-bit) | 7 GB | 8 GB | Excellent | Fast |
| Q6_K (6-bit) | 5.5 GB | 6 GB | Very good | Fast |
| Q5_K_M (5-bit) | 5 GB | 5.5 GB | Good | Very fast |
| Q4_K_M (4-bit) | 4.7 GB | 5 GB | Good | Very fast |
| Q3_K_M (3-bit) | 3.3 GB | 4 GB | Fair | Fastest |
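The sizes in the table follow roughly from parameter count times bits per weight. A quick back-of-the-envelope estimator (a rough sketch using nominal bit widths; real GGUF files run somewhat larger because some layers stay at higher precision, and the KV cache adds VRAM on top):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk weight size in GB: parameters * bits / 8."""
    return params_billion * bits_per_weight / 8

# A 7B model at a few nominal quantization levels
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4", 4)]:
    print(f"{name}: ~{model_size_gb(7, bits):.1f} GB")
```

This reproduces the table's pattern: halving the bit width roughly halves the footprint, which is why a 4-bit 7B model fits on a GPU that FP16 cannot.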
How Do You Generate Embeddings With Ollama?
Embeddings are numerical representations of text, useful for RAG (Retrieval-Augmented Generation) and semantic search.
```shell
# Pull an embedding model
ollama pull nomic-embed-text   # English-focused, 137M parameters

# Generate embeddings (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": "The quick brown fox jumps"
  }'
# The response contains a 768-dimensional embedding vector
```

What Environment Variables Control Ollama?
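Once you have embedding vectors, semantic search ranks texts by cosine similarity. A minimal pure-Python sketch (the 4-dimensional toy vectors below stand in for real 768-dimensional model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embedding output
query = [0.1, 0.9, 0.2, 0.4]
doc_a = [0.1, 0.8, 0.3, 0.5]   # similar direction -> high score
doc_b = [0.9, 0.1, 0.0, 0.1]   # different direction -> low score
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

In a real RAG pipeline you would embed each document once, store the vectors, and at query time return the documents with the highest cosine similarity to the query embedding.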
Key environment variables:
- `OLLAMA_HOST` — Listen address (default: `127.0.0.1:11434`). Set to `0.0.0.0:11434` for network access.
- `OLLAMA_MODELS` — Where models are stored (default: `~/.ollama/models`).
- `OLLAMA_DEBUG` — Set to `1` for detailed logs.
- `OLLAMA_KEEP_ALIVE` — How long a model stays loaded in memory after use (default: 5 minutes).
- GPU selection — Ollama auto-detects GPUs; restrict which devices it uses with `CUDA_VISIBLE_DEVICES` (NVIDIA) or `ROCR_VISIBLE_DEVICES` (AMD).
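Put together, a typical server setup looks like this (a sketch; the path and keep-alive value are example choices, not defaults):

```shell
# Expose the API on the local network and keep models loaded for an hour
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=/data/ollama/models   # example storage path
export OLLAMA_KEEP_ALIVE=1h
ollama serve
```

On Linux systemd installs, the same variables go in the service unit via `systemctl edit ollama` rather than a shell session.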
Common Mistakes With Ollama Commands
- Forgetting model tags. `ollama pull llama3.2` pulls whatever the `latest` tag points to, which may not be the size you want; `ollama pull llama3.2:3b` pins the 3B version.
- Not realizing `ollama serve` runs automatically. On Mac and Windows, Ollama starts the API automatically when you launch the app. On Linux, you may need to start it manually.
- Pulling the wrong quantization. Always specify the exact model tag (e.g., `qwen2.5:7b-instruct-q4_K_M`) to control VRAM usage.
- Expecting Ollama to work offline after pulling. Ollama itself works offline, but models must be pulled while connected to the internet.
Common Questions About Ollama Commands
Where are Ollama models stored?
Default: `~/.ollama/models` on macOS/Linux or `%USERPROFILE%\.ollama\models` on Windows. Set `OLLAMA_MODELS` to change the location.
Can I move models between computers?
Yes. Copy the model files from `~/.ollama/models` to the other computer's `~/.ollama/models` (include both the `blobs` and `manifests` subdirectories), and `ollama list` will recognize them.
How do I see active model memory usage?
Use `ollama ps` to list currently-loaded models. Models are unloaded after 5 minutes of inactivity by default.
Can I run multiple models simultaneously?
Yes, but they share VRAM. Running two 8B models requires 16 GB VRAM. Each additional model increases memory usage.
Sources
- Ollama GitHub — github.com/ollama/ollama
- Ollama Documentation — github.com/ollama/ollama/blob/main/docs
- Ollama Model Library — ollama.ai/library