Key Takeaways
- The fastest path: install Ollama → run `ollama run llama3.2` → chat in your terminal. Total time: under 5 minutes on a fast connection.
- For 8 GB RAM machines: start with `llama3.2:3b` (2 GB download) or `phi3:mini` (2.3 GB). Both run on any modern laptop.
- Expect 15–40 tokens/sec on CPU, 60–120 tokens/sec on a mid-range GPU or Apple Silicon.
- First responses may feel slower than cloud APIs: local models trade speed for privacy and zero cost.
- After the initial model download, everything runs offline. No internet needed for subsequent sessions.
Step 1: How Do You Install Ollama?
Ollama is the fastest way to run a local LLM. Install it with one command or a 2-minute download:
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com/download
How Do You Verify Ollama Is Running?
After installation, confirm Ollama is active:
curl http://localhost:11434
# Expected output: Ollama is running
Step 2: Which Model Should You Choose?
Pick a model based on your available RAM. When in doubt, start with `llama3.2:3b`; it runs on any machine with 4 GB of RAM and produces useful output:
| Your RAM | Recommended Model | Download Size | Why |
|---|---|---|---|
| 4 GB | llama3.2:1b | ~1.3 GB | Smallest usable Llama model |
| 8 GB | llama3.2:3b | ~2 GB | Best quality/size ratio for beginners |
| 8–16 GB | llama3.1:8b | ~4.7 GB | Strong general-purpose model |
| 16+ GB | mistral:7b or qwen2.5:7b | ~4–5 GB | Competitive quality, fast inference |
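The RAM recommendations above follow a rough rule of thumb: a 4-bit quantized model needs about half a byte per parameter for its weights, plus an overhead of a gigabyte or two for the KV cache and runtime. A quick sketch of that estimate (the 0.55 bytes/parameter figure and the overhead constant are ballpark assumptions, not Ollama internals):

```python
def estimate_ram_gb(params_billions: float,
                    bytes_per_param: float = 0.55,
                    overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to run a Q4-quantized model.

    bytes_per_param ~0.55 approximates 4-bit weights plus
    quantization metadata; overhead_gb covers the KV cache and
    runtime buffers. Both numbers are rule-of-thumb assumptions.
    """
    weights_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return round(weights_gb + overhead_gb, 1)

for size in (1, 3, 8):
    print(f"{size}B model: ~{estimate_ram_gb(size)} GB RAM")
```

The estimates line up with the table: a 3B model fits comfortably in 8 GB of RAM, while an 8B model wants most of an 8 GB machine to itself.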
Step 3: How Do You Pull the Model?
Download the model with `ollama pull`. The model is saved to `~/.ollama/models` and only needs to be downloaded once:
ollama pull llama3.2
# Or pull a specific size variant
ollama pull llama3.2:3b
ollama pull llama3.1:8b
What Does the Download Look Like?
pulling manifest
pulling 966de95ca8dc... 100% ██████████████████ 1.9 GB
pulling 9f436a92eb8b... 100% ██████████████████   42 B
verifying sha256 digest
writing manifest
success
Ollama shows download progress in the terminal. A `llama3.2:3b` model takes 2–5 minutes on a typical broadband connection. The model is stored compressed; the 2 GB download expands to approximately 2.3 GB on disk.
Step 4: How Do You Run the Model and Send Your First Prompt?
Start an interactive chat session:
ollama run llama3.2
# Ollama loads the model and shows a prompt:
>>> Send a message (/? for help)
Your First Conversation
Type a message and press Enter. The model streams its response token by token:
>>> What are local LLMs?
Local LLMs (large language models) are AI models that run entirely
on your own hardware (your laptop, desktop, or server). Unlike cloud
services such as ChatGPT or Claude, local LLMs process everything
locally with no data sent to external servers...
What to Expect: Speed, Quality, and Limitations
Speed varies by hardware. On a 2023 laptop (no GPU): expect 15–25 tokens/sec for a 3B model and 8–15 tokens/sec for an 8B model. On Apple M3 Pro: 50–80 tokens/sec for 8B. On NVIDIA RTX 4070 Ti: 90–130 tokens/sec for 8B.
Quality from `llama3.2:3b` is noticeably lower than GPT-4o or Claude 4.6 Sonnet on complex tasks. For summarization, simple Q&A, and code explanation, the output is useful. For multi-step reasoning or long-form writing, upgrade to an 8B or 13B model.
Context window: the Llama 3.2 architecture supports up to 128K tokens, but Ollama uses a much smaller default context window (2,048 tokens unless you raise the `num_ctx` parameter). Even with a larger window, response quality tends to degrade after ~16K tokens in a single conversation.
First response delay: the first response after `ollama run` includes model loading time (5–30 seconds). Subsequent responses in the same session are faster.
How Do You Use Your Local LLM Beyond the Terminal?
The Ollama terminal chat is useful for testing, but most real use cases need a better interface:
- Open WebUI: a full-featured web UI for Ollama. Run it with Docker: `docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main`. Access at http://localhost:3000.
- LM Studio: if you prefer a desktop GUI, How to Install LM Studio covers the full setup. LM Studio's built-in chat is polished and supports conversation history.
- API integration: Ollama's API at `localhost:11434` is compatible with the OpenAI SDK. Any application that accepts an OpenAI base URL can connect to your local model.
- VS Code / Cursor: extensions like Continue.dev connect to Ollama and provide local AI coding assistance directly in your editor.
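Because Ollama exposes an OpenAI-compatible endpoint at `/v1/chat/completions`, you can call it with nothing but the standard library. A minimal sketch (the `build_payload` and `chat` helper names are mine; it assumes Ollama's default port 11434 and a pulled `llama3.2`):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    # Same request shape the OpenAI chat API expects, which
    # Ollama's /v1 endpoint accepts as a drop-in replacement.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # needs Ollama running
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running and the model pulled:
# print(chat("llama3.2", "What are local LLMs?"))
```

Any tool that lets you override the OpenAI base URL can point at the same endpoint, which is why the integrations above work without Ollama-specific code.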
What Are Common Questions When Running Your First Local LLM?
The model response is very slow โ is this normal?
On CPU-only hardware, 8–20 tokens/sec is normal for a 7B model. Each token is roughly 0.75 words, so at 10 tokens/sec a 100-word response takes about 13 seconds. To speed up inference, use a smaller model (3B instead of 8B), enable GPU offloading if you have a compatible GPU, or choose a 4-bit quantization such as Q4_K_M, the common default that balances speed and quality.
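The arithmetic above is easy to sanity-check yourself (the 0.75 words-per-token ratio is a rough average for English text, not an exact constant):

```python
def response_time_seconds(words: int, tokens_per_sec: float,
                          words_per_token: float = 0.75) -> float:
    """Estimate how long a response takes to stream in full."""
    tokens = words / words_per_token  # ~133 tokens for 100 words
    return tokens / tokens_per_sec

# A 100-word answer at 10 tokens/sec:
print(round(response_time_seconds(100, 10), 1))  # about 13.3 seconds
```

Doubling your tokens/sec (a smaller model or GPU offload) halves the wait, which is why the 3B-vs-8B choice matters so much on CPU-only machines.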
Can I run two models at the same time?
Ollama can keep multiple models loaded simultaneously if you have enough RAM. By default, Ollama unloads a model after 5 minutes of inactivity. You can change this with the OLLAMA_KEEP_ALIVE environment variable. Running two 7B models simultaneously requires ~16 GB of RAM.
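Beyond the environment variable, Ollama's API also accepts a per-request `keep_alive` field, and an empty-prompt generate request simply preloads the model. A minimal sketch (the `preload` helper name is mine; it assumes Ollama's default port):

```python
import json
import urllib.request

def keep_alive_payload(model: str, keep_alive: str = "30m") -> dict:
    # An empty-prompt generate request with keep_alive loads the
    # model and keeps it resident; the server default is 5 minutes.
    return {"model": model, "keep_alive": keep_alive}

def preload(model: str, keep_alive: str = "30m") -> None:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(keep_alive_payload(model, keep_alive)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()  # needs a running Ollama server

# preload("llama3.2", "1h")  # keep llama3.2 loaded for an hour
```

Preloading both models this way avoids paying the load delay when you alternate between them, at the cost of keeping both resident in RAM.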
How do I stop Ollama from running in the background?
On macOS: click the llama icon in the menu bar and select Quit. On Linux: run `systemctl stop ollama`. On Windows: right-click the system tray icon and select Quit. To prevent Ollama from starting on login, remove it from your startup items.
What Are Your Next Steps After Your First Run?
Now that you have a working local LLM, explore what it can do. To understand which models perform best for your hardware, see Best Beginner Local LLM Models. For laptop-specific performance tips, see How to Run Local LLMs on a Laptop. For privacy and security best practices, see the Local LLM Security & Privacy Checklist.
Sources
- Ollama Model Library Documentation โ Official list of models and specifications
- Token Prediction Benchmarks โ Community performance data across hardware
- Llama 3.2 Model Card โ Official specifications and performance metrics
What Are Common Mistakes After Your First Run?
- Confusing token count with speed โ a 7B model generating 100 tokens at 20 tokens/sec takes 5 seconds, not instant.
- Running inference while the system is busy with other tasks, reducing effective tokens/sec significantly.
- Not checking context window limits: Ollama's default context is only a few thousand tokens, far below the 100K+ windows of frontier cloud models, regardless of what the model architecture supports.