PromptQuorum
Getting Started

How Do You Run Your First Local LLM: From Install to First Response in 10 Minutes

7 min read · By Hans Kuepper, founder of PromptQuorum, a multi-model dispatch tool

Running your first local LLM takes under 10 minutes with Ollama. Install Ollama, run one command to pull a model, and start chatting in your terminal, with no API key, no account, and no internet connection after the initial download. As of April 2026, a strong beginner choice is Llama 3.2 3B, which generates 25–45 tokens/sec on a modern laptop CPU.

Key Takeaways

  • The fastest path: install Ollama → run `ollama run llama3.2` → chat in your terminal. Total time: under 5 minutes on a fast connection.
  • For 8 GB RAM machines: start with `llama3.2:3b` (2 GB download) or `phi3:mini` (2.3 GB). Both run on any modern laptop.
  • Expect 15–40 tokens/sec on CPU, 60–120 tokens/sec on a mid-range GPU or Apple Silicon.
  • First responses may feel slower than cloud APIs, since local models trade speed for privacy and zero cost.
  • After the initial model download, everything runs offline. No internet needed for subsequent sessions.

Step 1: How Do You Install Ollama?

Ollama is the fastest way to run a local LLM. Install it with one command or a 2-minute download:

```bash
# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com/download
```

How Do You Verify Ollama Is Running?

After installation, confirm Ollama is active:

```bash
curl http://localhost:11434
# Expected output: Ollama is running
```
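If that check returns "connection refused" instead, the server isn't running yet. A quick fix (using Ollama's standard `serve` subcommand):

```bash
# Start the Ollama server in the foreground (leave this terminal open)
ollama serve

# In a second terminal, confirm it responds and check the installed version
curl http://localhost:11434
ollama --version
```

On macOS and Windows the desktop app normally starts the server for you; `ollama serve` is mainly needed on Linux or after a manual shutdown.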

Step 2: Which Model Should You Choose?

Pick a model based on your available RAM. When in doubt, start with `llama3.2:3b` β€” it runs on any machine with 4 GB of RAM and produces useful output:

| Your RAM | Recommended Model | Download Size | Why |
|----------|-------------------|---------------|-----|
| 4 GB | llama3.2:1b | ~1.3 GB | Smallest usable Llama model |
| 8 GB | llama3.2:3b | ~2 GB | Best quality/size ratio for beginners |
| 8–16 GB | llama3.1:8b | ~4.7 GB | Strong general-purpose model |
| 16+ GB | mistral:7b or qwen2.5:7b | ~4–5 GB | Competitive quality, fast inference |
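Not sure how much RAM your machine has? A quick check (the commands differ by OS; `free` ships with Linux, `sysctl` with macOS):

```bash
# Linux: human-readable memory summary (look at the "total" column)
free -h
# macOS equivalent (run this instead): sysctl -n hw.memsize
```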

Step 3: How Do You Pull the Model?

Download the model with `ollama pull`. The model is saved to `~/.ollama/models` and only needs to be downloaded once:

```bash
ollama pull llama3.2

# Or pull a specific size variant
ollama pull llama3.2:3b
ollama pull llama3.1:8b
```
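Once a pull finishes, `ollama list` confirms what's on disk, showing each model's tag and size; `ollama rm` removes one you no longer need:

```bash
# List all locally downloaded models
ollama list

# Remove a model to free disk space
ollama rm llama3.2:1b
```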

What Does the Download Look Like?

```text
pulling manifest
pulling 966de95ca8dc... 100% ▕████████████████▏ 1.9 GB
pulling 9f436a92eb8b... 100% ▕████████████████▏  42 B
verifying sha256 digest
writing manifest
success
```

– Ollama terminal output during model pull

Ollama shows download progress in the terminal. A `llama3.2:3b` model takes 2–5 minutes on a typical broadband connection. The model is stored compressed; the 2 GB download expands to approximately 2.3 GB on disk.
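To see how much space your models actually occupy, check the store directory directly (`~/.ollama/models` is the default location on macOS and Linux):

```bash
# Total disk usage of all downloaded models
du -sh ~/.ollama/models
```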

Step 4: How Do You Run the Model and Send Your First Prompt?

Start an interactive chat session:

```bash
ollama run llama3.2

# Ollama loads the model and shows a prompt:
>>> Send a message (/? for help)
```
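You don't have to use the interactive prompt: `ollama run` also accepts a one-shot prompt as an argument, which is handy for scripting (the prompt text below is just an example):

```bash
# One-shot prompt: prints the response and exits
ollama run llama3.2 "Explain what a local LLM is in one sentence."
```

Inside an interactive session, type `/bye` (or press Ctrl+D) to exit.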

Your First Conversation

Type a message and press Enter. The model streams its response token by token:

```text
>>> What are local LLMs?

Local LLMs (large language models) are AI models that run entirely
on your own hardware: your laptop, desktop, or server. Unlike cloud
services such as ChatGPT or Claude, local LLMs process everything
locally with no data sent to external servers...
```

What to Expect: Speed, Quality, and Limitations

Speed varies by hardware. On a 2023 laptop (no GPU): expect 15–25 tokens/sec for a 3B model and 8–15 tokens/sec for an 8B model. On Apple M3 Pro: 50–80 tokens/sec for 8B. On NVIDIA RTX 4070 Ti: 90–130 tokens/sec for 8B.

Quality from `llama3.2:3b` is noticeably lower than GPT-4o or Claude 4.6 Sonnet on complex tasks. For summarization, simple Q&A, and code explanation, the output is useful. For multi-step reasoning or long-form writing, upgrade to an 8B or 13B model.

Context window: Llama 3.2 supports up to 128K tokens, but Ollama loads models with a much smaller default context (2K–4K tokens depending on version) unless you raise the `num_ctx` parameter. Even with a larger setting, quality tends to degrade after ~16K tokens in a single conversation.
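To use a longer context, raise `num_ctx` for a session or per request; a sketch using the interactive `/set` command and the REST API (the 8192 value is an example, sized to your RAM):

```bash
# Inside an interactive session, raise the context window:
#   >>> /set parameter num_ctx 8192

# Or per-request via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the following document: ...",
  "options": { "num_ctx": 8192 }
}'
```

A larger context window uses more RAM, so increase it gradually on 8 GB machines.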

First response delay: the first response after `ollama run` includes model loading time (5–30 seconds). Subsequent responses in the same session are faster.
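You can hide that load time by warming the model before you need it: a generate request with no prompt loads the weights into memory, and the `keep_alive` field controls how long they stay resident:

```bash
# Load the model into memory without generating any text
curl http://localhost:11434/api/generate -d '{"model": "llama3.2"}'

# Keep it resident for 30 minutes of inactivity instead of the default 5
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": "30m"}'
```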

How Do You Use Your Local LLM Beyond the Terminal?

The Ollama terminal chat is useful for testing, but most real use cases need a better interface:

  • Open WebUI: a full-featured web UI for Ollama. Run it with Docker: `docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main`. Access at http://localhost:3000.
  • LM Studio: if you prefer a desktop GUI, How to Install LM Studio covers the full setup. LM Studio's built-in chat is polished and supports conversation history.
  • API integration: Ollama's API at `localhost:11434` is compatible with the OpenAI SDK. Any application that accepts an OpenAI base URL can connect to your local model.
  • VS Code / Cursor: extensions like Continue.dev connect to Ollama and provide local AI coding assistance directly in your editor.

What Are Common Questions When Running Your First Local LLM?

The model response is very slow β€” is this normal?

On CPU-only hardware, 8–20 tokens/sec is normal for a 7B model. Each token is roughly 0.75 English words, so at 10 tokens/sec a 100-word response takes about 13 seconds. To speed up inference, use a smaller model (3B instead of 8B), enable GPU offloading if you have a compatible GPU, or choose a lower-bit quantization (Ollama's default pulls are typically Q4_K_M, already a good speed/quality balance).
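To measure your actual throughput instead of guessing, Ollama's `--verbose` flag prints timing statistics, including an `eval rate` in tokens/sec, after each response:

```bash
# Prints load time, prompt eval rate, and eval rate after the response
ollama run llama3.2 --verbose "Explain tokenization in two sentences."
```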

Can I run two models at the same time?

Ollama can keep multiple models loaded simultaneously if you have enough RAM. By default, Ollama unloads a model after 5 minutes of inactivity. You can change this with the OLLAMA_KEEP_ALIVE environment variable. Running two 7B models simultaneously requires ~16 GB of RAM.
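A sketch of both knobs: `OLLAMA_KEEP_ALIVE` sets the idle timeout for models the server loads, and `ollama ps` shows which models are currently in memory:

```bash
# Keep loaded models in memory for 30 minutes of inactivity (default is 5m)
OLLAMA_KEEP_ALIVE=30m ollama serve

# In another terminal: list currently loaded models and their memory use
ollama ps
```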

How do I stop Ollama from running in the background?

On macOS: click the llama icon in the menu bar and select Quit. On Linux: run `systemctl stop ollama`. On Windows: right-click the system tray icon and select Quit. To prevent Ollama from starting on login, remove it from your startup items.

What Are Your Next Steps After Your First Run?

Now that you have a working local LLM, explore what it can do. To understand which models perform best for your hardware, see Best Beginner Local LLM Models. For laptop-specific performance tips, see How to Run Local LLMs on a Laptop. For privacy and security best practices, see the Local LLM Security & Privacy Checklist.

Sources

  • Ollama Model Library Documentation – Official list of models and specifications
  • Token Prediction Benchmarks – Community performance data across hardware
  • Llama 3.2 Model Card – Official specifications and performance metrics

What Are Common Mistakes After Your First Run?

  • Confusing token count with speed: a 7B model generating 100 tokens at 20 tokens/sec takes 5 seconds, not an instant.
  • Running inference while the system is busy with other tasks, which significantly reduces effective tokens/sec.
  • Not checking context limits: Ollama runs models with a small default context (typically 2K–8K tokens), far below the 100K+ of frontier models, unless you raise `num_ctx`.

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum for free →

