Quick Answer
Llama 3.1 8B supports 128K context on Ollama. Qwen 2.5 14B reaches 1M tokens. Note: running full context dramatically increases VRAM: a 128K window needs 3–4× more VRAM than the default 4K window.
Updated: 2026-05
Key Takeaways
As of May 2026, most Ollama models advertise 128K context but fewer deliver useful output quality at that length. The problem is the "lost in the middle" effect: models trained on typical document lengths struggle to attend to information placed deep in a long context.
Two models reliably maintain quality at full 128K context on Ollama: Llama 3.1 8B (natively trained at 128K) and Qwen 2.5 14B (up to 1M tokens, though VRAM constraints make 128K the practical consumer limit). For most other 7B models, output quality degrades noticeably above 32K tokens.
If your task involves documents longer than 20,000 words, start with Llama 3.1 8B. If you need the strongest long-context quality and have 12+ GB VRAM, Qwen 2.5 14B is the better choice.
Expanding the context window increases VRAM usage significantly. The KV-cache, which stores attention state for all tokens in context, can add as much VRAM as the model weights themselves at 128K context.
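The scaling described above can be sketched with the usual back-of-the-envelope formula: one key vector and one value vector per layer, per KV head, per token. The configuration below (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumption chosen for illustration, not a measurement of any specific Ollama model, and the function name is hypothetical.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every layer, KV head, and token in the context.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Assumed configuration for illustration only (grouped query attention, 16-bit cache).
ASSUMED = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for ctx in (4096, 16384, 32768, 131072):
    size = kv_cache_bytes(ctx, **ASSUMED)
    print(f"{ctx:>7} tokens: {size / GIB:.2f} GiB")
```

Under these assumptions the estimate grows linearly with context length, from 0.5 GiB at 4K to 16 GiB at 128K. Models with fewer layers or KV heads, or a quantized cache, land lower; raising num_kv_heads to match the number of attention heads shows why models without grouped query attention need far more.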
The table below shows how KV-cache VRAM scales for a 7B model at Q4_K_M. These figures assume models using grouped query attention (GQA); models without GQA use significantly more KV-cache.
To save VRAM on everyday tasks, keep num_ctx at 4096 (for example, enter /set parameter num_ctx 4096 inside an ollama run session). Only expand to 32K or 128K when your specific task requires it. For model selection and RAM splitting, see the long-context local LLMs guide.
| Context Length | KV-Cache (7B) | Total VRAM (7B Q4) |
|---|---|---|
| 4K (default) | ~0.5 GB | ~5.5 GB |
| 16K | ~1.5 GB | ~6.5 GB |
| 32K | ~3 GB | ~8 GB |
| 128K | ~10 GB | ~15 GB |
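The advice to expand context only when a task needs it could be sketched as a small helper that picks the smallest tier from the table that fits a document. The tokens-per-word ratio and the reply reserve are rough assumptions, and pick_num_ctx is a hypothetical name; a real tokenizer count would be more accurate.

```python
import math

# Context tiers taken from the table above.
CONTEXT_TIERS = (4096, 16384, 32768, 131072)

# Assumption: rough average for English prose; actual ratios vary by tokenizer.
TOKENS_PER_WORD = 1.3

def pick_num_ctx(word_count, reply_tokens=1024):
    """Return the smallest tier that fits the document plus room for the reply.

    Returns None when even the 128K tier is too small, so the caller
    has to handle that case explicitly.
    """
    needed = math.ceil(word_count * TOKENS_PER_WORD) + reply_tokens
    for tier in CONTEXT_TIERS:
        if needed <= tier:
            return tier
    return None
```

With these assumptions, a 20,000-word document needs roughly 27K tokens and lands in the 32K tier, while a short prompt stays at the 4K default and keeps the KV-cache small.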
To use the full 128K window, set num_ctx to 131072: start ollama run llama3.1:8b, then enter /set parameter num_ctx 131072 at the prompt. Without this setting, Ollama defaults to 2048–4096 tokens regardless of the model's maximum capability.
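When calling Ollama from a script rather than the interactive CLI, the same context setting can be passed per request. The sketch below assumes a local Ollama server on the default port and uses the num_ctx field under options in the REST API's generate endpoint; build_request and generate are illustrative names, not part of any Ollama library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt, num_ctx):
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, num_ctx):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("llama3.1:8b", long_document_prompt, 131072) would request the full 128K window for that one call, so other requests can stay at a smaller, cheaper context.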