Which Ollama Models Support 128K Context?
Quick Answer
Llama 3.3 8B supports 128K context on Ollama. Qwen 3 14B reaches 1M tokens. Note: running full context dramatically increases VRAM use. A 128K window needs roughly 3× the VRAM of the default 4K window (about 15 GB versus 5.5 GB for a 7B Q4 model).
- Llama 3.3 8B: 128K context, ~16 GB VRAM at full context
- Qwen 3 14B: up to 1M tokens, 24+ GB VRAM at full context
- Set num_ctx to 4096 for normal use to save VRAM
Updated: 2026-05
Key Takeaways
- Most 7B Ollama models advertise 128K context but degrade in quality above 32K tokens
- Llama 3.3 8B and Qwen 3 14B are the two models that deliver reliable quality at full 128K
- A 128K context window can nearly triple VRAM usage: a 7B Q4 model needs ~15 GB at 128K vs ~5.5 GB at default
- Set num_ctx to 4096 for everyday tasks; only expand context when you need it
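When Ollama is called from a script rather than the command line, the context limit can be passed per request through the REST API's options.num_ctx field. The following is a minimal sketch, assuming a local Ollama server on its default port (11434); the helper names build_generate_request and generate are illustrative, and the model tag is only an example.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=4096):
    """Build a POST request for /api/generate with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window for this request
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx=4096):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

For everyday prompts, generate("llama3.1:8b", "Summarize this paragraph: ...") keeps the 4096-token default of this sketch; passing num_ctx=131072 requests the full 128K window only for the calls that need it.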
Which Models Actually Reach 128K
As of May 2026, many Ollama models advertise 128K context, but few deliver useful output quality at that length. The problem is the "lost in the middle" effect: models trained on typical document lengths struggle to attend to information placed deep in a long context.
Two models reliably maintain quality at full 128K context on Ollama: Llama 3.3 8B (natively trained at 128K) and Qwen 3 14B (up to 1M tokens, though VRAM constraints make 128K the practical consumer limit). For most other 7B models, output quality degrades noticeably above 32K tokens.
If your task involves documents longer than 20,000 words, start with Llama 3.3 8B. If you need the strongest long-context quality and have 12+ GB VRAM, Qwen 3 14B is the better choice.
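To turn a word count like the 20,000-word threshold above into a context size, a rough conversion is enough. The sketch below assumes about 1.3 tokens per English word (a common rule of thumb, not a property of any specific tokenizer) and reserves 1024 tokens for the reply; both values and the function names are illustrative assumptions.

```python
# Context sizes discussed in this article, smallest first.
CONTEXT_SIZES = (4096, 16384, 32768, 131072)

# Assumed rule of thumb: ~130 tokens per 100 English words.
TOKENS_PER_100_WORDS = 130


def estimate_tokens(word_count):
    """Rough token estimate for a document, rounded up."""
    return -(-word_count * TOKENS_PER_100_WORDS // 100)  # ceiling division


def pick_num_ctx(word_count, reply_tokens=1024):
    """Return the smallest listed context size that fits the document plus
    room for the reply, or None if even 128K is too small."""
    needed = estimate_tokens(word_count) + reply_tokens
    for size in CONTEXT_SIZES:
        if needed <= size:
            return size
    return None
```

Under these assumptions, a 20,000-word document needs about 27,000 tokens, so pick_num_ctx(20000) selects the 32K window rather than the full 128K.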
The VRAM Cost of Long Context
Expanding the context window increases VRAM usage significantly. The KV-cache, which stores the attention keys and values for every token in context, grows with context length; at 128K it can take as much VRAM as the model weights themselves, or more.
The table below shows how KV-cache VRAM scales for a 7B model at Q4_K_M. These figures assume models using grouped query attention (GQA); models without GQA use significantly more KV-cache.
To save VRAM on everyday tasks, set num_ctx to 4096 when running Ollama. Only expand to 32K or 128K when your specific task requires it. For model selection and RAM splitting, see the long-context local LLMs guide.
| Context Length | KV-Cache (7B) | Total VRAM (7B Q4) |
|---|---|---|
| 4K (default) | ~0.5 GB | ~5.5 GB |
| 16K | ~1.5 GB | ~6.5 GB |
| 32K | ~3 GB | ~8 GB |
| 128K | ~10 GB | ~15 GB |
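The scaling in the table can be approximated with the standard KV-cache size formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below is a rough estimate, not Ollama's own accounting. It assumes 32 layers, 8 KV heads, a head dimension of 128, and a 16-bit cache, values typical of an 8B model with GQA; kv_cache_gib is a hypothetical helper name.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8,
                 head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size in GiB for a given context length.

    The defaults are assumed values for an 8B GQA model with a 16-bit cache.
    """
    # Factor of 2: each layer stores one key and one value vector per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * bytes_per_token / 2**30


if __name__ == "__main__":
    for ctx in (4096, 16384, 32768, 131072):
        with_gqa = kv_cache_gib(ctx)
        # Without GQA, every attention head keeps its own keys and values.
        without_gqa = kv_cache_gib(ctx, n_kv_heads=32)
        print(f"{ctx:>7} tokens: GQA {with_gqa:.1f} GiB, "
              f"no GQA {without_gqa:.1f} GiB")
```

With these assumptions the estimate matches the table's ~0.5 GB at 4K and shows a model without GQA (32 KV heads instead of 8) needing four times as much. At longer contexts this unquantized 16-bit estimate runs higher than the table's approximate figures, so treat both as rough guides.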
Quick Answers About Long Context Models
How do I enable 128K context in Ollama?
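Set the num_ctx parameter to 131072. In an interactive session started with ollama run llama3.1:8b, enter /set parameter num_ctx 131072; in a Modelfile, add the line PARAMETER num_ctx 131072. Without this setting, Ollama defaults to 2048–4096 tokens regardless of the model's maximum capability.
Why does long context use so much VRAM?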
Is 128K context useful for coding?
Which model is best for long-document analysis?