PromptQuorum

Which Ollama Models Support 128K Context?

Quick Answer

Llama 3.1 8B supports 128K context on Ollama. Qwen 2.5 14B reaches 1M tokens. Note that running at full context sharply increases VRAM use: a 128K window needs roughly 3× the VRAM of the default 4K window.

  • Llama 3.1 8B: 128K context, ~16 GB VRAM at full context
  • Qwen 2.5 14B: up to 1M tokens, 24+ GB VRAM at full context
  • Set num_ctx to 4096 for normal use to save VRAM (for example, /set parameter num_ctx 4096 inside an ollama run session)

Updated: 2026-05

Ollama · Advanced

Key Takeaways

  • Most 7B Ollama models advertise 128K context but degrade in quality above 32K tokens
  • Llama 3.1 8B and Qwen 2.5 14B are the two models that deliver reliable quality at full 128K
  • A 128K context window can nearly triple VRAM usage: a 7B Q4 model needs ~15 GB at 128K vs ~5.5 GB at the default 4K
  • Set num_ctx to 4096 for everyday tasks; only expand context when you need it

Which Models Actually Reach 128K

As of May 2026, most Ollama models advertise 128K context but fewer deliver useful output quality at that length. The problem is the "lost in the middle" effect: models trained on typical document lengths struggle to attend to information placed deep in a long context.

Two models reliably maintain quality at full 128K context on Ollama: Llama 3.1 8B (natively trained at 128K) and Qwen 2.5 14B (up to 1M tokens, though VRAM constraints make 128K the practical consumer limit). For most other 7B models, output quality degrades noticeably above 32K tokens.

If your task involves documents longer than 20,000 words, start with Llama 3.1 8B. If you need the strongest long-context quality and have 12+ GB VRAM, Qwen 2.5 14B is the better choice, though running it at its full context length needs 24+ GB.

The VRAM Cost of Long Context

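Expanding the context window increases VRAM usage significantly. The KV-cache, which stores the attention keys and values for every token in context, grows linearly with context length. At 128K context it can need about twice as much VRAM as the model weights themselves (~10 GB versus ~5 GB for a 7B Q4 model).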

The table below shows how KV-cache VRAM scales for a 7B model at Q4_K_M. These figures assume models using grouped query attention (GQA); models without GQA use significantly more KV-cache.

To save VRAM on everyday tasks, set num_ctx to 4096 when running Ollama (for example, enter /set parameter num_ctx 4096 in an ollama run session). Only expand to 32K or 128K when your specific task requires it. For the full guide on long-context local LLMs including model selection and RAM splitting, see the long-context local LLMs guide.

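| Context Length | KV-Cache (7B) | Total VRAM (7B Q4) |
|----------------|---------------|--------------------|
| 4K (default)   | ~0.5 GB       | ~5.5 GB            |
| 16K            | ~1.5 GB       | ~6.5 GB            |
| 32K            | ~3 GB         | ~8 GB              |
| 128K           | ~10 GB        | ~15 GB             |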
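
The linear growth behind the table can be sketched with a back-of-the-envelope calculation. The snippet below is only an estimate under stated assumptions: it uses the published Llama 3.1 8B shape (32 layers, 8 key/value heads under GQA, head dimension 128) and 16-bit cache entries, and the names `AttentionShape` and `kv_cache_bytes` are illustrative, not part of Ollama. It matches the table at 4K but gives higher figures at longer contexts. The table does not state its own assumptions, so treat both sets of numbers as rough guides rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Architecture values that determine KV-cache size (assumed, not measured)."""
    n_layers: int             # number of transformer blocks
    n_kv_heads: int           # key/value heads (fewer than query heads under GQA)
    head_dim: int             # dimension of each attention head
    bytes_per_value: int = 2  # 2 bytes per entry, i.e. a 16-bit cache


# Assumption: shape taken from the published Llama 3.1 8B configuration.
LLAMA_3_1_8B = AttentionShape(n_layers=32, n_kv_heads=8, head_dim=128)


def kv_cache_bytes(shape: AttentionShape, n_tokens: int) -> int:
    """Estimate KV-cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * n_tokens


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


for ctx in (4096, 16384, 32768, 131072):
    print(f"{ctx:>7} tokens: {gib(kv_cache_bytes(LLAMA_3_1_8B, ctx)):.1f} GiB")
```

Setting `n_kv_heads` equal to the number of query heads (32 for this shape) models a network without GQA and multiplies every figure by four, which is why the GQA caveat above matters.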

Quick Answers About Long Context Models

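How do I enable 128K context in Ollama?
Start the model with ollama run llama3.1:8b, then enter /set parameter num_ctx 131072 at the prompt. You can also put PARAMETER num_ctx 131072 in a Modelfile or pass num_ctx in the options of an API request. Without this setting, Ollama defaults to 2048–4096 tokens (depending on version) regardless of the model's maximum capability.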
Why does long context use so much VRAM?
The KV-cache stores attention state for every token in context, so it grows with context length. At 128K tokens, this cache can be larger than the model weights themselves. A 7B model at Q4 needs ~5 GB for weights but ~10 GB of KV-cache at 128K context.
Is 128K context useful for coding?
Yes, when working across large codebases. Fitting an entire repository or multiple files into context dramatically improves refactoring and cross-file reasoning tasks. For coding at 128K, Qwen 2.5 14B is the recommended model.
Which model is best for long-document analysis?
Qwen 2.5 14B at Q4_K_M is the top choice for long documents on Ollama: it maintains quality at full context length better than 7B alternatives. See Ollama vision models if you also need image understanding alongside long documents.
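
For scripted use, the context window can also be set per request through Ollama's local REST API by passing `num_ctx` in the request `options`. The following is a minimal sketch, assuming an Ollama server on the default address `http://localhost:11434` with the `llama3.1:8b` model already pulled; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("llama3.1:8b", long_document_prompt, num_ctx=131072)` would request the full 128K window for that call only. Keeping the default at 4096 follows the advice above to expand context only when a task needs it.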