
Ollama Context Window Configuration: 64K–1M Tokens on Strix Halo, RTX, Mac 2026

8 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

Llama 4 Scout supports up to 10M token context (practical: 256K-1M on consumer hardware). DeepSeek V4-Flash delivers 1M tokens. Qwen 3.6 supports 256K natively (extendable to 1M via YaRN). While 7B-8B models stay reliable at 16K-32K tokens, new MoE models and 70B+ variants extend practical limits to 256K-1M. Ollama defaults to 2048 -- set num_ctx explicitly to use long context.

The 2026 context window revolution is here. Llama 4 Scout supports up to 10M token context (practical: 256K-1M), DeepSeek V4-Flash delivers 1M tokens, and Qwen 3.6 natively supports 256K tokens (extendable to 1M via YaRN). While most 7B-8B models plateau at 16K-32K practical context, new MoE models push practical limits to 256K-1M tokens on consumer hardware. Ollama defaults to 2048 tokens -- this guide shows which models support what, RAM requirements at each tier, and how to configure long context.

Key Takeaways

  • Llama 4 Scout (MoE) supports up to 10M tokens. DeepSeek V4-Flash supports 1M tokens, and Qwen 3.6 supports 256K tokens (extendable to 1M via YaRN). May 2026 marks the first generation of million-token-capable open models.
  • Practical context by model size: 7B-8B models maintain quality at 16K-32K tokens. 70B+ models and MoE models extend this to 256K-1M tokens. Llama 4 Scout can handle full million-token contexts on sufficient VRAM.
  • RAM scales with context length AND model size. Qwen 3.6 27B at Q4_K_M needs ~22 GB at 128K, ~65+ GB at 1M tokens. Llama 4 Scout needs 150+ GB for full 10M context.
  • Lost in the Middle still applies: LLMs miss details from middle sections of the context. Mitigation: keep critical info at prompt start, use RAG for search, or process in overlapping chunks.
  • Long context excels for holistic analysis of complete documents (codebases, contracts, books). RAG excels for search-heavy tasks across many documents. Choose by task type, not context size alone.
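  • Ollama defaults to 2048 tokens -- not 128K or 1M. Set num_ctx explicitly in a Modelfile to access full context. For massive contexts (500K+), enable flash attention and a quantized KV cache (for example OLLAMA_FLASH_ATTENTION=1 with OLLAMA_KV_CACHE_TYPE=q8_0) to reduce the risk of running out of memory.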

What Is Context Length and Why Does It Matter for Local LLMs?

Context length is the maximum number of tokens a model can process in a single inference call -- the combined size of the input (your document, conversation history, system prompt) and the output (the model's response). One token ≈ 0.75 words in English; 128K tokens ≈ 96,000 words.
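The rule of thumb above can be turned into a quick estimate. The sketch below is illustrative only: `estimate_tokens` is a hypothetical helper, and the 0.75 words-per-token ratio is the approximation quoted above, not a tokenizer measurement.

```shell
# estimate_tokens WORD_COUNT
# Rough token estimate using ~0.75 words per token (tokens = words * 4 / 3).
# Integer arithmetic only; real counts depend on the model's tokenizer.
estimate_tokens() {
  echo $(( $1 * 4 / 3 ))
}

estimate_tokens 96000   # prints 128000

# For a file, feed in the word count from wc:
#   estimate_tokens "$(wc -w < report.txt)"
```

If the estimate lands close to the num_ctx you plan to set, leave headroom for the system prompt and the model's response, since both count against the same window.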

For local LLM use cases, long context enables: summarizing entire books or long reports, analyzing full codebases in one prompt, processing hours of meeting transcripts, and maintaining long conversation histories without losing earlier context.

The key distinction is between the advertised context length (what the model architecture supports) and the practical context length (where quality stays reliable). A model may technically support 128K tokens but show degraded quality on information presented at the 100K token mark.

Which Local LLMs Support 128K Token Context in 2026?

| Model | Context Window | Practical Limit | Ollama Command |
| --- | --- | --- | --- |
| Llama 3.1 8B | 128K | ~32K reliable | ollama run llama3.1:8b |
| Llama 3.2 3B | 128K | ~16K reliable | ollama run llama3.2:3b |
| Llama 3.3 70B | 128K | ~64K reliable | ollama run llama3.3:70b |
| Qwen2.5 7B | 128K | ~32K reliable | ollama run qwen2.5:7b |
| Qwen2.5 72B | 128K | ~64K reliable | ollama run qwen2.5:72b |
| Mistral Small 3.1 24B | 128K | ~32K reliable | ollama run mistral-small3.1 |
| Gemma 2 2B | 8K | ~6K reliable | ollama run gemma2:2b |
| Mistral 7B v0.3 | 32K | ~16K reliable | ollama run mistral:7b |

6 local LLM models with 128K context support -- practical reliable limit is 32K for 7B models, 64K for 70B models.

How Much RAM Does Long Context Processing Require?

RAM usage scales with both model size and context length. The KV cache (key-value cache) stores attention states for all processed tokens -- this grows linearly with context length.

As of April 2026, a 7B model at Q4_K_M with 4K context uses ~6 GB RAM. The same model with 32K context uses ~8-9 GB RAM. With 128K context: ~12-16 GB RAM.

| Model | 4K Context | 32K Context | 128K Context |
| --- | --- | --- | --- |
| Llama 3.1 8B Q4_K_M | ~6 GB | ~9 GB | ~14 GB |
| Qwen2.5 14B Q4_K_M | ~9 GB | ~12 GB | ~18 GB |
| Mistral Small 3.1 24B Q4_K_M | ~14 GB | ~17 GB | ~24 GB |
| Llama 3.3 70B Q4_K_M | ~40 GB | ~45 GB | ~55 GB |

KV cache RAM scales with context length -- a 7B model at Q4_K_M needs ~6 GB at 4K context but ~14 GB at 128K context.
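The linear growth of the KV cache can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not Ollama's actual allocator: `kv_cache_mib` is a hypothetical helper, the architecture numbers are Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128), and the bytes-per-element argument assumes a 16-bit cache (2) or roughly an 8-bit cache (1, ignoring quantization block overhead). It covers only the cache, not weights or runtime buffers, so it is not expected to reproduce the table's totals.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS BYTES_PER_ELEMENT
# Cache size in MiB: one key and one value vector (factor 2) per layer,
# per KV head, per token. Needs 64-bit shell arithmetic.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

kv_cache_mib 32 8 128 4096 2     # prints 512   (4K context, 16-bit cache)
kv_cache_mib 32 8 128 32768 2    # prints 4096  (32K context, 16-bit cache)
kv_cache_mib 32 8 128 131072 2   # prints 16384 (128K context, 16-bit cache)
kv_cache_mib 32 8 128 131072 1   # prints 8192  (128K context, ~8-bit cache)
```

Each 8× increase in context gives an 8× larger cache, while the model weights stay constant. Halving the bytes per element halves the cache, which is why cache quantization matters most at long context.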

Why Is Practical Context Length Shorter Than the Advertised Maximum?

LLMs trained with RoPE positional encodings (used by Llama, Qwen, Mistral) can technically process tokens up to their maximum context length, but quality degrades in a known pattern called the "lost in the middle" effect.

Research shows that language models are best at using information at the beginning and end of the context window. Information placed in the middle of a very long context is retrieved less reliably. In practice, this means a model with a 128K context window may reliably answer questions about content in the first 32K tokens and last 16K tokens, but miss details from the 40K-80K token range.

For local models specifically, the practical reliable limit scales with model size: 3B models ≈ 8K-16K reliable; 7B-8B models ≈ 16K-32K reliable; 70B models ≈ 64K reliable. These are approximate -- the actual limit depends on the specific task and how "important" the retrieved information is.

Long context windows enable more input, but prompt structure determines whether the model uses that context effectively. Techniques like RAG, prompt chaining, and context window management strategies are covered in the prompt engineering guide.

The "lost in the middle" effect: LLMs reliably recall content at start and end of the context window but miss the 40K-80K token range.

How Do You Set Context Length in Ollama?

Ollama defaults to 2048 tokens of context unless configured otherwise.

Context window size determines how much text a model can process, but prompt structure determines how effectively it uses that context. For a deep dive into why models lose track of earlier input and strategies to mitigate it, see context windows explained: why AI forgets.

To use a model's full context window, set num_ctx explicitly:

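```bash
# Set context length at runtime, inside an interactive session
ollama run llama3.1:8b
# then, at the >>> prompt, type: /set parameter num_ctx 32768

# Or set a server-wide default when starting the Ollama server
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Or create a custom model with a Modelfile
cat << 'EOF' > Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 32768
EOF
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k
```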
Setting num_ctx 32768 in a Modelfile unlocks 32K context in Ollama -- verified with `ollama ps` showing CTX column.
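A per-request alternative is to pass num_ctx in the options field of Ollama's REST API. The sketch below only builds the JSON body so that it can be inspected without a running server. `generate_payload` is a hypothetical helper, and it assumes the model name and prompt contain no quotes or backslashes, since it does no JSON escaping.

```shell
# generate_payload MODEL PROMPT NUM_CTX
# Prints a JSON body for POST /api/generate with a per-request context size.
# No JSON escaping is done: keep quotes and backslashes out of the arguments.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$2" "$3"
}

generate_payload llama3.1:8b "Summarize this report." 32768

# With a local Ollama server on the default port, send it with:
#   generate_payload llama3.1:8b "Summarize this report." 32768 |
#     curl -s http://localhost:11434/api/generate -d @-
```

Setting num_ctx per request suits scripts that mix short and long prompts, because only the long requests pay the extra KV cache memory.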

Long Context Local LLMs: Regional Context

EU / GDPR + AI Act: The EU AI Act (in force since August 2024, with its first obligations applying from February 2025) classifies AI systems processing personal data at scale as potentially high-risk. Long-context local inference for legal document analysis, medical record summarization, or HR document processing sits in this risk tier. Running locally eliminates the third-party data processor risk under GDPR Article 28 -- no data leaves the organization.

For German BSI compliance on AI systems processing sensitive documents locally: the recommended configuration is a 7B model at Q4_K_M with 32K context (fits in 9-10 GB RAM on a standard workstation). This provides reliable quality on documents up to 50 pages while keeping all data on-premises. Llama 3.1 8B and Mistral Small 3.1 are the recommended EU compliance choices for long-context document processing.

For French CNIL guidelines on AI and personal data: local inference via Ollama with no external API calls satisfies the requirement that personal data not be processed by third-party AI providers without a valid legal basis.

Japan (METI): Japanese documents require 1.5-2× more tokens than equivalent English documents due to tokenizer differences. A 50-page Japanese report may consume 25K-35K tokens -- within the reliable range of Qwen2.5 7B (32K practical limit) but requiring explicit context configuration in Ollama: PARAMETER num_ctx 32768. For Japanese legal and financial documents, Qwen2.5 14B at Q4_K_M with 32K context (~12 GB RAM) provides the best quality-per-RAM for Japanese long-context processing. Qwen2.5's native Japanese tokenizer processes Japanese text 30-40% more efficiently than Llama.

China: Under China's Data Security Law (数据安全法), processing sensitive documents through cloud APIs requires additional regulatory compliance. Local long-context inference via Qwen2.5 (Alibaba) keeps all document content on-premises. For Chinese enterprise document processing, Qwen2.5 72B with 32K context on a local workstation (~45 GB RAM) provides near-cloud quality at full data sovereignty. Qwen2.5's native Chinese tokenizer makes it 30-40% more token-efficient than Llama for Chinese-language documents.

Common Mistakes with Long Context Local LLMs

  • Assuming 128K context works as well as 4K: The "lost in the middle" effect means information presented 30K-80K tokens ago is retrieved less reliably than information at the start or end. For critical document analysis, chunk long documents into 16K-32K sections and process each separately rather than feeding an entire 100K document at once.
  • Not increasing Ollama's default context size: Ollama defaults to 2048 tokens of context regardless of the model's maximum. A conversation exceeding 2048 tokens will truncate earlier messages. Always set num_ctx explicitly: add PARAMETER num_ctx 32768 to your Modelfile, or type /set parameter num_ctx 32768 inside an ollama run session.
  • Running long context on insufficient RAM: A 7B model with 128K context on 8 GB RAM total causes severe swap usage. Model weights (~4.5 GB) plus 128K KV cache (~8+ GB) exceed 8 GB. Reduce context to 32K (fits ~9 GB) or use 16+ GB RAM for 128K context inference.
  • Ignoring time-to-first-token at long context: Generation speed is not the only latency factor. At 32K context, the time-to-first-token (TTFT) can be 5-15 seconds on consumer hardware -- the model must process all 32K input tokens before generating a single output token. This prefill cost grows at least linearly with context length. For interactive use, limit context to 8K-16K. Reserve 32K+ contexts for batch processing where TTFT is acceptable.
  • Using RAG when long context is the correct tool (and vice versa): RAG is better for document search across many documents. Long context is better when you need the model to reason over a complete, coherent document -- a contract, a codebase, a book chapter -- where missing any part would break the analysis. Splitting a 10-page legal contract into RAG chunks can cause cross-reference errors that long context avoids. Choose by task type, not by default preference.

FAQ

Can I summarize an entire book with a local LLM?

A typical 300-page book is 90,000-120,000 words -- approximately 120K-160K tokens. This exceeds the practical reliable context of 7B models (~32K) and of 70B models (~64K), so chunked processing is the practical approach at either size. Split the book into 20K-word chapters (about 27K tokens each), summarize each, then summarize the chapter summaries.
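The summarize-then-summarize-again workflow can be sketched as a small script. This is a minimal illustration under assumptions, not a tested pipeline: `summarize` and `summarize_book` are hypothetical helpers, the model name is the llama3.1-32k model from the Modelfile example in this article, the book is assumed to be already split into chapter files, and the sketch assumes `ollama run` treats piped stdin as part of the prompt.

```shell
# summarize TEXT
# Hypothetical helper: one summarization call against a local model.
# Assumes `ollama run` reads piped stdin and combines it with the prompt argument.
summarize() {
  printf '%s\n' "$1" | ollama run llama3.1-32k "Summarize the text above in one paragraph."
}

# summarize_book CHAPTER_FILE...
# Map step: summarize each chapter file on its own.
# Reduce step: summarize the concatenated chapter summaries.
summarize_book() {
  combined=""
  for chapter in "$@"; do
    combined="$combined
$(summarize "$(cat "$chapter")")"
  done
  summarize "$combined"
}

# Usage (requires a local Ollama install and pre-split chapter files):
#   summarize_book chapter-*.txt
```

Keeping each chapter near 20K words leaves room inside a 32K window for the instruction and the generated summary.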

How many pages of text fit in 32K tokens?

Approximately 24,000 words at 0.75 words per token: about 95 pages at 250 words per page, or 50-70 pages of denser layouts (350-480 words per page). A 32K token context holds a short novel, a full research paper with appendices, or a complete technical specification document.

Does increasing context length slow down inference?

Yes -- processing a 32K context takes approximately 3-4× longer than processing a 4K context on the same hardware, due to the quadratic scaling of attention computation. Generation speed (tokens per second) is not significantly affected, but the time to first token (TTFT) scales with input length.

Which local LLM handles RAG better than long context?

For document search and retrieval tasks, RAG (retrieval-augmented generation) is often more effective than feeding entire documents as context. RAG retrieves the 3-5 most relevant chunks from a large document set and provides only those to the model. This uses 4K-8K tokens of context and avoids the "lost in the middle" problem. Tools like GPT4All LocalDocs and LlamaIndex implement local RAG.

What is the KV cache and why does it grow with context length?

The KV cache (key-value cache) stores attention states for every token processed in the context window. Each token requires a fixed amount of memory for its key and value vectors -- so a 32K context requires 8× more KV cache memory than a 4K context. This is why a 7B model at Q4_K_M needs ~6 GB for 4K context but ~9 GB for 32K context. The model weights stay the same -- only the KV cache grows.

Can local models handle 1M token contexts like Gemini 3.1 Pro?

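Partly -- as of May 2026, open models such as Llama 4 Scout (up to 10M tokens), DeepSeek V4-Flash (1M) and Qwen 3.6 (256K, extendable to 1M via YaRN) can run locally with million-token windows, but memory is the constraint: Qwen 3.6 27B at Q4_K_M needs ~65+ GB at 1M tokens, and Llama 4 Scout needs 150+ GB for its full 10M context. Gemini 3.1 Pro's 1M token window runs on Google's TPU infrastructure. On hardware with less memory than that, shorter contexts (128K-256K), chunked processing, RAG, or a cloud API remain the practical options.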

What is the "lost in the middle" problem and how do I avoid it?

Research shows LLMs reliably retrieve information from the beginning and end of the context window, but miss details from the middle. For a 128K context, content placed at the 40K-80K token mark is most likely to be ignored. To avoid this: either keep important information at the start of the prompt, use RAG to retrieve only relevant chunks, or process long documents in overlapping 16K-32K sections.
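Overlapping sections can be planned with simple offset arithmetic. The sketch below is illustrative: `chunk_ranges` is a hypothetical helper that works on token offsets only, and the 32,000-token chunk size and 4,000-token overlap are example values, not recommendations from a benchmark.

```shell
# chunk_ranges TOTAL_TOKENS CHUNK_SIZE OVERLAP
# Prints one "start end" pair per chunk (end exclusive). Consecutive chunks
# share OVERLAP tokens, so text near a boundary appears in both chunks.
chunk_ranges() {
  total=$1; size=$2; overlap=$3
  step=$(( size - overlap ))
  if [ "$step" -le 0 ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$(( start + size ))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    echo "$start $end"
    if [ "$end" -eq "$total" ]; then break; fi
    start=$(( start + step ))
  done
}

chunk_ranges 100000 32000 4000
# prints:
# 0 32000
# 28000 60000
# 56000 88000
# 84000 100000
```

The overlap means a passage that straddles a chunk boundary is still seen whole by at least one call, at the cost of processing some tokens twice.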

How do I check what context length Ollama is using?

Run `ollama show <model>` -- the Parameters section of the output lists num_ctx when the model's Modelfile sets it. If no num_ctx appears there, Ollama is using its default (2048 unless overridden), not the model's full context window. To change it persistently, create a Modelfile with PARAMETER num_ctx 32768 and run ollama create <name> -f Modelfile. Check active sessions with ollama ps.
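The check can also be scripted. The sketch below is a minimal example: `modelfile_num_ctx` is a hypothetical helper that reads Modelfile text on stdin and prints the num_ctx value if one is set. It is shown against inline text so that it runs without Ollama installed.

```shell
# modelfile_num_ctx
# Reads Modelfile text on stdin and prints the value of any
# "PARAMETER num_ctx N" line. Prints nothing when num_ctx is not set.
modelfile_num_ctx() {
  awk 'toupper($1) == "PARAMETER" && $2 == "num_ctx" { print $3 }'
}

# Example with inline Modelfile text:
printf 'FROM llama3.1:8b\nPARAMETER num_ctx 32768\n' | modelfile_num_ctx   # prints 32768

# Against a real model (requires a local Ollama install):
#   ollama show --modelfile llama3.1-32k | modelfile_num_ctx
```

An empty result means the model sets no num_ctx of its own, so whatever default the server applies is in effect.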

Is long context or RAG better for document question-answering?

RAG is usually more effective and RAM-efficient than long context for document Q&A. RAG retrieves 3-5 relevant chunks (4K-8K tokens total) from a large corpus and avoids the "lost in the middle" problem. Long context is better when the model needs to understand the entire document structure or when exact ordering and relationships between sections matter. For most practical document Q&A, start with RAG.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
