Key Takeaways
- Llama 4 Scout (MoE) supports up to 10M tokens. DeepSeek V4-Flash supports 1M tokens, and Qwen 3.6 supports 256K tokens natively (extendable to 1M via YaRN). May 2026 marks the first generation of million-token-capable open models.
- Practical context by model size: 7B-8B models maintain quality at 16K-32K tokens. 70B+ models and MoE models extend this to 256K-1M tokens. Llama 4 Scout can handle full million-token contexts on sufficient VRAM.
- RAM scales with context length AND model size. Qwen 3.6 27B at Q4_K_M needs ~22 GB at 128K, ~65+ GB at 1M tokens. Llama 4 Scout needs 150+ GB for full 10M context.
- Lost in the Middle still applies: LLMs miss details from middle sections of the context. Mitigation: keep critical info at prompt start, use RAG for search, or process in overlapping chunks.
- Long context excels for holistic analysis of complete documents (codebases, contracts, books). RAG excels for search-heavy tasks across many documents. Choose by task type, not context size alone.
- Ollama defaults to 2048 tokens -- not 128K or 1M. Set num_ctx explicitly in a Modelfile to access full context. For massive contexts (500K+), tune attention implementation to avoid OOM.
What Is Context Length and Why Does It Matter for Local LLMs?
Context length is the maximum number of tokens a model can process in a single inference call -- the combined size of the input (your document, conversation history, system prompt) and the output (the model's response). One token ≈ 0.75 words in English; 128K tokens ≈ 96,000 words.
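The words-to-tokens rule of thumb above can be turned into a quick estimator. This is a minimal sketch: the 0.75 ratio comes from the text, the function names are illustrative, and real counts depend on the tokenizer and the language.

```python
# Rule-of-thumb ratio for English from the text; actual ratios vary by tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens an English text of `word_count` words uses."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Estimate how many English words fit in `token_count` tokens."""
    return round(token_count * WORDS_PER_TOKEN)
```

For example, `estimate_words(128_000)` gives 96,000, matching the figure above. For an exact count, run the model's own tokenizer over the text instead.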
For local LLM use cases, long context enables: summarizing entire books or long reports, analyzing full codebases in one prompt, processing hours of meeting transcripts, and maintaining long conversation histories without losing earlier context.
The key distinction is between the advertised context length (what the model architecture supports) and the practical context length (where quality stays reliable). A model may technically support 128K tokens but show degraded quality on information presented at the 100K token mark.
Which Local LLMs Support 128K Token Context in 2026?
| Model | Context Window | Practical Limit | Ollama Command |
|---|---|---|---|
| Llama 3.1 8B | 128K | ~32K reliable | ollama run llama3.1:8b |
| Llama 3.2 3B | 128K | ~16K reliable | ollama run llama3.2:3b |
| Llama 3.3 70B | 128K | ~64K reliable | ollama run llama3.3:70b |
| Qwen2.5 7B | 128K | ~32K reliable | ollama run qwen2.5:7b |
| Qwen2.5 72B | 128K | ~64K reliable | ollama run qwen2.5:72b |
| Mistral Small 3.1 24B | 128K | ~32K reliable | ollama run mistral-small3.1 |
| Gemma 2 2B | 8K | ~6K reliable | ollama run gemma2:2b |
| Mistral 7B v0.3 | 32K | ~16K reliable | ollama run mistral:7b |
How Much RAM Does Long Context Processing Require?
RAM usage scales with both model size and context length. The KV cache (key-value cache) stores attention states for all processed tokens -- this grows linearly with context length.
As of April 2026, a 7B model at Q4_K_M with 4K context uses ~6 GB RAM. The same model with 32K context uses ~8-9 GB RAM. With 128K context: ~12-16 GB RAM.
| Model | 4K Context | 32K Context | 128K Context |
|---|---|---|---|
| Llama 3.1 8B Q4_K_M | ~6 GB | ~9 GB | ~14 GB |
| Qwen2.5 14B Q4_K_M | ~9 GB | ~12 GB | ~18 GB |
| Mistral Small 3.1 24B Q4_K_M | ~14 GB | ~17 GB | ~24 GB |
| Llama 3.3 70B Q4_K_M | ~40 GB | ~45 GB | ~55 GB |
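The linear growth of the KV cache can be estimated from a model's architecture. The sketch below uses the standard per-token formula (one key and one value vector per layer per KV head). The layer count, KV-head count, and head dimension are assumed values for an 8B Llama-style model with grouped-query attention, and the bytes-per-value settings are illustrative, not what Ollama necessarily uses.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float) -> float:
    """Approximate KV cache size in bytes for `tokens` cached positions."""
    # Factor 2: every layer stores one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture for an 8B Llama-style model (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4_096, 32_768, 131_072):
    gib_16bit = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 2) / 2**30
    # 1 byte per value approximates an 8-bit cache, ignoring quantization overhead.
    gib_8bit = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 1) / 2**30
    print(f"{ctx:>7} tokens: {gib_16bit:.2f} GiB (16-bit), {gib_8bit:.2f} GiB (8-bit)")
```

Under these assumptions a 131,072-token cache comes to 16 GiB at 16-bit and 8 GiB at 8-bit. The table's totals sit closer to the 8-bit figure, so treat the table as approximate: actual usage depends on the cache precision and the runtime's own overhead.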
Why Is Practical Context Length Shorter Than the Advertised Maximum?
LLMs trained with RoPE positional encodings (used by Llama, Qwen, Mistral) can technically process tokens up to their maximum context length, but quality degrades in a known pattern called the "lost in the middle" effect.
Research shows that language models are best at using information at the beginning and end of the context window. Information placed in the middle of a very long context is retrieved less reliably. In practice, this means a model with a 128K context window may reliably answer questions about content in the first 32K tokens and last 16K tokens, but miss details from the 40K-80K token range.
For local models specifically, the practical reliable limit scales with model size: 3B models ≈ 8K-16K reliable; 7B-8B models ≈ 16K-32K reliable; 70B models ≈ 64K reliable. These are approximate -- the actual limit depends on the specific task and how "important" the retrieved information is.
Long context windows enable more input, but prompt structure determines whether the model uses that context effectively. Techniques like RAG, prompt chaining, and context window management strategies are covered in the prompt engineering guide.
How Do You Set Context Length in Ollama?
For a deep dive into why models lose track of earlier input and strategies to mitigate it, see context windows explained: why AI forgets.
Ollama defaults to 2048 tokens of context unless configured otherwise. To use a model's full context window:
# Set context length at runtime, inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 32768
# Or create a custom model with a Modelfile
cat << EOF > Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 32768
EOF
ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k

Long Context Local LLMs: Regional Context
EU / GDPR + AI Act: The EU AI Act (in force since August 2024, with its first obligations applying from February 2025) classifies AI systems processing personal data at scale as potentially high-risk. Long-context local inference for legal document analysis, medical record summarization, or HR document processing sits in this risk tier. Running locally eliminates the third-party data processor risk under GDPR Article 28 -- no data leaves the organization.
For German BSI compliance on AI systems processing sensitive documents locally: the recommended configuration is a 7B model at Q4_K_M with 32K context (fits in 9-10 GB RAM on a standard workstation). This provides reliable quality on documents up to 50 pages while keeping all data on-premises. Llama 3.1 8B and Mistral Small 3.1 are the recommended EU compliance choices for long-context document processing.
For French CNIL guidelines on AI and personal data: local inference via Ollama with no external API calls satisfies the requirement that personal data not be processed by third-party AI providers without a valid legal basis.
Japan (METI): Japanese documents require 1.5-2× more tokens than equivalent English documents due to tokenizer differences. A 50-page Japanese report may consume 25K-35K tokens -- within the reliable range of Qwen2.5 7B (32K practical limit) but requiring explicit context configuration in Ollama: PARAMETER num_ctx 32768. For Japanese legal and financial documents, Qwen2.5 14B at Q4_K_M with 32K context (~12 GB RAM) provides the best quality-per-RAM for Japanese long-context processing. Qwen2.5's native Japanese tokenizer processes Japanese text 30-40% more efficiently than Llama.
China: Under China's Data Security Law (数据安全法), processing sensitive documents through cloud APIs requires additional regulatory compliance. Local long-context inference via Qwen2.5 (Alibaba) keeps all document content on-premises. For Chinese enterprise document processing, Qwen2.5 72B with 32K context on a local workstation (~45 GB RAM) provides near-cloud quality at full data sovereignty. Qwen2.5's native Chinese tokenizer makes it 30-40% more token-efficient than Llama for Chinese-language documents.
Common Mistakes with Long Context Local LLMs
- Assuming 128K context works as well as 4K: The "lost in the middle" effect means information presented 30K-80K tokens ago is retrieved less reliably than information at the start or end. For critical document analysis, chunk long documents into 16K-32K sections and process each separately rather than feeding an entire 100K document at once.
- Not increasing Ollama's default context size: Ollama defaults to 2048 tokens of context regardless of the model's maximum. A conversation exceeding 2048 tokens will truncate earlier messages. Always set num_ctx explicitly: add PARAMETER num_ctx 32768 to your Modelfile or run /set parameter num_ctx 32768 inside an ollama run session.
- Running long context on insufficient RAM: A 7B model with 128K context on 8 GB RAM total causes severe swap usage. Model weights (~4.5 GB) plus 128K KV cache (~8+ GB) exceed 8 GB. Reduce context to 32K (fits ~9 GB) or use 16+ GB RAM for 128K context inference.
- Forgetting that generation speed is not the only latency factor at long context: At 32K context, the time-to-first-token (TTFT) can be 5-15 seconds on consumer hardware -- the model must process all 32K input tokens before generating a single output token. This prefill phase grows at least linearly with context length, and its attention component grows quadratically. For interactive use, limit context to 8K-16K. Reserve 32K+ contexts for batch processing where TTFT is acceptable.
- Using RAG when long context is the correct tool (and vice versa): RAG is better for document search across many documents. Long context is better when you need the model to reason over a complete, coherent document -- a contract, a codebase, a book chapter -- where missing any part would break the analysis. Splitting a 10-page legal contract into RAG chunks can cause cross-reference errors that long context avoids. Choose by task type, not by default preference.
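The RAM arithmetic in the third mistake above can be checked before loading a model. This is a rough sketch: the function name and the 1 GB default headroom for the OS and runtime are assumptions, and the weight and KV cache figures are the approximate ones quoted in that bullet.

```python
def fits_in_ram(weights_gb: float, kv_cache_gb: float, total_ram_gb: float,
                headroom_gb: float = 1.0) -> bool:
    """Return True if weights + KV cache + headroom fit in total RAM.

    headroom_gb is an assumed allowance for the OS and runtime, not a measured value.
    """
    return weights_gb + kv_cache_gb + headroom_gb <= total_ram_gb


# 7B model at 128K context: ~4.5 GB weights plus ~8 GB KV cache (figures from the text).
print(fits_in_ram(4.5, 8.0, total_ram_gb=8.0))
print(fits_in_ram(4.5, 8.0, total_ram_gb=16.0))
```

The first call reports that the configuration does not fit in 8 GB, the second that it fits in 16 GB. A check like this only catches clear overcommitment; measure real usage with your own model and settings before relying on it.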
FAQ
Can I summarize an entire book with a local LLM?
A typical 300-page book is 90,000-120,000 words -- approximately 120K-160K tokens. This exceeds the practical reliable context of most 7B models (16K-32K) and also the ~64K reliable range of a 70B model, so chunked processing is the dependable approach at any model size. Split the book into 20K-word chapters (about 27K tokens each), summarize each, then summarize the chapter summaries.
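The summarize-then-summarize-the-summaries approach can be sketched as a two-pass routine. The `summarize` callable is a placeholder for whatever sends text to your local model, and `split_by_words` and `summarize_book` are illustrative names. The sketch splits on word count rather than on real chapter boundaries.

```python
from typing import Callable, List


def split_by_words(text: str, words_per_chunk: int) -> List[str]:
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def summarize_book(text: str, summarize: Callable[[str], str],
                   words_per_chunk: int = 20_000) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    chunks = split_by_words(text, words_per_chunk)
    if not chunks:
        return ""
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    if len(chunk_summaries) == 1:
        return chunk_summaries[0]  # a single chunk needs no second pass
    return summarize("\n\n".join(chunk_summaries))
```

A 100,000-word book yields five chunks, so the model is called six times: five chunk summaries plus one final pass. Splitting on actual chapter boundaries usually produces more coherent summaries than fixed word counts.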
How many pages of text fit in 32K tokens?
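By the 0.75 words-per-token rule, 32K tokens is about 24,000 English words: roughly 95 pages at 250 words per page, or 50-70 denser pages at 350-500 words per page. Leave part of that budget for the prompt and the model's response. A 32K token context holds a novella, a full research paper with appendices, or a complete technical specification document.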
Does increasing context length slow down inference?
Yes -- processing a 32K context takes approximately 3-4× longer than processing a 4K context on the same hardware, due to the quadratic scaling of attention computation. Generation speed (tokens per second) is not significantly affected, but the time to first token (TTFT) scales with input length.
Which local LLM handles RAG better than long context?
For document search and retrieval tasks, RAG (retrieval-augmented generation) is often more effective than feeding entire documents as context. RAG retrieves the 3-5 most relevant chunks from a large document set and provides only those to the model. This uses 4K-8K tokens of context and avoids the "lost in the middle" problem. Tools like GPT4All LocalDocs and LlamaIndex implement local RAG.
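The retrieve-then-prompt flow described above can be sketched in a few lines. To stay dependency-free, this toy version ranks chunks by word-overlap cosine similarity, which is a stand-in for the embedding-based search that tools like LlamaIndex use. All function names and the prompt wording are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def _vector(text: str) -> Counter:
    """Bag-of-words vector: lowercase alphanumeric word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    q = _vector(query)
    return sorted(chunks, key=lambda c: _cosine(q, _vector(c)), reverse=True)[:top_k]


def build_prompt(query: str, chunks: List[str], top_k: int = 3) -> str:
    """Assemble a prompt containing only the retrieved chunks."""
    context = "\n\n".join(retrieve(query, chunks, top_k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
```

Only the retrieved chunks reach the model, so the prompt stays at a few thousand tokens regardless of corpus size. Word overlap misses paraphrases and synonyms, which is why production RAG relies on embedding models.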
What is the KV cache and why does it grow with context length?
The KV cache (key-value cache) stores attention states for every token processed in the context window. Each token requires a fixed amount of memory for its key and value vectors -- so a 32K context requires 8× the KV cache memory of a 4K context. This is why a 7B model at Q4_K_M needs ~6 GB for 4K context but ~9 GB for 32K context. The model weights stay the same -- only the KV cache grows.
Can local models handle 1M token contexts like Gemini 3.1 Pro?
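On paper, yes; on most consumer hardware, not in practice. Open models such as Llama 4 Scout (up to 10M tokens) and DeepSeek V4-Flash (1M tokens) advertise million-token windows, but memory is the limit: Qwen 3.6 27B at Q4_K_M needs ~65+ GB at 1M tokens, and Llama 4 Scout needs 150+ GB for its full 10M context. Gemini 3.1 Pro's 1M token window runs on Google's TPU infrastructure. On typical consumer machines, 128K remains the realistic ceiling, and for 1M+ token tasks without workstation-class memory, cloud APIs remain the practical option.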
What is the "lost in the middle" problem and how do I avoid it?
Research shows LLMs reliably retrieve information from the beginning and end of the context window, but miss details from the middle. For a 128K context, content placed at the 40K-80K token mark is most likely to be ignored. To avoid this: either keep important information at the start of the prompt, use RAG to retrieve only relevant chunks, or process long documents in overlapping 16K-32K sections.
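Processing a document in overlapping sections can be sketched as a sliding window over its tokens. This sketch assumes the text is already tokenized (a whitespace split is a rough stand-in), and the function name and sizes are illustrative.

```python
from typing import List, Sequence


def overlapping_chunks(tokens: Sequence, chunk_size: int, overlap: int) -> List[Sequence]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the next."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With `chunk_size=16_000` and `overlap=1_000`, a 100K-token document becomes seven windows, each sharing its last 1,000 tokens with the next. The overlap keeps a sentence or clause that straddles a boundary intact in at least one window.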
How do I check what context length Ollama is using?
Run `ollama show <model>` -- the output lists the parameters including num_ctx. If it shows 2048, Ollama is using the default, not the model's full context window. To change it persistently, create a Modelfile with PARAMETER num_ctx 32768 and run ollama create <name> -f Modelfile. Check active sessions with ollama ps.
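The context length can also be set per request through Ollama's local REST API by passing num_ctx in the request's options. The sketch below assumes the server is listening on its default address; the helper names are illustrative, and the model tag is only an example.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming generate request with an explicit context length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 32768) -> str:
    """Send the request to a running Ollama server and return the response text."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example (requires a running Ollama server with the model pulled):
# print(generate("llama3.1:8b", "Summarize this document: ...", num_ctx=32768))
```

Setting num_ctx per request is convenient for scripts that mix short and long prompts, while a Modelfile suits a model you always run with the same window.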
Is long context or RAG better for document question-answering?
RAG is usually more effective and RAM-efficient than long context for document Q&A. RAG retrieves 3-5 relevant chunks (4K-8K tokens total) from a large corpus and avoids the "lost in the middle" problem. Long context is better when the model needs to understand the entire document structure or when exact ordering and relationships between sections matter. For most practical document Q&A, start with RAG.
Sources
- Lost in the Middle: How Language Models Use Long Contexts -- Liu et al., 2023
- Ollama Context Length Configuration -- Ollama documentation
- Llama 3.1 Technical Report -- Meta AI, 2024
- EU AI Act Official Text -- European Parliament, 2024