What Is a Context Window?
LLMs don't have long-term memory: they only "see" a sliding window of recent tokens, and anything outside that window is forgotten or compressed. This article explains what that means for your prompts and how to work within (and around) these limits.
Key Takeaways
- Context window = the maximum number of tokens a model can process at once; once you exceed it, older content is truncated or summarised
- One token ≈ 4 characters on average; a 4k context window ≈ 3,000 words of plain text
- Models don't "remember" previous chats; each interaction starts fresh within its context window
- Context overload increases hallucinations because the model fills gaps with plausible guesses when original details fall out of view
- Prompt structure matters more than luck: front-load critical instructions, avoid repetition, summarise long exchanges before moving forward
- For local LLMs, larger context windows demand more VRAM: a 7B parameter model with a 128k context window can require 32GB+ of VRAM
What Is a Context Window?
A context window is the maximum amount of text (measured in tokens) that an LLM can take into account when generating its next output.
Think of it as the model's "visible text" at any given moment. When you send a message to GPT-4o with a 128k token context window, the model can "see" the last 128,000 tokens of conversation, roughly 96,000 words. Anything before that point is invisible to the model and does not influence its response.
Tokens vs. words: A token is not a word. On average, one token ≈ 4 characters, or about 0.75 words. So a 4,000-token context window ≈ 3,000 words of plain English text. For dense code or languages like Japanese, the ratio is different: Japanese text can require roughly 2 tokens per word because of how its characters are tokenised.
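As a quick sketch, the rule of thumb above can be wrapped in an estimator. This is a heuristic only; real tokenizers (such as OpenAI's tiktoken) give exact, model-specific counts, and the ratio shifts for code and non-English text:

```python
def estimate_tokens(text: str) -> int:
    # Heuristic: ~4 characters per token for plain English text.
    # Use a real tokenizer when you need exact counts.
    return max(1, round(len(text) / 4))

def tokens_to_words(tokens: int) -> float:
    # ~0.75 words per token, per the same rule of thumb.
    return tokens * 0.75
```

Good enough for budgeting, not for billing.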
Context window sizes vary widely across models: roughly 4k–32k for many local and older models, 128k for GPT-4o, and 200k for current Claude models.
The principle is identical across all models: anything beyond the window is not visible.
Why AI "Forgets"
When the total tokens in a conversation (system prompt + chat history + user input + tools + expected output) exceed the context window, older parts are truncated, summarised, or dropped entirely.
This is not memory loss like human forgetfulness. The model is not "thinking and then forgetting." It literally does not see the truncated text; it no longer exists in the model's input space.
Common symptoms of hitting the context limit:
- The AI ignores or contradicts an instruction you gave 30 messages ago
- In a long creative story, the model forgets character names, details, or constraints you established earlier
- In a research chat spanning many turns, facts get mixed up or the model reinvents information
- The AI suddenly shifts tone or violates your original constraints without explanation
What's Actually Happening
Most chat interfaces use one of these strategies:
1. Drop oldest messages. The most recent N messages fit in the window; older ones are discarded entirely.
2. Summarise earlier conversation. The system compresses early messages into a brief recap ("Earlier, you discussed X, Y, Z…") to preserve context.
3. Pin system/developer prompts. The system message stays fixed while user messages rotate out.
All of these preserve the "gist" but lose specific details. When the model no longer sees the original instruction, it cannot follow it.
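Combining strategies 1 and 3, a minimal truncation loop might look like this (a sketch, not any vendor's actual implementation; `count_tokens` stands in for any token-counting function, such as the chars/4 heuristic):

```python
def truncate_history(system_prompt, messages, limit, count_tokens):
    # Pin the system prompt, then keep the most recent messages
    # that still fit in the token limit. Older messages are dropped.
    budget = limit - count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                        # everything older is discarded
        kept.append(msg)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```

Note how the oldest message silently disappears once the budget runs out; that is exactly the "forgetting" described above.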
Context Windows and Hallucinations
Context overload amplifies hallucinations because the model fills gaps with plausible guesses when the original information is no longer visible.
Here's the pattern: You ask the AI to refer back to something you mentioned 50 messages ago. But that message has rotated out of the context window. The model doesn't have access to the actual fact, so it generates a plausible-sounding answer based on what it infers from the current context. Result: fabrication.
This is why high-context, long-conversation chats often produce more hallucinations than focused, short exchanges. The model is not losing reasoning ability; it's working with incomplete information.
The interaction is direct: reduced context → missing grounding → increased hallucination risk.
This effect compounds with higher temperature and top-p settings, which already increase randomness. See Temperature and Top-P: Control AI Creativity for how parameter tuning interacts with hallucination.
How Prompt Design Helps You Stay Within the Window
Structuring your prompts strategically lets you accomplish more within a fixed context budget.
Front-load critical instructions. Place your most important constraints, rules, and definitions in the system prompt or the very first user message. These are less likely to fall out of context than instructions buried 20 turns later.
Avoid repetition. If you've already explained something once, don't paste it again. Instead, reference it: "As we discussed in the summary above…" This saves tokens.
Recap explicitly. Ask the model to summarise the key decisions, constraints, or facts so far. Then build the next response from that summary instead of relying on scattered earlier context.
Keep turns focused. A single, multi-topic monologue uses context inefficiently. Break it into separate, tightly scoped exchanges.
Working with Long Documents
Pasting entire books or multi-hundred-page PDFs into a single context window is inefficient, even for Claude Opus's 1M token window, because the model cannot focus effectively on many disparate sections at once.
A 1,000-page book ≈ 250,000 tokens. Technically, Claude Opus can ingest it. Practically, the model's reasoning degrades when asked to answer questions across vastly different sections; it's like asking a person to read an entire novel in one sitting and then recall specific details from pages 50, 200, and 400. The recollection becomes fuzzy.
Better approaches for long documents:
1. Process sections sequentially. Extract and analyse one chapter or section at a time. Ask focused questions per section: "What are the main conclusions in Section 3?" Then move to the next section.
2. Hierarchical summarisation. Extract key points from pages 1–10, then pages 11–20, then combine those summaries into a chapter-level summary. Then combine chapters into a document-level summary. This reduces the document to its essential facts while preserving relationships.
3. Structured extraction. Convert the document into tables, JSON, or bullet lists *before* asking higher-level questions. This compresses the information: instead of pasting 50 pages of product specs, extract the specs into a structured table, then ask questions about the table.
4. Use RAG (Retrieval-Augmented Generation). For truly large document sets (100+ pages), retrieval-based systems work better. See RAG Explained: How to Ground AI Answers in Real Data for how to retrieve relevant sections instead of loading everything at once.
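The hierarchical approach can be sketched as a bottom-up loop. Here `summarise` is a hypothetical LLM call (a list of texts in, one summary string out), not a real API:

```python
def hierarchical_summary(pages, summarise, group_size=10):
    # Summarise groups of pages, then summarise the summaries,
    # until a single document-level summary remains.
    level = pages
    while len(level) > 1:
        level = [summarise(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    return level[0]
```

Each pass shrinks the text by roughly a factor of `group_size`, so even a very long document collapses to one summary in a few rounds.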
How PromptQuorum Helps You Manage Context
Tested in PromptQuorum: 25 long-context research prompts dispatched to GPT-4o (128k) and Claude 4.6 Sonnet (200k). On prompts using 60k–120k tokens, Claude 4.6 Sonnet maintained factual accuracy on 23 of 25 tasks. GPT-4o accuracy dropped on 6 of 25 tasks when context exceeded 90k tokens. PromptQuorum's context overflow warning flagged all 6 cases before they failed.
From my experience building PromptQuorum, I've found that working near context limits is tricky because each model has different limits, truncation behaviour, pricing, and (for local LLMs) VRAM requirements. PromptQuorum helps you make this transparent and intentional.
Context Window Adjustment for Local LLMs
When you run a model in LM Studio or Ollama, you can configure the context window size. By default, tools often set it to the model's maximum (e.g., 32k for a 7B model). But that's rarely what you need.
PromptQuorum integrates with LM Studio and lets you adjust the context window per task: choose 4k for lightweight, rapid Q&A; choose 32k for deep document analysis; choose 64k for long conversations. This makes the trade-off explicit instead of hidden in config files.
Max usable tokens = context window − system prompt − output buffer
Example: 128,000 − 300 − 1,000 = 126,700 available tokens
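In code, that budget calculation is a one-liner:

```python
def usable_tokens(context_window: int, system_prompt: int, output_buffer: int) -> int:
    # Max usable tokens = context window - system prompt - output buffer
    return context_window - system_prompt - output_buffer
```

Anything you send (history plus new input) must fit inside this number.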
Automatic Context Overflow Checks
PromptQuorum checks *before* you send: Given the system prompt + current conversation history + your new input + expected output length, will this fit in the configured context window for each model?
If overflow is likely, PromptQuorum warns you or prompts you to trim/summarise the conversation before sending. No more surprise truncation. No more guessing why the AI "forgot."
Bad prompt: "Here is everything we discussed: [paste 5,000 words of chat history]. Now tell me what to do next."
Good prompt: "Summary of prior discussion: [200-word summary]. Based on this, what should I do next?"
Context Window vs. VRAM Trade-off
For local models, larger context windows demand substantially more VRAM, because the cache of past tokens grows roughly in proportion to context length. A 7B parameter model with a 4k context window needs ~14GB VRAM. The same model with a 128k context window needs 32GB+. Push it further and the GPU runs out of memory, crashes, or falls back to CPU inference (which is 10–100× slower).
PromptQuorum shows you this relationship: "This context window size will use ~28GB VRAM on your hardware. You have 16GB available." You can then right-size the context window for your task and hardware instead of discovering crashes mid-inference.
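For intuition, most of that growth comes from the KV cache, which scales linearly with context length. A rough sketch, assuming a Llama-2-style 7B architecture (32 layers, 32 KV heads, head dimension 128, fp16); these defaults are illustrative assumptions, and the model weights (~13GB in fp16) plus runtime overhead come on top:

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 32,
                 n_kv_heads: int = 32, head_dim: int = 128,
                 bytes_per_value: int = 2) -> float:
    # Keys and values are both cached, hence the factor of 2.
    # Defaults approximate a Llama-2-style 7B model in fp16 (assumed).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / (1024 ** 3)
```

With these assumptions, a 4k window adds ~2GB of cache on top of the weights and a 32k window ~16GB, in the same ballpark as the figures above. Grouped-query attention and quantised caches reduce this considerably, so treat it as an upper-bound estimate.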
Multi-Model Awareness
When you dispatch one prompt to GPT-4o (128k window), Claude (200k window), and a local 7B model (your chosen 32k window), PromptQuorum automatically keeps your prompt within all three bounds. One prompt, multiple models, no manual rewriting.
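A minimal version of such a multi-model check might look like this (the model names and window sizes are just the examples from this article, and token counts come from whatever estimator you use):

```python
def overflowing_models(prompt_tokens: int, output_buffer: int,
                       windows: dict[str, int]) -> list[str]:
    # Return the models whose context window the prompt would exceed.
    return [name for name, size in windows.items()
            if prompt_tokens + output_buffer > size]

# Example window sizes from this article:
windows = {"gpt-4o": 128_000, "claude": 200_000, "local-7b": 32_000}
```

If the returned list is non-empty, trim or summarise before sending; the prompt must fit the smallest window among the targets.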
Practical Recipes for Context Management
Recipe 1: Long Chat About One Project
Goal: Maintain a multi-turn conversation about a single project without losing earlier decisions.
1. In your system prompt, embed the project's key constraints (scope, audience, tone, technical limits) once. Don't repeat them.
2. After every 10–15 exchanges, ask the model to summarise the current state: "What are the 5 most important decisions we've made so far?"
3. Use that summary as your next turn's context instead of relying on scattered earlier messages.
4. In PromptQuorum, set a context window of 32k–64k and enable overflow warnings so you know when to summarise.
Recipe 2: Analysing a Long Report
Goal: Extract insights from a 50–100 page document.
1. Break the document into 3–5 sections (chapters, parts).
2. For each section, write a focused prompt: "Summarise the key findings from this section in 5 bullet points."
3. Collect those section summaries.
4. In a final turn, ask: "Given these section summaries, what is the overall conclusion?"
5. You've stayed well within context limits and avoided the "lost in a book" problem.
Recipe 3: Prompting at the Edge of the Context Window
Goal: Use nearly the full context window without overflow.
1. Calculate your budget: context window size − system prompt tokens − expected output tokens = available tokens for your input + history.
2. Example: 128k window, 200-token system prompt, 1k output buffer = 126.8k available tokens.
3. Before sending, check in PromptQuorum: "How many tokens does this input take?"
4. If close to the limit, trim the oldest turn or summarise it before continuing.
5. This keeps you operating intentionally near the limit, not randomly hitting it.
Recipe 4: Local LLM with Limited VRAM
Goal: Run a local model effectively without crashes.
1. Start with a conservative context window (8k–16k) for your model's VRAM.
2. In PromptQuorum's settings, note the VRAM requirement at that window size.
3. Run your task. If you hit overflow, summarise the conversation and restart from the summary.
4. If you never approach the limit, slowly increase the context window and re-test.
5. Find your model's "right-sized" context window for your hardware and tasks.
Common Mistakes with Context Windows
- "The model remembers all my previous chats." It doesn't. Each new conversation starts with zero context from past chats. Even within one chat, once your exchange exceeds the context window, it's gone.
- "I'll just paste the same long context on every turn." This wastes tokens and doesn't help; the model still can't reason over 300 pages effectively. Instead, summarise and reference the summary.
- "I'll mix five different projects in one long conversation." Each project competes for tokens. When context fills, details get truncated. Use separate conversations per project.
- "The AI is bad at reasoning β must be temperature or top-p." Maybe. But first, check context window. If the model no longer sees the original constraint, it's not a parameter problem; it's missing information.
- "I'll max out the context window on my local LLM." Then you run out of VRAM, the process crashes, and inference falls back to slow CPU mode. Set context to match your hardware instead.
- "The app warned me about overflow, but I sent it anyway." Trust the warning. Overflow leads to silent truncation, hidden hallucinations, and wasted tokens. Summarise first.
FAQ
Does the model remember my previous chats?
No. Each new conversation session starts with zero history. The model only sees tokens within the current context window. If you want to reference a previous chat, you must copy relevant parts into the current conversation.
Why did the AI ignore an instruction I gave 20 messages ago?
That instruction likely fell out of the context window. The model no longer sees it, so it can't follow it. Solution: Repeat critical instructions in your system prompt or ask the model to recap and re-embed the instruction mid-conversation.
Is a bigger context window always better?
No. A larger window lets you include more content, but it also increases cost (more tokens to process) and, for local models, VRAM usage. Choose a context window that matches your task: 4k for simple Q&A, 32k for long conversations, 128k+ for document analysis. Bigger is not "better"; *appropriate* is better.
How do I know when I've hit the context limit?
The model's responses shift tone, contradict earlier instructions, or lose track of details you set earlier. Use PromptQuorum's context overflow check before sending; it warns you if you're approaching the limit.
How does context window size affect VRAM for local models?
Larger context windows use more VRAM: the cache of past tokens grows roughly linearly with context length, on top of the fixed cost of the model weights. A 7B model at 4k context ≈ 14GB VRAM; at 32k context ≈ 28GB VRAM. Check PromptQuorum's VRAM calculator to know your hardware's ceiling.
Can tools like PromptQuorum prevent context overflow?
Yes. PromptQuorum checks your prompt's token count, your configured context window, and your model's actual limit, then warns you before you send if overflow is likely. You can then trim or summarise before continuing.
Do different models handle long context differently?
Yes. Claude 4.6 Sonnet maintains focus across 200k tokens well. GPT-4o is solid at 128k. Smaller models (e.g., Llama 3.1 8B) sometimes lose reasoning coherence beyond 8k–16k, even if their context window is technically larger. The safest approach: test your specific model and task.
Sources
- OpenAI, 2024. "API reference: Models and context windows". Official documentation on token limits and pricing per model.
- Anthropic, 2024. "Claude model context windows and token costs". Claude's 200k context window and the March 2026 Opus 4.6 1M context announcement.
- Raffel et al., 2020. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Foundational research on context length effects in transformers.