The context window is the single most important constraint in LLM system design. It defines the maximum amount of information a model can hold simultaneously, but raw capacity numbers — "200k tokens!" — are deeply misleading. A model that can technically hold 200,000 tokens does not reliably use all of them equally. This topic explains the mechanics of context windows, why large windows create false confidence, and how to design systems that work with the model's actual attention patterns rather than against them.
What the Context Window Actually Is
The context window is the total number of tokens — input and output combined — that a model can process in a single forward pass. Every token in the window is available for the model's attention mechanism to reference when generating each output token.
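Because input and output share the same window, the output budget you request directly shrinks the input you can send. A minimal sketch of that accounting (the window size and output budget here are illustrative, not tied to any specific model):

CONTEXT_WINDOW = 200_000   # e.g. a 200k-token model (illustrative)
OUTPUT_BUDGET = 4_096      # tokens reserved for the model's response

def max_input_tokens(window: int = CONTEXT_WINDOW,
                     output_budget: int = OUTPUT_BUDGET) -> int:
    """Largest input that still leaves room for the requested output."""
    return window - output_budget

print(max_input_tokens())  # 195904 tokens available for prompt content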
Current context window sizes (May 2026):
| Model | Context Window | Practical Effective Window |
|---|---|---|
| GPT-4o | 128,000 tokens | ~80,000 tokens reliably |
| GPT-4o mini | 128,000 tokens | ~80,000 tokens reliably |
| Claude Opus 4 | 200,000 tokens | ~150,000 tokens reliably |
| Claude Sonnet 4 | 200,000 tokens | ~150,000 tokens reliably |
| Claude Haiku 3.5 | 200,000 tokens | ~150,000 tokens reliably |
| Gemini 2.0 Flash | 1,048,576 tokens (1M) | Variable; degrades beyond ~500k |
| Gemini 2.5 Pro | 1,048,576 tokens (1M) | Best-in-class long-context performance |
| Llama 3.1 70B | 128,000 tokens | ~60,000 tokens reliably |
The "practical effective window" is the range where research consistently shows high retrieval accuracy (>90%) on needle-in-a-haystack benchmarks. Beyond this range, models still technically function but show measurable degradation on information retrieval tasks.
How the context window fills up in practice:
Context window: 200,000 tokens
System prompt: [████░░░░░░░░░░░░░░░░] 1,500 tokens
Tool definitions (×12): [████████░░░░░░░░░░░░] 3,200 tokens
Conversation history (20 turns): [████████████░░░░░░░░] 8,400 tokens
Retrieved codebase context: [████████████████████] 45,000 tokens
Current task / user request: [██░░░░░░░░░░░░░░░░░░] 800 tokens
───────────────────────────────────────────────────────
Total used: 58,900 tokens (29.5%)
Remaining: 141,100 tokens (70.5%)
At first glance, 70% of the context window is empty — plenty of room. But this view is deceptive: the model's ability to attend to the existing 58,900 tokens is already non-uniform, and adding more content will push critical information toward the "lost" zone.
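To reproduce this kind of breakdown for your own prompts, count each component separately. A sketch using tiktoken's cl100k_base encoding as a rough, model-agnostic approximation (the component variables are placeholders for your own prompt pieces):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # rough proxy, not any vendor's exact tokenizer

# Placeholders: supply your own strings for each component
components = {
    "system_prompt": system_prompt,
    "tool_definitions": tool_defs_json,
    "conversation_history": history_text,
    "retrieved_context": retrieved_text,
    "user_request": user_request,
}
window = 200_000

used = {name: len(enc.encode(text)) for name, text in components.items()}
total = sum(used.values())
for name, n in used.items():
    print(f"{name:>22}: {n:>7,} tokens")
print(f"{'total':>22}: {total:>7,} tokens ({total / window:.1%} of window)")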
Tip: Do not think of the context window as a uniform storage container. Think of it as a stage with spotlight and shadow zones. Content at the very beginning (start of the prompt) and very end (most recent content) is reliably in the spotlight. Content in the middle is progressively shadowed. Your task is to engineer which content lands in the spotlight.
The "Lost in the Middle" Problem — Research and Reality
The "Lost in the Middle" problem was formally documented in a 2023 Stanford research paper by Liu et al. and has been replicated many times since. The finding: language model performance on retrieval tasks degrades significantly for content positioned in the middle of a long context, even when that content is explicitly present and relevant.
The attention U-curve:
Retrieval accuracy
100% │ ██                          ██
     │  ███                      ███
     │    ███                  ███
 70% │      ███              ███
     │        ████        ████
     │           ██████████
 40% │            ████████
     │
     └─────────────────────────────────
       Start                       End
              Position in context
Models pay the most attention to:
1. The very beginning of the context (primacy effect — the model's first pass "primes" attention)
2. The very end of the context (recency effect — nearest to output generation)
Models pay the least attention to:
1. Content in the middle 40–60% of a long context
2. Content that is structurally similar to surrounding content (the model has trouble differentiating it)
This effect persists even in modern long-context models. While Claude, GPT-4o, and Gemini have all improved vs. the 2023 baseline, the fundamental U-curve remains. Practical benchmarks from engineering teams consistently show that:
- A relevant code file buried in position 40 of 60 files is retrieved accurately ~65% of the time
- The same file placed last in the context is retrieved accurately ~95% of the time
- The same file placed first is retrieved accurately ~90% of the time
Testing the "lost in the middle" effect on your own system:
import anthropic
def needle_in_haystack_test(
needle: str,
haystack_documents: list[str],
needle_positions: list[float], # 0.0 = first, 1.0 = last
question: str,
correct_answer: str,
) -> dict:
"""
Test how well a model retrieves specific information at different context positions.
"""
client = anthropic.Anthropic()
results = {}
for position in needle_positions:
docs = haystack_documents.copy()
insert_idx = int(position * len(docs))
docs.insert(insert_idx, needle)
# Build context
context = "\n\n---\n\n".join(
f"Document {i+1}:\n{doc}" for i, doc in enumerate(docs)
)
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=200,
messages=[{
"role": "user",
"content": f"{context}\n\n---\n\nQuestion: {question}\n"
f"Answer based only on the documents above."
}]
)
answer = response.content[0].text
is_correct = correct_answer.lower() in answer.lower()
results[f"position_{position:.0%}"] = {
"correct": is_correct,
"answer": answer[:100],
"input_tokens": response.usage.input_tokens
}
return results
needle = "Document: API Configuration\nThe rate limit for the search endpoint is 47 requests per second."
haystack = [f"Document {i}: Generic service documentation paragraph {i}..." for i in range(20)]
question = "What is the rate limit for the search endpoint?"
correct = "47"
results = needle_in_haystack_test(
needle, haystack, [0.0, 0.25, 0.5, 0.75, 1.0], question, correct
)
for pos, result in results.items():
status = "PASS" if result["correct"] else "FAIL"
print(f"{pos}: {status} — {result['answer'][:60]}")
Tip: Run needle-in-a-haystack tests on your actual system prompts and retrieved contexts before deploying to production. Many teams discover that their agent is silently missing critical instructions buried in a 5,000-token system prompt. If a critical instruction fails the test at any middle position, move it to the very beginning or very end of the system prompt — do not rely on the model "finding it" in the middle.
Context Window Architecture — Structural Design Patterns
Given the U-curve attention pattern, the architecture of what goes where in the context window is a critical engineering decision.
The optimal context structure:
[POSITION 1 — Spotlight Zone: System/Primacy]
├── Core identity and task definition (most critical instructions)
├── Output format requirements (what the model MUST produce)
└── Hard constraints and guardrails (what the model MUST NOT do)
[POSITION 2 — Middle Zone: Supporting Content]
├── Background information (still important but less attention-sensitive)
├── Historical conversation context
├── Retrieved documents (sorted by relevance — least relevant first)
└── Tool definitions (they are read, then largely ignored during generation)
[POSITION 3 — Spotlight Zone: Recency]
├── Most relevant retrieved document(s) — place your BEST content here
├── Most recent conversation turns
└── The actual user request / current task
Why this ordering works:
- Critical instructions at position 1 benefit from primacy attention
- The actual question at the very end benefits from recency attention
- The model is most likely to produce output that directly responds to the last content it processes
- Less critical supporting context in the middle is still available but does not need perfect retrieval
Implementing position-aware context assembly:
from typing import NamedTuple
class ContextBlock(NamedTuple):
content: str
priority: int # 1 = primacy zone, 2 = middle, 3 = recency zone
label: str
def assemble_context(blocks: list[ContextBlock]) -> str:
"""
Assemble context blocks in optimal attention order:
- Priority 1 blocks first (primacy zone)
- Priority 2 blocks in middle (sorted by relevance score, least relevant first)
- Priority 3 blocks last (recency zone)
"""
primacy = [b for b in blocks if b.priority == 1]
middle = [b for b in blocks if b.priority == 2]
recency = [b for b in blocks if b.priority == 3]
ordered = primacy + middle + recency
return "\n\n".join(f"<!-- {b.label} -->\n{b.content}" for b in ordered)
blocks = [
ContextBlock(
content="You are a senior security engineer. Review code for vulnerabilities. "
"ALWAYS check for: SQL injection, XSS, insecure deserialization, "
"hardcoded secrets. Return findings as structured JSON.",
priority=1,
label="SYSTEM_INSTRUCTIONS"
),
ContextBlock(
content="[Project coding standards — 800 tokens]",
priority=2,
label="CODING_STANDARDS"
),
ContextBlock(
content="[Related files context — low relevance — 1200 tokens]",
priority=2,
label="RELATED_FILES_LOW"
),
ContextBlock(
content="[Related files context — high relevance — 800 tokens]",
priority=3,
label="RELATED_FILES_HIGH"
),
ContextBlock(
content="[The PR diff being reviewed — 2000 tokens]",
priority=3,
label="CURRENT_PR_DIFF"
),
ContextBlock(
content="Review the above PR diff for security vulnerabilities.",
priority=3,
label="USER_REQUEST"
),
]
context = assemble_context(blocks)
Tip: When using RAG (retrieval-augmented generation) in your agentic system, place the highest-relevance retrieved documents last — immediately before the user query. This counteracts the lost-in-the-middle effect and can improve retrieval accuracy by 20–30% with zero additional tokens.
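A minimal sketch of that ordering, assuming each retrieved chunk arrives with a relevance score (the scoring itself is up to your retriever):

def order_for_recency(chunks: list[tuple[float, str]]) -> list[str]:
    """Sort ascending by score so the highest-relevance chunk lands last,
    immediately before the user query (the recency spotlight zone)."""
    return [text for _score, text in sorted(chunks, key=lambda c: c[0])]

chunks = [(0.91, "[most relevant doc]"), (0.42, "[weak match]"), (0.77, "[decent match]")]
context_docs = order_for_recency(chunks)  # weak match first, best match last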
Context Window Limits and Graceful Degradation
Context windows overflow. When your accumulated history, retrieved context, and current request exceed the window, you need a strategy — not a crash.
Detection: Know before you hit the limit
import tiktoken
from typing import Callable
class ContextWindowManager:
def __init__(
self,
model: str = "claude-sonnet-4-5",
max_tokens: int = 200_000,
safety_margin_pct: float = 0.10, # Reserve 10% for output
on_approaching_limit: Callable | None = None
):
self.model = model
self.max_tokens = max_tokens
self.safety_margin = int(max_tokens * safety_margin_pct)
self.effective_limit = max_tokens - self.safety_margin
self.on_approaching_limit = on_approaching_limit
        self._enc = tiktoken.get_encoding("cl100k_base")  # OpenAI encoding; a rough proxy for Claude token counts
def count(self, text: str) -> int:
return len(self._enc.encode(text))
def check_and_trim(
self,
system: str,
messages: list[dict],
tools_tokens: int = 0
) -> list[dict]:
"""
Check if messages fit in context. If not, trim oldest messages
(preserving the first exchange as anchoring context).
Returns trimmed message list.
"""
system_tokens = self.count(system)
fixed_overhead = system_tokens + tools_tokens
available = self.effective_limit - fixed_overhead
        if available < 1000:
            raise ValueError(
                f"System prompt + tools ({fixed_overhead} tokens) consume most of the context. "
                "Consider shortening the system prompt or loading tools dynamically."
            )
        # Count all message tokens (+4 per message as rough per-message overhead)
        messages = list(messages)
        total = sum(self.count(m.get("content", "")) + 4 for m in messages)

        # Alert once if approaching the limit
        if self.on_approaching_limit and total > available * 0.8:
            self.on_approaching_limit(total, available)

        while messages and total > available:
            # Trim: remove the 3rd and 4th messages, preserving the
            # first exchange (messages 0 and 1) as anchoring context
            if len(messages) > 4:
                messages.pop(2)
                messages.pop(2)  # now index 2 again (was index 3)
                total = sum(self.count(m.get("content", "")) + 4 for m in messages)
            else:
                # Can't trim further without losing critical context.
                # Summarize instead (hook for summarization module)
                print("Warning: Context approaching limit. Consider summarization.")
                break
        return messages
def utilization_report(
self,
system: str,
messages: list[dict],
tools_tokens: int = 0
) -> dict:
system_tokens = self.count(system)
msg_tokens = sum(self.count(m.get("content", "")) + 4 for m in messages)
total = system_tokens + msg_tokens + tools_tokens
return {
"total_tokens": total,
"max_tokens": self.max_tokens,
"utilization_pct": total / self.max_tokens * 100,
"remaining_tokens": self.max_tokens - total,
"breakdown": {
"system": system_tokens,
"messages": msg_tokens,
"tools": tools_tokens,
}
}
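Wiring the manager into a call loop might look like this (SYSTEM_PROMPT and history are placeholders for your own prompt and message list):

def alert(total: int, available: int) -> None:
    print(f"Context at {total:,}/{available:,} tokens; trimming imminent.")

manager = ContextWindowManager(on_approaching_limit=alert)

# SYSTEM_PROMPT and history are placeholders for your own state
history = manager.check_and_trim(system=SYSTEM_PROMPT, messages=history, tools_tokens=3_200)
report = manager.utilization_report(system=SYSTEM_PROMPT, messages=history, tools_tokens=3_200)
print(f"{report['utilization_pct']:.1f}% of the window used, "
      f"{report['remaining_tokens']:,} tokens remaining")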
Tip: Do not wait until you hit the context limit to think about what to drop. Design your context trimming strategy upfront. The most common approaches are: (1) sliding window — drop oldest messages first; (2) summarize-and-compress — replace old messages with a summary; (3) checkpoint-and-restart — save state and begin a new session. Each has different trade-offs, covered in depth in Module 5.
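The trimming loop above implements approach (1). A sketch of approach (2), summarize-and-compress, which trades an extra model call for retained meaning (the summary prompt and the keep_recent budget are illustrative choices):

def compress_history(client, messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Replace all but the most recent turns with a model-written summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m.get('content', '')}" for m in old)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, preserving decisions, "
                       f"constraints, and open questions:\n\n{transcript}"
        }]
    )
    summary = response.content[0].text
    return [
        {"role": "user", "content": f"[Summary of earlier conversation]\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing from that summary."},
    ] + recent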
Attention Concentration — The Quality Implications of Context Size
Beyond the lost-in-the-middle problem, there is a subtler issue: attention dilution. As the context grows, the model's attention is spread across more tokens. This means:
- Instructions get diluted — A 200-word instruction in a 2,000-token context has high attention weight. The same instruction in a 50,000-token context has lower weight relative to other content.
- Contradictions amplify — In long contexts, the model may encounter contradictory information and fail to resolve it correctly. Shorter contexts with cleaner information are less susceptible to this.
- Format compliance degrades — Models are less reliable about following output format instructions when those instructions are buried deep in a long context. This is especially problematic for agentic systems that parse structured output.
Measuring format compliance degradation:
A simple experiment: send the same output format instruction at different positions in a growing context and measure how often the model produces correctly structured JSON.
| Context size | Instructions at START | Instructions at END |
|---|---|---|
| 2,000 tokens | 98% compliance | 97% compliance |
| 10,000 tokens | 95% compliance | 96% compliance |
| 30,000 tokens | 88% compliance | 94% compliance |
| 80,000 tokens | 79% compliance | 91% compliance |
| 150,000 tokens | 71% compliance | 89% compliance |
This pattern shows why placing output format requirements at the end of the prompt (recency zone) maintains higher compliance as context grows.
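A compact sketch of this experiment (assumptions: format_instruction carries both the task and the required JSON shape, filler is padding text scaled to the context size under test, and compliance is simply "the reply parses as JSON"):

import json

def compliance_rate(client, filler: str, format_instruction: str,
                    at_end: bool, trials: int = 5) -> float:
    """Measure how often the reply parses as valid JSON when the same
    format instruction sits at the start vs. the end of the context."""
    ok = 0
    for _ in range(trials):
        parts = ([filler, format_instruction] if at_end
                 else [format_instruction, filler])
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            messages=[{"role": "user", "content": "\n\n".join(parts)}]
        )
        try:
            json.loads(response.content[0].text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / trials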
Mitigation strategies:
- Use response prefilling (Claude) or response format (OpenAI) to hard-constrain output structure at the API level, rather than relying on prompt instructions alone
- Repeat critical format instructions immediately before the user's request (recency zone)
- Validate output structure with a schema checker and re-prompt if validation fails
import json
from pydantic import BaseModel, ValidationError
class ReviewOutput(BaseModel):
severity: str
issues: list[str]
recommendation: str
def call_with_retry(client, prompt: str, max_retries: int = 3) -> ReviewOutput:
for attempt in range(max_retries):
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": prompt + "\n\nRespond ONLY with a JSON object matching: "
'{"severity": "P0|P1|P2|P3", "issues": [...], "recommendation": "..."}'
}]
)
try:
data = json.loads(response.content[0].text)
return ReviewOutput(**data)
except (json.JSONDecodeError, ValidationError) as e:
if attempt == max_retries - 1:
raise
print(f"Attempt {attempt+1} failed validation: {e}. Retrying...")
raise RuntimeError("Failed to get valid structured output after retries")
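Prefilling (the first mitigation above) can be layered on top of this validation loop. With Claude, you seed the assistant turn so the reply must continue as JSON rather than opening with prose:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt},
        # Prefill: the model continues from the opening brace,
        # so the response starts mid-JSON instead of with preamble.
        {"role": "assistant", "content": "{"},
    ]
)
raw = "{" + response.content[0].text  # re-attach the prefilled brace before parsing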
Tip: For QA engineers building test automation on top of LLM agents: always implement output schema validation before any downstream processing. A model that silently returns malformed JSON because the format instruction was lost in the middle of a long context will cause hard-to-diagnose failures in your pipeline. Treat structural output validation as a mandatory quality gate, not optional defensive programming.
Context Window Benchmarking — Measuring Your Model's Effective Range
Before you rely on a model's stated context window for production use, benchmark its actual performance on your task type. The "NIAH" (Needle in a Haystack) benchmark is the standard approach.
Running a NIAH benchmark against your production context patterns:
import anthropic
import random
from datetime import datetime
def run_niah_benchmark(
model: str,
context_sizes: list[int], # in tokens, e.g. [5000, 20000, 50000, 100000]
positions: list[float], # e.g. [0.1, 0.25, 0.5, 0.75, 0.9]
trials_per_condition: int = 5
) -> dict:
"""
Benchmark retrieval accuracy across context sizes and positions.
"""
client = anthropic.Anthropic()
results = {}
    # Generate filler content (word count below serves as a rough token estimate)
filler_unit = "This is a placeholder document containing technical information. " * 20
for ctx_size in context_sizes:
results[ctx_size] = {}
for position in positions:
correct = 0
for trial in range(trials_per_condition):
# Create unique needle
secret = f"SECRET_VALUE_{random.randint(10000, 99999)}"
needle = f"\n\nCRITICAL_CONFIG: The authentication token is {secret}\n\n"
# Build filler to target context size
filler_tokens = ctx_size - 50 # Leave room for needle + question
filler = (filler_unit * (filler_tokens // len(filler_unit.split()) + 1))
filler_words = filler.split()
# Insert needle at position
insert_at = int(position * len(filler_words))
words_with_needle = (
filler_words[:insert_at]
+ needle.split()
+ filler_words[insert_at:]
)
full_context = " ".join(words_with_needle)
response = client.messages.create(
model=model,
max_tokens=50,
messages=[{
"role": "user",
"content": f"{full_context}\n\nWhat is the authentication token "
f"from CRITICAL_CONFIG? Reply with only the token value."
}]
)
answer = response.content[0].text.strip()
if secret in answer:
correct += 1
accuracy = correct / trials_per_condition
results[ctx_size][position] = accuracy
print(f" {ctx_size:>8} tokens, position {position:.0%}: {accuracy:.0%} accuracy")
return results
print(f"NIAH Benchmark — {datetime.now().strftime('%Y-%m-%d')}")
benchmark_results = run_niah_benchmark(
model="claude-sonnet-4-5",
context_sizes=[5_000, 20_000, 50_000, 100_000],
positions=[0.1, 0.5, 0.9],
trials_per_condition=3
)
Tip: Run this benchmark using your actual production context types — not generic lorem ipsum. If your agent processes code, use real code files as filler. If it processes support tickets, use real ticket text. The model's retrieval performance varies by content type because its training data distribution affects attention patterns. A benchmark on representative content will reveal whether your specific use case is affected by lost-in-the-middle degradation.
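For example, if your agent works over a codebase, a small loader like this (the root path and suffix are placeholders for your own repo layout) swaps real source files in as filler:

from pathlib import Path

def load_code_filler(root: str = "src/", suffix: str = ".py") -> list[str]:
    """Collect real source files to use as NIAH filler documents."""
    return [p.read_text(encoding="utf-8") for p in Path(root).rglob(f"*{suffix}")]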