Understanding how context is organized and prioritized is the foundation of token efficiency. When you interact with an AI agent, you are not sending a single message — you are sending a carefully assembled stack of information layered in a specific order. Each layer has a different purpose, a different lifetime, and a different token cost. Mastering this hierarchy means you stop paying for context you do not need and start investing tokens where they deliver the highest return.
The Three Layers of Context
Every LLM interaction is built from three distinct context layers that sit on top of each other within the context window.
System prompts occupy the top of the hierarchy. They define who the agent is, what it is allowed to do, the format it should follow, the tools it has access to, and any non-negotiable constraints. System prompts are processed on every single request — they are the standing order that never changes mid-session unless you deliberately rebuild them. In Claude API terms, this is the system parameter. In the OpenAI API this is the {"role": "system", ...} message. In LangChain, LlamaIndex, and similar orchestration frameworks, it is the prompt template prefix. The system prompt is the most powerful context layer because the model treats it as authoritative instruction, but it is also the most expensive per-session because it is repeated in full on every call.
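A minimal sketch of where the system layer sits in each SDK (assuming the official Python clients; the model names and prompt text are placeholders):

```python
# Where the system layer lives, assuming the official Python SDKs.
# Model names and prompt text are placeholders.
from anthropic import Anthropic
from openai import OpenAI

SYSTEM_PROMPT = "You are a senior backend engineer. Follow the project style guide."

# Anthropic: the system prompt is a top-level parameter, separate from messages.
anthropic_client = Anthropic()
claude_response = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model name
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "Review this function for bugs."}],
)

# OpenAI: the system prompt is the first message in the messages list.
openai_client = OpenAI()
gpt_response = openai_client.chat.completions.create(
    model="gpt-4o",                     # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Review this function for bugs."},
    ],
)
```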
Persistent context is the layer that survives across multiple turns of a conversation but is not permanently baked into the system prompt. It includes things like: the user's stated preferences gathered during onboarding, the current sprint goal in a development session, the agreed-upon architectural decisions from earlier in the session, a code map of the repository, or a product requirements document. Persistent context typically sits in the conversation history as early assistant or user messages, or it is injected by the orchestration layer at the beginning of each conversation. In frameworks like LangChain this maps to memory objects; in Claude Code this maps to CLAUDE.md files; in OpenAI Assistants API this maps to thread-level context.
Ephemeral context is the layer that lives only for the duration of the current request and is immediately discarded. This includes the user's current question, the output from a tool call, a retrieved code snippet, or a log excerpt being analyzed. Ephemeral context is the most transient and ideally the leanest layer. It contains only what is necessary to answer the immediate question.
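Put together, the three layers map onto a single request roughly like this. The sketch below is schematic: the block contents are placeholders, and the exact injection mechanism depends on your orchestration layer.

```python
# Schematic assembly of one request from the three layers.
# All strings here are illustrative placeholders.

system_prompt = (          # system layer: repeated verbatim on every call
    "You are a coding agent for the PaymentService repo. "
    "Never modify files outside src/ without asking."
)

persistent_context = (     # persistent layer: survives across turns
    "Project: PaymentService. Stack: Node.js 20, TypeScript 5, PostgreSQL 15.\n"
    "Session goal: implement the Stripe webhook handler."
)

ephemeral_context = (      # ephemeral layer: only for this request
    "Error: unique constraint violation on webhook_events insert.\n"
    "Relevant function: handleStripeEvent() in src/webhooks/stripe.ts"
)

request = {
    "system": system_prompt,
    "messages": [
        # Persistent context injected as an early turn by the orchestration layer.
        {"role": "user", "content": persistent_context},
        {"role": "assistant", "content": "Understood."},
        # ...intervening conversation history...
        # Ephemeral context travels with the current question, at the end.
        {"role": "user", "content": ephemeral_context + "\n\nWhy is the insert failing?"},
    ],
}
```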
The cost equation looks like this:
Total tokens per call = system_prompt_tokens + persistent_context_tokens + ephemeral_context_tokens + output_tokens
If your system prompt is 800 tokens, your persistent context is 2,000 tokens, and your ephemeral context averages 400 tokens per turn, you are paying 3,200 tokens before the model writes a single character of output. In a 50-turn session, the system prompt alone costs 40,000 tokens — roughly 30,000 words, the length of a novella.
Tip: Audit your three context layers separately. Write them out side by side and calculate their token counts independently using a tokenizer like tiktoken (OpenAI) or Anthropic's token counting API. Most teams are shocked to discover that system prompts and persistent context together represent 70–85% of their total token spend, yet are rarely reviewed after initial setup.
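One way to run that audit with tiktoken (a sketch; the layer contents are placeholders, and for Claude you would call the token counting endpoint instead):

```python
# Count each context layer independently, assuming tiktoken is installed.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# Placeholder layer contents; in practice, read these from your prompt files.
layers = {
    "system_prompt": "You are a coding agent for the PaymentService repo...",
    "persistent_context": "Project: PaymentService. Stack: Node.js 20, TypeScript 5...",
    "ephemeral_context": "Error on webhook_events insert. Why is this failing?",
}

for name, text in layers.items():
    print(f"{name}: {len(encoding.encode(text))} tokens")
```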
How the Model Processes the Context Hierarchy
The model does not treat all context equally. Research and production experience consistently show that information at the beginning and end of the context window is attended to most strongly, while information in the middle of a long context tends to receive less attention — the "lost in the middle" effect documented by Liu et al. (2023).
This attention pattern has direct implications for how you should structure your context hierarchy:
- Place the most critical, non-negotiable instructions in the system prompt (beginning of context). The model treats these as authoritative constraints.
- Place the most immediately relevant content at the end of the context (the most recent user turn and the retrieved ephemeral context). The model applies strongest attention to its immediate task context.
- Avoid placing critical information only in the middle of long conversation histories. Important constraints or facts buried in turn 12 of a 30-turn conversation will be less reliably followed than the same information in the system prompt.
When building agentic systems with Claude Code, the model reads CLAUDE.md files at different directory levels. The root CLAUDE.md establishes the top-level persistent context; subdirectory CLAUDE.md files provide scoped persistent context for specific areas of the codebase. This mirrors the hierarchy principle: broader context higher in the hierarchy, narrower context closer to the task.
Root CLAUDE.md → persistent context for whole project
src/api/CLAUDE.md → persistent context scoped to the API module
src/api/routes/CLAUDE.md → persistent context for routing logic only
Tip: When debugging why an AI agent is ignoring an instruction, check which layer of the hierarchy that instruction lives in. Instructions buried in the middle of persistent context are far less reliable than instructions in the system prompt. Move non-negotiable constraints to the system layer.
The Token Cost of Each Layer Over a Session Lifetime
The hierarchy is not just a conceptual model — it is a token accounting framework. Consider a practical agentic coding session:
Session: 30 API calls, building a REST API endpoint
| Layer | Tokens per call | Total across 30 calls |
|---|---|---|
| System prompt | 650 | 19,500 |
| Persistent context (CLAUDE.md + repo map) | 1,800 | 54,000 |
| Ephemeral context (current files + question) | 600 avg | 18,000 |
| Output | 400 avg | 12,000 |
| Total | 3,450 | 103,500 |
In this example, the system prompt and persistent context together account for 71% of all tokens in the session (and roughly 80% of input tokens). If you could cut these two layers by 40%, you would save nearly 30,000 tokens — the equivalent of removing eight to nine full turns of conversation.
Now consider what happens when persistent context grows over the session — as conversation history accumulates, prior tool outputs remain in context, and previously generated code is re-sent on every call. The persistent layer can balloon from 1,800 to 8,000+ tokens by turn 20. This is context drift, and it is addressed in depth in Module 5 (Context Compression and Summarization) and Module 6 (Multi-Turn Conversation Optimization). Understanding the hierarchy tells you where the problem is; later modules tell you how to fix it.
Tip: Use a token counting middleware layer in your orchestration code that logs the token count for each layer on every call. Many teams use a simple wrapper around their LLM client that calls the tokenizer and emits metrics to their observability stack (Datadog, Grafana, LangSmith, or Helicone). This telemetry is essential for identifying which layer is growing out of control.
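A minimal sketch of such a wrapper, assuming tiktoken for counting; emit_metric is a stand-in for whatever metrics client you actually use:

```python
# Hypothetical logging wrapper around an LLM call: counts each layer's tokens
# before the request goes out. emit_metric() stands in for your metrics client
# (Datadog, Grafana, LangSmith, Helicone, etc.).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

def emit_metric(name: str, value: int) -> None:
    print(f"metric {name}={value}")   # replace with your observability client

def call_llm_with_telemetry(client, system: str, persistent: str, ephemeral: str):
    emit_metric("context.system_tokens", count_tokens(system))
    emit_metric("context.persistent_tokens", count_tokens(persistent))
    emit_metric("context.ephemeral_tokens", count_tokens(ephemeral))
    return client.messages.create(
        model="claude-sonnet-4-20250514",        # placeholder model name
        max_tokens=1024,
        system=system,
        messages=[
            {"role": "user", "content": persistent},
            {"role": "assistant", "content": "Understood."},
            {"role": "user", "content": ephemeral},
        ],
    )
```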
Persistent Context Design Patterns
Designing persistent context well is an exercise in deciding what the agent always needs to know vs. what it only needs to know sometimes. Here are four patterns that work well in production:
Pattern 1: The Invariant Facts Block
Include only facts that are guaranteed to be true for the entire session — project name, tech stack, code style guide reference, language, framework version. Keep this under 200 tokens.
Project: PaymentService
Stack: Node.js 20, TypeScript 5, PostgreSQL 15, Prisma ORM
Style: ESLint + Prettier, follow existing patterns in codebase
Test framework: Vitest
Deployment: AWS Lambda via SST v3
Pattern 2: The Session Goal Block
Include the current session's objective in 2–4 sentences. This is ephemeral at the macro level (changes each session) but persistent within a session.
Session goal: Implement the subscription billing webhook handler for Stripe events.
Focus on: event idempotency, error handling, and replay safety.
Out of scope: frontend billing UI, refund logic.
Pattern 3: The Architectural Decisions Log
A compact list of key decisions made in earlier turns. Updated by appending new decisions as the session progresses, with old decisions occasionally summarized or pruned.
Decisions:
- Use Prisma transactions for webhook idempotency (not Redis locks)
- Webhook events stored in webhook_events table before processing
- Retry via AWS SQS dead-letter queue, not in-process
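One lightweight way to maintain this log in an orchestration layer is sketched below; the token budget and class shape are assumptions, not a standard pattern from any framework.

```python
# Hypothetical decisions log: append new decisions, prune the oldest ones
# once the rendered block grows past a token budget (threshold chosen arbitrarily).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

class DecisionsLog:
    def __init__(self, max_tokens: int = 300):
        self.decisions: list[str] = []
        self.max_tokens = max_tokens

    def append(self, decision: str) -> None:
        self.decisions.append(decision)
        # Drop the oldest entries once the rendered block exceeds the budget.
        while len(encoding.encode(self.render())) > self.max_tokens and len(self.decisions) > 1:
            self.decisions.pop(0)

    def render(self) -> str:
        return "Decisions:\n" + "\n".join(f"- {d}" for d in self.decisions)

log = DecisionsLog()
log.append("Use Prisma transactions for webhook idempotency (not Redis locks)")
log.append("Retry via AWS SQS dead-letter queue, not in-process")
```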
Pattern 4: Reference Pointers, Not Full Content
Instead of including full files, include a structured reference that tells the agent where to look:
Key files:
- src/webhooks/stripe.ts — main handler (read before modifying)
- src/db/schema.prisma — webhook_events table definition
- tests/webhooks/stripe.test.ts — existing test patterns to follow
Tip: Review your persistent context at the start of every major task type you have. Product managers working on user story writing have fundamentally different persistent context needs than engineers debugging a production incident. Build separate persistent context templates for each major workflow, and load only the relevant one.
Ephemeral Context: Keeping It Sharp
Ephemeral context is where most teams have the greatest room for improvement. The temptation is to include entire files, full stack traces, complete API responses — everything that might be relevant. This approach reliably produces worse results at higher cost because:
- The model has to attend to a larger context window, increasing the chance of attending to irrelevant content.
- More tokens means higher latency and cost.
- Relevant information is diluted by surrounding noise.
Effective ephemeral context design follows a principle of minimum sufficient information: include exactly what the model needs to answer the current question, and nothing else.
Before asking about a bug in a specific function:
Bloated ephemeral context (2,400 tokens):
The entire 400-line file containing the buggy function, plus the full stack trace, plus the complete error log from the last hour.
Lean ephemeral context (380 tokens):
The 35-line function with the bug, the specific error message (3 lines), and the single log line showing the input that triggered the failure.
The lean version provides the model with sharper signal and achieves better results in practice because there is less noise competing for attention.
Tip: Build ephemeral context assembly as an explicit step in your orchestration pipeline. Before calling the LLM, pass your candidate context through a "context trimmer" function that applies rules like: truncate stack traces to first 20 lines, extract only the relevant function from files over 100 lines, summarize API responses over 500 tokens. LangChain's document loaders with custom splitters and LlamaIndex's context selection tools both support this pattern.
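A sketch of such a trimmer is below. The thresholds mirror the rules in the tip, the helper names are hypothetical, and truncation stands in for the summarization step, which would normally call a cheaper model.

```python
# Hypothetical "context trimmer" applying simple rules before each LLM call.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def trim_stack_trace(trace: str, max_lines: int = 20) -> str:
    # Keep only the first max_lines lines of a stack trace.
    lines = trace.splitlines()
    suffix = "\n... (truncated)" if len(lines) > max_lines else ""
    return "\n".join(lines[:max_lines]) + suffix

def extract_function(file_text: str, start_line: int, end_line: int) -> str:
    # For long files, send only the relevant function's line range.
    lines = file_text.splitlines()
    if len(lines) <= 100:
        return file_text
    return "\n".join(lines[start_line - 1:end_line])

def trim_api_response(response_text: str, max_tokens: int = 500) -> str:
    # Truncation stands in here for summarizing oversized API responses.
    tokens = encoding.encode(response_text)
    if len(tokens) <= max_tokens:
        return response_text
    return encoding.decode(tokens[:max_tokens]) + "\n... (truncated)"
```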
Practical Hierarchy Mapping Exercise
To internalize the hierarchy, do this exercise on your current project:
- Open your current AI tool configuration (CLAUDE.md, .cursorrules, system prompt in your API integration, or the LangChain prompt template).
- Copy the full text into a tokenizer (use tiktoken with the cl100k_base encoding for most models, or Anthropic's /v1/messages/count_tokens endpoint for Claude).
- Label each paragraph as: system layer, persistent layer, or ephemeral layer.
- For each persistent layer item, ask: "Does the model need this on every single call, or just sometimes?"
- Move "just sometimes" content out of the persistent layer into a conditional injection pattern (covered in depth in Topic 5 of this module).
This exercise consistently surfaces 20–40% of persistent context that can be made conditional without any loss of quality, saving tokens on every call where that context is not needed.
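As a preview of that conditional injection pattern, the mechanical form is simple. The block names and task types below are illustrative only:

```python
# Illustrative conditional injection: load only the persistent blocks
# that the current task type actually needs. Names are placeholders.
PERSISTENT_BLOCKS = {
    "invariant_facts": "Project: PaymentService. Stack: Node.js 20, TypeScript 5...",
    "testing_guide": "Test framework: Vitest. Follow existing test patterns...",
    "deployment_notes": "Deployment: AWS Lambda via SST v3...",
}

# Which blocks each workflow needs on every call.
BLOCKS_BY_TASK = {
    "write_code": ["invariant_facts"],
    "write_tests": ["invariant_facts", "testing_guide"],
    "debug_deploy": ["invariant_facts", "deployment_notes"],
}

def build_persistent_context(task_type: str) -> str:
    return "\n\n".join(PERSISTENT_BLOCKS[name] for name in BLOCKS_BY_TASK[task_type])
```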
Tip: Use a spreadsheet to track your context hierarchy audit results. Columns: Layer, Content Description, Token Count, Required Every Call (Y/N), Action (Keep/Conditionalize/Remove). Share this with your team — context hierarchy decisions affect everyone who uses the agent, and collective ownership of the context design leads to dramatically better outcomes than having a single person manage it silently.