Introduction: The Agentic Token Budget Problem
Agentic AI systems are fundamentally different from single-turn interactions. When you send a one-shot prompt and receive a response, token consumption is straightforward: input tokens plus output tokens. In an agentic loop, the same context must be re-sent on every iteration, tool results are injected back into the conversation, and a single task can accumulate thousands of tokens per iteration before any real work gets done.
Understanding exactly where tokens are consumed in a plan-execute-verify cycle is the prerequisite to every optimization technique in this module. You cannot optimize what you have not measured, and you cannot measure what you do not understand architecturally.
This topic tears open an agentic loop and accounts for every token category. By the end, you will be able to look at any agentic system — whether built on LangGraph, CrewAI, AutoGen, or Claude Code — and identify the biggest sources of waste.
The Three Phases of an Agentic Loop and Their Token Costs
Every agentic loop, regardless of framework, cycles through three logical phases: Plan, Execute, and Verify. Each phase has a distinct token signature.
Phase 1: Plan
The planning phase is where the agent receives a goal and produces a structured plan or next-action decision. Token consumption here includes:
- System prompt: Injected on every call. This is often the most overlooked cost. A 2,000-token system prompt multiplied by 20 iterations equals 40,000 tokens before the agent has done any real work.
- Accumulated conversation history: Every prior assistant turn, every prior user message, every prior tool result — all re-sent.
- Goal description: The original task, possibly re-stated or summarized.
- Available tools schema: In APIs like OpenAI's Assistants and Anthropic's Messages API, the tools array is serialized into the prompt. Ten well-documented tools can add 1,500–3,000 tokens per call.
A typical planning call breakdown for iteration N:
System prompt: ~2,000 tokens
Prior conversation ((N-1) completed iterations × ~4 messages × avg 500 tokens/message ≈ 2,000 tokens/iteration): variable
Original goal: ~300 tokens
Tools schema: ~2,000 tokens
────────────────────────────────────
Iteration 3 planning input: ~8,300 tokens (4,300 fixed + 2 × 2,000 history)
Iteration 10 planning input: ~22,300 tokens (4,300 fixed + 9 × 2,000 history)
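A quick sanity check on these numbers: a minimal sketch that models planning input as fixed overhead plus per-iteration history growth. The 2,000-tokens-per-iteration growth figure is the illustrative assumption from the breakdown above, not a measured value.

def planning_input_tokens(iteration: int,
                          system_prompt: int = 2_000,
                          goal: int = 300,
                          tool_schemas: int = 2_000,
                          history_per_iteration: int = 2_000) -> int:
    # Fixed overhead is re-sent on every call; history grows by roughly
    # one plan output + tool call + tool result per completed iteration.
    fixed = system_prompt + goal + tool_schemas
    return fixed + (iteration - 1) * history_per_iteration

print(planning_input_tokens(3))   # 8300
print(planning_input_tokens(10))  # 22300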
Phase 2: Execute
The execution phase is where the agent invokes a tool or performs a subtask. Token consumption here includes:
- Tool call output (from the LLM): Usually compact — a JSON-formatted tool invocation.
- Tool result injection: The raw output of the tool is injected back as a new message. This is frequently the largest single source of token explosion. A file-reading tool that returns 3,000 lines of code, a web search that returns full page HTML, or a database query that returns 500 rows — all of these bloat the context for every subsequent iteration.
Example: Agent reads a 500-line Python file
Tool result injection: ~2,500 tokens
Injected at iteration 4, then re-sent as input on iterations 5 through 20
Cost: 2,500 × 16 remaining iterations = 40,000 extra tokens
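The true cost of an injected result is therefore its size multiplied by the number of iterations it survives in context. A minimal helper for this back-of-envelope accounting (all constants are illustrative):

def tool_result_residual_cost(result_tokens: int,
                              injected_at: int,
                              total_iterations: int) -> int:
    # Counts only the re-sends after injection, i.e. the "extra tokens"
    # figure above, not the one-time cost of the result itself.
    remaining = total_iterations - injected_at
    return result_tokens * remaining

# The 500-line-file example: ~2,500 tokens injected at iteration 4 of 20.
print(tool_result_residual_cost(2_500, injected_at=4, total_iterations=20))  # 40000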
Phase 3: Verify
The verification phase is where the agent checks whether the executed action achieved the goal. Token costs here include:
- Re-evaluation of full accumulated context: By this point, every prior plan, execution result, and verification is in context.
- Self-reflection output: The agent's reasoning about whether the task is done. This output becomes input for the next planning phase.
Tip: Build a token accounting spreadsheet for your agentic system before optimizing. Log: (iteration number, system prompt tokens, history tokens, tool schema tokens, new tool result tokens, assistant output tokens). Run your agent on a benchmark task and fill in the spreadsheet. You will almost always find that either the system prompt or tool results dominate — not the actual reasoning output.
Token Accumulation: The Compounding Growth Problem
The most dangerous property of agentic loops is that token consumption is not linear — it is roughly quadratic if no pruning is applied. This is because each iteration appends new content to the context, and that entire context is re-sent on the next iteration.
Iteration 1: I₁ tokens (baseline)
Iteration 2: I₁ + Δ₂ tokens
Iteration 3: I₁ + Δ₂ + Δ₃ tokens
...
Iteration N: I₁ + Σ(Δᵢ for i=2..N) tokens
Total tokens across N iterations:
= N×I₁ + (N-1)×Δ₂ + (N-2)×Δ₃ + ... + 1×Δₙ
For a typical agentic coding task with N=15 iterations and I₁=5,000, Δ_avg=800 per iteration:
Total tokens ≈ 15×5,000 + (14+13+...+1)×800
            ≈ 75,000 + 105×800
            ≈ 75,000 + 84,000
≈ 159,000 tokens
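The same arithmetic in executable form, using the illustrative numbers above:

def total_loop_tokens(n_iterations: int, baseline: int, delta: int) -> int:
    # Each iteration pays for the full context so far, then appends delta;
    # summing over all iterations reproduces the triangular growth above.
    total, context = 0, baseline
    for _ in range(n_iterations):
        total += context
        context += delta
    return total

print(total_loop_tokens(15, 5_000, 800))  # 159000 (vs. the naive 15 × 5,000 = 75,000)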
Without accumulation awareness, engineers budget for "about 15 calls × 5,000 tokens = 75,000 tokens" and are shocked when the bill is double.
Tip: Set a token counter on your agentic framework's LLM call wrapper. Log cumulative tokens after every iteration and graph them. Most teams discover that 60–70% of their total tokens are consumed in the last 30% of iterations, when context is largest. This makes the case for pruning strategies (covered in later topics) crystal clear.
Anatomy of a Real Agentic Call: A LangGraph Example
Here is a concrete LangGraph node that makes a planning call. The comment annotations show where each token category lives:
from typing import TypedDict

from langgraph.graph import StateGraph
from langchain_core.messages import BaseMessage, SystemMessage, HumanMessage

class AgentState(TypedDict):
    original_goal: str
    messages: list[BaseMessage]  # full history: assistant turns, tool calls, tool results

def planning_node(state: AgentState) -> AgentState:
    messages = [
        # CATEGORY 1: System prompt — re-sent every iteration
        SystemMessage(content=SYSTEM_PROMPT),  # ~2,000 tokens
        # CATEGORY 2: Original goal — re-sent every iteration
        HumanMessage(content=state["original_goal"]),  # ~300 tokens
        # CATEGORY 3: Full conversation history — grows every iteration
        *state["messages"],  # GROWS UNBOUNDED
        # CATEGORY 4: New user input (if any)
        # (none in plan phase, handled separately)
    ]
    response = llm.invoke(
        messages,
        tools=TOOL_SCHEMAS,  # CATEGORY 5: Tool schemas — re-sent every iteration
    )
    state["messages"].append(response)
    return state
In this structure, state["messages"] is the ticking clock. Every tool result appended to this list will be re-paid on every future iteration.
Where CrewAI and AutoGen Differ
In CrewAI, each agent maintains its own memory object, but inter-agent communication is serialized into message payloads. When one agent's output becomes another agent's input, the full output string is typically included verbatim. This makes verbose agent outputs especially expensive.
In AutoGen, the GroupChat and two-agent chat patterns re-send the entire conversation history to each participant on every round. In a 5-agent group chat with 20 rounds, each message is eventually re-sent to all 5 agents, making the total token cost roughly 5× the naive single-conversation estimate.
Tip: In CrewAI, set verbose=False on agents during production runs and use max_iter to cap loop depth. In AutoGen, consider using SocietyOfMindAgent or custom summarizer hooks to compress history between rounds rather than accumulating raw messages.
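A sketch of the CrewAI side of that tip (parameter names reflect recent CrewAI releases; verify against your installed version):

from crewai import Agent

analyst = Agent(
    role="Analyst",
    goal="Summarize findings in under 200 words",  # terse outputs are cheaper for downstream agents
    backstory="A concise, methodical researcher.",
    verbose=False,  # no verbose trace output in production
    max_iter=10,    # cap loop depth so a stuck agent cannot accumulate context forever
)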
Tool Schema Costs: The Hidden Baseline Tax
Every tool available to the agent must be described to the model on every call. The description, parameter names, types, and docstrings all count as input tokens.
Here is an example of measuring tool schema token costs:
import json

import tiktoken

def measure_tool_schema_cost(tools: list[dict]) -> int:
    # Approximate the schemas' prompt footprint by tokenizing their JSON
    # serialization. Providers serialize tool schemas slightly differently,
    # so treat this as an estimate rather than an exact count.
    enc = tiktoken.encoding_for_model("gpt-4o")
    return len(enc.encode(json.dumps(tools)))
The implication: Tool schemas are a fixed per-iteration tax. Reducing the number of available tools (by scoping them to the current task phase) directly reduces this tax. A tool that is not relevant in the planning phase should not be in the schema during the planning call.
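One way to apply this scoping, sketched with hypothetical phase names and toy schemas (a real registry would hold full JSON tool definitions):

# Toy schemas standing in for full JSON tool definitions.
SEARCH = {"name": "search", "description": "Web search", "parameters": {}}
READ = {"name": "read_file", "description": "Read a file", "parameters": {}}
WRITE = {"name": "write_file", "description": "Write a file", "parameters": {}}
TESTS = {"name": "run_tests", "description": "Run the test suite", "parameters": {}}

# Only the tools relevant to the current phase get serialized into the prompt.
TOOLS_BY_PHASE = {
    "plan": [SEARCH, READ],
    "execute": [READ, WRITE, TESTS],
    "verify": [TESTS],
}

def tools_for(phase: str) -> list[dict]:
    return TOOLS_BY_PHASE[phase]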
Tip: Audit your tool schema sizes with a token counter before your first production deployment. Tools with long enum value lists, extensive examples in their docstrings, or nested object schemas are the biggest offenders. Consider abbreviating parameter descriptions in production schemas while keeping verbose versions only in development/documentation.
Measuring Token Distribution in Your Own System
Before you can optimize, you need instrumentation. Here is a minimal token accounting middleware you can wrap around any LLM call in your agentic system:
import tiktoken
from dataclasses import dataclass, field

@dataclass
class TokenLedger:
    system_prompt_tokens: list[int] = field(default_factory=list)
    history_tokens: list[int] = field(default_factory=list)
    tool_schema_tokens: list[int] = field(default_factory=list)
    tool_result_tokens: list[int] = field(default_factory=list)
    output_tokens: list[int] = field(default_factory=list)

    def record(self, category: str, count: int):
        getattr(self, f"{category}_tokens").append(count)

    def report(self):
        for i in range(len(self.system_prompt_tokens)):
            total = (
                self.system_prompt_tokens[i]
                + self.history_tokens[i]
                + self.tool_schema_tokens[i]
                + self.tool_result_tokens[i]
                + self.output_tokens[i]
            )
            print(
                f"Iter {i+1}: sys={self.system_prompt_tokens[i]} "
                f"hist={self.history_tokens[i]} "
                f"tools={self.tool_schema_tokens[i]} "
                f"results={self.tool_result_tokens[i]} "
                f"out={self.output_tokens[i]} "
                f"TOTAL={total}"
            )

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))
Run this instrumentation for 10 representative tasks, average the per-category proportions, and you have a data-driven optimization target. Typically you will find:
| Category | Typical Share |
|---|---|
| System prompt | 10–25% |
| Accumulated history | 35–60% |
| Tool schemas | 10–20% |
| Tool results | 15–30% |
| Model output | 5–15% |
The history and tool results together usually account for 50–80% of total tokens. This is where optimization effort pays off most.
Tip: Build this instrumentation as a decorator or middleware layer that wraps your LLM client once, rather than instrumenting each agent node individually. A single wrapper applied at the LLM call level captures all token flows regardless of which node or agent is making the call.
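A minimal sketch of such a wrapper, assuming an OpenAI-style client whose responses carry usage.prompt_tokens and usage.completion_tokens (adapt the attribute names to your client):

import functools

cumulative = {"input": 0, "output": 0}

def track_tokens(llm_call):
    # Wrap the single LLM entry point once; every node or agent that calls
    # through it is captured automatically, with no per-node instrumentation.
    @functools.wraps(llm_call)
    def wrapper(*args, **kwargs):
        response = llm_call(*args, **kwargs)
        usage = response.usage  # OpenAI-style usage object (assumption)
        cumulative["input"] += usage.prompt_tokens
        cumulative["output"] += usage.completion_tokens
        print(f"cumulative in={cumulative['input']} out={cumulative['output']}")
        return response
    return wrapper

# Example (hypothetical OpenAI v1 client): wrap the call site once.
# tracked_create = track_tokens(client.chat.completions.create)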
Key Takeaways
- Agentic loops have five token categories: system prompt, conversation history, tool schemas, tool results, and model output.
- History and tool result injection are the dominant and growing costs — they compound quadratically without pruning.
- Tool schemas represent a fixed per-iteration tax that can be reduced by scoping available tools to the current phase.
- Instrumentation before optimization is not optional — you must measure token distribution by category to know where to focus.
- The same architectural patterns appear across LangGraph, CrewAI, AutoGen, and OpenAI Assistants — the categories are universal, only the API surface differs.
The remaining topics in this module address each major cost category with concrete reduction strategies.