Single-shot LLM calls are expensive but predictable. Agentic loops are expensive in a fundamentally different way: costs compound with every iteration, and without deliberate design, a task that should cost $0.05 ends up costing $2.00 because the agent re-read the same files twelve times, re-appended its full history to every request, and generated redundant intermediate summaries nobody asked for. This topic dissects the token cost anatomy of agentic loops and provides the engineering patterns needed to control those costs.
The Anatomy of an Agentic Loop — Where Every Token Goes
An agentic loop follows a Plan → Execute → Observe → Plan cycle. Each iteration of this cycle incurs a full token bill: the entire accumulated context must be re-sent on every API call, and the model generates both reasoning and action tokens before producing any observable result.
A single agentic loop iteration — token breakdown:
Iteration N of a code debugging agent
──────────────────────────────────────────────────────────────────
Input tokens sent to model:
System prompt: 800 tokens (stable)
Tool definitions (×6): 1,500 tokens (stable)
Conversation history (turns 1-N): 4,200 tokens (GROWS each iteration)
Tool results from prior steps: 2,100 tokens (GROWS each iteration)
Current observation/task: 300 tokens
──────────────────────────────────────────────
Total input: 8,900 tokens
Output tokens generated:
Reasoning text: 240 tokens
Tool call (function + arguments): 180 tokens
──────────────────────────────────────────────
Total output: 420 tokens
This iteration total: 9,320 tokens
──────────────────────────────────────────────────────────────────
The compounding problem — token accumulation across iterations:
In most agentic frameworks (LangChain, LlamaIndex, CrewAI, custom loops), the full conversation history — including every tool call and every tool result — is appended to the next request. This means:
Iteration 1: 1,500 tokens in history → 5,200 total input
Iteration 2: 3,100 tokens in history → 6,800 total input (+31%)
Iteration 3: 4,800 tokens in history → 8,500 total input (+25%)
Iteration 4: 6,600 tokens in history → 10,300 total input (+21%)
Iteration 5: 8,700 tokens in history → 12,400 total input (+20%)
...
Iteration 10: 22,000 tokens in history → 25,700 total input
By iteration 10, the input size is 5× what it was at iteration 1 — even if the new information added in each iteration was minimal.
Cumulative cost of a 10-iteration debugging session:
def model_agentic_loop_cost(
    iterations: int,
    system_tokens: int = 800,
    tools_tokens: int = 1500,
    history_growth_per_iter: int = 800,  # avg tokens added to history each iter
    initial_context_tokens: int = 1000,
    output_per_iter: int = 420,
    input_price_per_million: float = 3.00,
    output_price_per_million: float = 15.00,
) -> dict:
    total_input = 0
    total_output = 0
    fixed = system_tokens + tools_tokens
    per_iter = []
    for i in range(1, iterations + 1):
        history_so_far = initial_context_tokens + (i - 1) * history_growth_per_iter
        input_tokens = fixed + history_so_far
        total_input += input_tokens
        total_output += output_per_iter
        per_iter.append({"iteration": i, "input": input_tokens, "cumulative_input": total_input})
    cost = (total_input / 1e6 * input_price_per_million
            + total_output / 1e6 * output_price_per_million)
    print(f"{'Iter':>5} | {'Input':>8} | {'Cumulative Input':>17}")
    print("-" * 38)
    for row in per_iter:
        print(f"{row['iteration']:>5} | {row['input']:>8,} | {row['cumulative_input']:>17,}")
    print(f"\nTotal input tokens: {total_input:,}")
    print(f"Total output tokens: {total_output:,}")
    print(f"Total cost: ${cost:.4f}")
    return {"total_input": total_input, "total_output": total_output, "cost": cost}

result = model_agentic_loop_cost(iterations=10)
Run with the defaults above, this model puts a 10-iteration session at roughly $0.27 on Claude Sonnet pricing ($3/M input, $15/M output), and that assumes only ~800 tokens of history growth per iteration. An unoptimized agent that accumulates all tool results verbatim can push the same 10 iterations to $0.50–$1.20, depending on how verbose the tool outputs are.
Tip: Before optimizing, instrument your existing agentic loops to log the token count for each iteration separately. Most teams are shocked to discover that iterations 7–10 of a typical coding agent are 4–6× more expensive than iteration 1 — not because the task is harder, but because the context has ballooned with accumulated history. The data makes the case for optimization more powerfully than any diagram.
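One minimal way to get that per-iteration visibility, assuming the Anthropic Python SDK (the call_and_log wrapper and usage_log list are illustrative names, not part of any framework):
import anthropic

client = anthropic.Anthropic()
usage_log: list[dict] = []

def call_and_log(iteration: int, messages: list[dict], **kwargs):
    """Make one agent call and record its token usage before returning the response."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=messages,
        **kwargs,
    )
    usage_log.append({
        "iteration": iteration,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    return response
Logging from response.usage rather than estimating from text length captures exactly what you were billed for, including tool definitions and any cached prefixes.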
The Four Multiplication Factors of Agentic Token Costs
Understanding exactly why agentic loops multiply costs helps you target the right fix. There are four distinct multiplication mechanisms:
1. History Re-transmission (Linear Growth)
The most common factor. Every prior turn — both model messages and tool results — is re-sent in full on every subsequent request. A 20-turn debugging session transmits turn 1 a total of 20 times, turn 2 a total of 19 times, and so on. The total transmission of turn-1 content is 20× its original token count.
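A quick back-of-the-envelope model makes the effect concrete (the per-turn token count below is an illustrative assumption, not measured data):
# Turn k (1-indexed) is re-sent on every request from turn k onward,
# so it is transmitted (N - k + 1) times over an N-turn session.
turn_tokens = [600] * 20  # assume ~600 tokens per turn across 20 turns
N = len(turn_tokens)
total_transmitted = sum(t * (N - k + 1) for k, t in enumerate(turn_tokens, start=1))
print(f"{sum(turn_tokens):,} tokens authored, {total_transmitted:,} tokens transmitted")
# 12,000 tokens authored, 126,000 tokens transmitted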
2. Verbose Tool Output Accumulation (Exponential Risk)
When tools return large outputs (file reads, API responses, search results), those outputs are added to history and re-sent verbatim on all subsequent iterations. A single read_file call that returns 3,000 tokens of file content will be re-transmitted on every subsequent iteration — potentially adding 30,000+ tokens of redundant data across a 10-iteration session.
# Naive: the full tool output enters history and is re-sent on EVERY future iteration
tool_result = {
    "tool": "read_file",
    "output": "[entire 3000-token file contents]"
}

# Compressed: a summary enters history; the raw content lives in a separate cache
tool_result = {
    "tool": "read_file",
    "output": "File read successfully. Key findings: function `validate_user()` at line 142 "
              "uses MD5 for password hashing (VULNERABILITY). "
              "Full contents cached in session_store['user_service.py']."
    # Token cost: ~50 tokens instead of 3,000, with the important info preserved
}
3. Re-planning Overhead (Multiplicative)
Many agentic frameworks re-generate a full plan at the start of each iteration: "Given what I know so far, my plan is: Step 1... Step 2... Step 3...". This planning output — which can be 200–500 tokens — is then stored in history and re-sent on all future iterations. A 10-iteration agent with 300 tokens of planning output per iteration accumulates 3,000 tokens of planning text that is re-transmitted cumulatively.
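One mitigation is to keep only the most recent plan in context and drop earlier ones before each request. A sketch, assuming your loop tags planning messages with a marker when it appends them (the [PLAN] marker and helper name are hypothetical):
def drop_stale_plans(messages: list[dict], plan_marker: str = "[PLAN]") -> list[dict]:
    """Keep only the most recent planning message; drop older plans before re-sending."""
    plan_indexes = [i for i, m in enumerate(messages)
                    if isinstance(m.get("content"), str) and m["content"].startswith(plan_marker)]
    stale = set(plan_indexes[:-1])  # every plan except the latest one
    return [m for i, m in enumerate(messages) if i not in stale]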
4. Error Recovery Loops (Exponential Waste)
When an agent fails a step (a tool call returns an error, an assertion fails, the model generates invalid JSON), the naive response is to re-try with full context. In poorly designed agents, a single failure can trigger 3–5 re-try iterations, each paying the full compounded context cost. An error at iteration 7 of a 10-iteration agent can double the total cost of the session.
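A sketch of one defensive pattern: cap retries per step and feed the model a short error summary instead of replaying the full failed exchange (the retry budget and helper are illustrative, not a standard API):
MAX_RETRIES_PER_STEP = 2

def retry_with_compact_error(messages: list[dict], error: Exception, attempt: int) -> list[dict] | None:
    """Return messages for a retry attempt, or None once the retry budget is spent."""
    if attempt >= MAX_RETRIES_PER_STEP:
        return None  # give up and let the exit controller escalate
    # Send only a one-line error summary, not the full failed tool exchange
    return messages + [{
        "role": "user",
        "content": f"The previous step failed: {type(error).__name__}: {str(error)[:200]}. "
                   "Adjust your approach and try a different action.",
    }]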
Tip: Profile each of these four factors independently in your agent. Build logging that tracks: (1) total history token growth per iteration, (2) tool output size per call, (3) re-planning token generation, (4) error recovery iteration count. This four-metric dashboard will immediately reveal where your agent's token budget is leaking.
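A minimal shape for that four-metric log, assuming you record one row per iteration (the class and field names are illustrative):
from dataclasses import dataclass, field

@dataclass
class IterationMetrics:
    """One dashboard row per iteration, one field per multiplication factor."""
    iteration: int
    history_tokens: int        # factor 1: history re-transmission
    tool_output_tokens: int    # factor 2: verbose tool output accumulation
    planning_tokens: int       # factor 3: re-planning overhead
    is_error_retry: bool       # factor 4: error recovery loops

@dataclass
class AgentRunMetrics:
    rows: list[IterationMetrics] = field(default_factory=list)

    def record(self, row: IterationMetrics) -> None:
        self.rows.append(row)

    def error_retry_count(self) -> int:
        return sum(1 for r in self.rows if r.is_error_retry)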
Tool Result Compression — Defusing the Biggest Multiplier
Tool output accumulation is typically the highest-leverage optimization target. The pattern is simple: instead of storing full tool output in the agent's working memory (conversation history), you store a compressed summary and the raw data in a separate store.
Implementing a tool result compressor:
import anthropic
from typing import Any
import hashlib
class ToolResultCompressor:
    """
    Compress large tool outputs before they enter the agent's conversation history.
    Stores full output in a separate cache for reference.
    """
    def __init__(self, max_inline_tokens: int = 300):
        self.max_inline_tokens = max_inline_tokens
        self._cache: dict[str, str] = {}
        self._client = anthropic.Anthropic()

    def _token_count(self, text: str) -> int:
        # Simple approximation: 4 chars ≈ 1 token
        return len(text) // 4

    def compress(self, tool_name: str, tool_output: Any) -> str:
        """
        If tool output is small, return as-is.
        If large, compress to a summary and cache the full output.
        """
        output_str = str(tool_output)
        token_count = self._token_count(output_str)
        if token_count <= self.max_inline_tokens:
            return output_str  # Small enough to include directly
        # Store full output in cache
        cache_key = hashlib.md5(output_str.encode()).hexdigest()[:8]
        self._cache[cache_key] = output_str
        # Generate a compact summary (send only the first 4,000 chars to the summarizer)
        summary_prompt = f"""Summarize this {tool_name} output in under 100 words.
Preserve: key values, identifiers, errors, important findings, and any numbers.
Discard: boilerplate, verbose descriptions, repeated information.

Output:
{output_str[:4000]}"""
        response = self._client.messages.create(
            model="claude-3-5-haiku-latest",  # Use cheaper model for compression
            max_tokens=200,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        summary = response.content[0].text
        return f"[Compressed — cache_key:{cache_key}]\n{summary}"

    def retrieve_full(self, cache_key: str) -> str | None:
        return self._cache.get(cache_key)
def run_agent_with_compression(task: str, tools: list) -> str:
    client = anthropic.Anthropic()
    compressor = ToolResultCompressor(max_inline_tokens=200)
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )
        if response.stop_reason != "tool_use":
            # end_turn (or any other stop reason): return the final text
            return response.content[0].text
        # Process tool calls
        tool_use_blocks = [b for b in response.content if b.type == "tool_use"]
        # Add assistant message (tool calls) to history
        messages.append({"role": "assistant", "content": response.content})
        # Execute tools and compress results before they enter history
        tool_results = []
        for tool_call in tool_use_blocks:
            raw_result = execute_tool(tool_call.name, tool_call.input)
            compressed = compressor.compress(tool_call.name, raw_result)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": compressed
            })
        messages.append({"role": "user", "content": tool_results})

def execute_tool(name: str, inputs: dict) -> Any:
    # Placeholder — replace with actual tool execution
    return f"Tool {name} executed with inputs: {inputs}"
Token savings from tool result compression:
Without compression (10 iterations, avg 2 tool calls with 1,500-token outputs):
History growth: ~3,200 tokens/iteration
Total input tokens across 10 iterations: ~198,000 tokens
With compression (same agent, tool outputs compressed to ~150 tokens):
History growth: ~350 tokens/iteration
Total input tokens across 10 iterations: ~48,000 tokens
Reduction: 75.8% fewer input tokens, roughly $0.59 → $0.14 of input cost per session at Claude Sonnet's $3/M input rate
Tip: Use a cheaper, faster model (Claude Haiku, GPT-4o mini) for tool result compression. The compression task is straightforward and does not require the full capability of the primary agent model. This keeps the compression overhead (latency and cost) minimal while delivering major savings on the primary model's context costs.
History Management Strategies — Preventing Unbounded Growth
History growth is the linear component of token multiplication, but it is persistent and guaranteed. Every agent session that lacks a history management strategy will eventually hit context limits or cost ceilings.
Strategy 1: Fixed-size sliding window
Keep only the last N turns. Simple, predictable, but loses earlier context.
def sliding_window_history(
    messages: list[dict],
    max_turns: int = 10
) -> list[dict]:
    """Keep last N complete turns (user+assistant pairs)."""
    # Always keep the first exchange (initial task context)
    if len(messages) <= 2:
        return messages
    first_pair = messages[:2]
    recent = messages[2:]
    # Keep last max_turns turns (each turn = 2 messages)
    max_messages = max_turns * 2
    if len(recent) > max_messages:
        recent = recent[-max_messages:]
    return first_pair + recent
Strategy 2: Milestone-based compression
Compress history into a milestone summary every N iterations.
def compress_history_at_milestone(
    messages: list[dict],
    milestone_every: int = 5,
    iteration: int = 0,
    client: anthropic.Anthropic | None = None
) -> list[dict]:
    """Every N iterations, replace old history with a compact summary."""
    if iteration % milestone_every != 0 or iteration == 0:
        return messages
    if len(messages) < 4:
        return messages
    # Compress everything except the last 2 turns
    to_compress = messages[:-2]
    keep = messages[-2:]
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content'] if isinstance(m['content'], str) else '[tool interaction]'}"
        for m in to_compress
    )
    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"Summarize this agent session history in under 150 words. "
                       f"Preserve: decisions made, files modified, errors encountered, "
                       f"current state, next steps planned.\n\n{history_text}"
        }]
    )
    summary = summary_response.content[0].text
    summary_message = {
        "role": "user",
        "content": f"[Session history summary — {len(to_compress)} turns compressed]\n{summary}"
    }
    return [summary_message] + keep
Strategy 3: Semantic deduplication
If the agent is reading the same files or calling the same tools repeatedly (common in debugging loops), deduplicate the results.
import hashlib

class DeduplicatingHistory:
    """Track what information has already been added to history."""
    def __init__(self):
        self._seen_content_hashes: set[str] = set()
        self.messages: list[dict] = []

    def add(self, message: dict) -> bool:
        """Add message to history. Returns False if content was already seen."""
        content = message.get("content", "")
        if not isinstance(content, str):
            content = str(content)
        content_hash = hashlib.md5(content.encode()).hexdigest()
        if content_hash in self._seen_content_hashes:
            # Replace with dedup note
            self.messages.append({
                "role": message["role"],
                "content": "[Deduplicated — identical content already in context]"
            })
            return False
        self._seen_content_hashes.add(content_hash)
        self.messages.append(message)
        return True
Tip: For product managers and QA engineers: the choice of history management strategy is a product decision, not just an engineering one. Sliding windows are cheap and simple but create "amnesiac agents" that forget early context. Compression preserves continuity but adds latency. Deduplication requires no extra LLM calls but only helps in certain agent patterns. Document which strategy your agent uses and validate in testing that the chosen strategy does not cause the agent to lose track of critical task state.
Sub-Agent Delegation — Scoping Costs Through Decomposition
The single most powerful architectural pattern for controlling agentic token costs is sub-agent delegation: breaking a large task into scoped sub-tasks, each handled by an independent agent with its own minimal context.
Why scoping matters for token costs:
Monolithic agent (single context):
System prompt: 800 tokens (shared)
Full codebase context: 45,000 tokens (all loaded upfront)
Task instructions: 500 tokens
History (20 turns): 12,000 tokens by the final turn (grows unboundedly)
────────────────────────────────────
Per-request input at turn 20: 58,300 tokens
Cost for 20-turn session:
Sum(800 + 45,000 + 500 + i*600 for i in 1..20) = ~1.05M input tokens
= $3.16 on Claude Sonnet
Sub-agent architecture (orchestrator + 4 specialized agents):
Orchestrator (minimal context): 2,000 tokens/call × 5 calls = 10,000
File analysis agent (targeted files): 8,000 tokens/call × 3 calls = 24,000
Test generator agent (focused spec): 5,000 tokens/call × 4 calls = 20,000
Code reviewer agent (single PR): 6,000 tokens/call × 2 calls = 12,000
Summary agent (compressed outputs): 3,000 tokens/call × 1 call = 3,000
────────────────────────────────────────────────────────────────────────────
Total input tokens: 69,000
= $0.21 on Claude Sonnet
Savings: ~$2.95 (93%) in input cost for equivalent functional output
Implementing a lightweight orchestrator pattern:
import anthropic
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    id: str
    description: str
    context: str          # Only the context THIS sub-agent needs
    expected_output: str  # What format/content we expect back

class OrchestratorAgent:
    def __init__(self, client: anthropic.Anthropic):
        self.client = client

    def decompose_task(self, task: str) -> list[SubTask]:
        """Use the orchestrator to break a task into minimal sub-tasks."""
        response = self.client.messages.create(
            model="claude-3-5-haiku-latest",  # Cheap model for decomposition
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""Break this task into independent sub-tasks.
For each sub-task, specify ONLY the context it actually needs.
Return a JSON array: [{{"id": "...", "description": "...", "required_context": "...", "expected_output": "..."}}]

Task: {task}"""
            }]
        )
        raw = json.loads(response.content[0].text)
        return [
            SubTask(
                id=item["id"],
                description=item["description"],
                context=item["required_context"],
                expected_output=item["expected_output"],
            )
            for item in raw
        ]

    def run_sub_agent(
        self,
        sub_task: SubTask,
        context_loader: Callable[[str], str],
        model: str = "claude-3-5-haiku-latest"
    ) -> str:
        """Run a sub-agent with minimal, scoped context."""
        # Resolve the sub-task's context requirement into actual content (e.g. file text)
        context = context_loader(sub_task.context)
        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Task: {sub_task.description}\n\n"
                           f"Context:\n{context}\n\n"
                           f"Return: {sub_task.expected_output}"
            }]
        )
        return response.content[0].text

    def aggregate_results(self, results: list[str], original_task: str) -> str:
        """Aggregate sub-agent results into a final answer."""
        results_text = "\n\n---\n\n".join(
            f"Sub-task result {i+1}:\n{r}" for i, r in enumerate(results)
        )
        response = self.client.messages.create(
            model="claude-3-5-haiku-latest",
            max_tokens=800,
            messages=[{
                "role": "user",
                "content": f"Original task: {original_task}\n\n"
                           f"Sub-results:\n{results_text}\n\n"
                           f"Synthesize these into a final, coherent answer."
            }]
        )
        return response.content[0].text
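A minimal usage sketch for the orchestrator above; the load_context_for helper is hypothetical and stands in for whatever file or document retrieval your system uses:
client = anthropic.Anthropic()
orchestrator = OrchestratorAgent(client)

def load_context_for(context_hint: str) -> str:
    # Hypothetical loader: fetch only the files/snippets named in the hint
    return f"[relevant content for: {context_hint}]"

task = "Find and fix the password-hashing vulnerability in user_service.py, then add a regression test."
sub_tasks = orchestrator.decompose_task(task)
results = [orchestrator.run_sub_agent(st, load_context_for) for st in sub_tasks]
final_answer = orchestrator.aggregate_results(results, task)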
Tip: When designing sub-agent boundaries, ask: "What is the minimum context this sub-agent needs to complete its specific task?" A sub-agent that analyzes one file should receive only that file plus relevant interface definitions — not the entire codebase. This scoping discipline is the fundamental driver of token efficiency in multi-agent systems and is covered in depth in Module 8.
Early Termination and Exit Conditions — Knowing When to Stop
Many agentic loops run longer than necessary because they lack effective exit conditions. An agent that is "almost done" at iteration 6 but runs to iteration 15 because it keeps second-guessing itself has wasted 9 iterations worth of compounded context costs.
Implementing robust exit conditions:
from enum import Enum

class ExitReason(Enum):
    TASK_COMPLETE = "task_complete"
    CONFIDENCE_THRESHOLD_MET = "confidence_threshold"
    MAX_ITERATIONS_REACHED = "max_iterations"
    TOKEN_BUDGET_EXHAUSTED = "token_budget"
    ERROR_LIMIT_REACHED = "error_limit"
    HUMAN_ESCALATION_REQUIRED = "human_escalation"

class AgentExitController:
    def __init__(
        self,
        max_iterations: int = 15,
        max_tokens: int = 100_000,
        max_consecutive_errors: int = 3,
        confidence_threshold: float = 0.85
    ):
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.max_consecutive_errors = max_consecutive_errors
        self.confidence_threshold = confidence_threshold
        self.iteration_count = 0
        self.total_tokens = 0
        self.consecutive_errors = 0
        self.task_complete = False
        self.confidence = 0.0
        self.requires_human = False

    def check_exit(self) -> tuple[bool, ExitReason | None]:
        """Returns (should_exit, reason). Call after each iteration."""
        if self.task_complete:
            return True, ExitReason.TASK_COMPLETE
        if self.confidence >= self.confidence_threshold:
            return True, ExitReason.CONFIDENCE_THRESHOLD_MET
        if self.requires_human:
            return True, ExitReason.HUMAN_ESCALATION_REQUIRED
        if self.iteration_count >= self.max_iterations:
            return True, ExitReason.MAX_ITERATIONS_REACHED
        if self.total_tokens >= self.max_tokens:
            return True, ExitReason.TOKEN_BUDGET_EXHAUSTED
        if self.consecutive_errors >= self.max_consecutive_errors:
            return True, ExitReason.ERROR_LIMIT_REACHED
        return False, None

    def record_iteration(
        self,
        tokens_used: int,
        had_error: bool,
        task_complete: bool,
        confidence: float = 0.0,
        requires_human: bool = False
    ) -> None:
        self.iteration_count += 1
        self.total_tokens += tokens_used
        self.task_complete = task_complete
        self.confidence = confidence
        self.requires_human = requires_human
        if had_error:
            self.consecutive_errors += 1
        else:
            self.consecutive_errors = 0  # Reset on success
controller = AgentExitController(max_iterations=10, max_tokens=80_000)
while True:
    # Run one iteration (run_one_iteration and escalate_to_human are placeholders
    # for your own agent step and escalation handler)
    result = run_one_iteration()
    controller.record_iteration(
        tokens_used=result.tokens,
        had_error=result.had_error,
        task_complete=result.is_complete,
        confidence=result.confidence_score,
        requires_human=result.needs_human_review
    )
    should_exit, reason = controller.check_exit()
    if should_exit:
        print(f"Agent stopped: {reason.value}")
        if reason == ExitReason.HUMAN_ESCALATION_REQUIRED:
            escalate_to_human(result)
        break
Tip: For QA engineers: test your agent's exit conditions as thoroughly as you test its happy path. Specifically test: (1) that it terminates correctly when the task is done, (2) that it terminates gracefully when it hits the iteration limit, and (3) that it does not enter an infinite loop when a tool repeatedly fails. These failure modes are not edge cases — they are the most common source of runaway token costs in production agentic systems.
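A sketch of what those three checks can look like as unit tests against the AgentExitController above (pytest-style, using the class's default thresholds unless noted):
def test_stops_when_task_complete():
    controller = AgentExitController()
    controller.record_iteration(tokens_used=5_000, had_error=False, task_complete=True)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.TASK_COMPLETE

def test_stops_at_iteration_limit():
    controller = AgentExitController(max_iterations=3)
    for _ in range(3):
        controller.record_iteration(tokens_used=5_000, had_error=False, task_complete=False)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.MAX_ITERATIONS_REACHED

def test_stops_after_repeated_tool_failures():
    controller = AgentExitController(max_consecutive_errors=3)
    for _ in range(3):
        controller.record_iteration(tokens_used=5_000, had_error=True, task_complete=False)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.ERROR_LIMIT_REACHED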