Single-shot LLM calls are expensive but predictable. Agentic loops are expensive in a fundamentally different way: costs compound with every iteration, and without deliberate design, a task that should cost $0.05 ends up costing $2.00 because the agent re-read the same files twelve times, re-appended its full history to every request, and generated redundant intermediate summaries nobody asked for. This topic dissects the token cost anatomy of agentic loops and provides the engineering patterns needed to control those costs.
The Anatomy of an Agentic Loop — Where Every Token Goes
An agentic loop follows a Plan → Execute → Observe → Plan cycle. Each iteration of this cycle incurs a full token bill: the entire accumulated context must be re-sent on every API call, and the model generates both reasoning and action tokens before producing any observable result.
A single agentic loop iteration — token breakdown:
Iteration N of a code debugging agent
──────────────────────────────────────────────────────────────────
Input tokens sent to model:
System prompt: 800 tokens (stable)
Tool definitions (×6): 1,500 tokens (stable)
Conversation history (turns 1-N): 4,200 tokens (GROWS each iteration)
Tool results from prior steps: 2,100 tokens (GROWS each iteration)
Current observation/task: 300 tokens
──────────────────────────────────────────────
Total input: 8,900 tokens
Output tokens generated:
Reasoning text: 240 tokens
Tool call (function + arguments): 180 tokens
──────────────────────────────────────────────
Total output: 420 tokens
This iteration total: 9,320 tokens
──────────────────────────────────────────────────────────────────
The compounding problem — token accumulation across iterations:
In most agentic frameworks (LangChain, LlamaIndex, CrewAI, custom loops), the full conversation history — including every tool call and every tool result — is appended to the next request. This means:
Iteration 1: 1,500 tokens in history → 5,200 total input
Iteration 2: 3,100 tokens in history → 6,800 total input (+31%)
Iteration 3: 4,800 tokens in history → 8,500 total input (+25%)
Iteration 4: 6,600 tokens in history → 10,300 total input (+21%)
Iteration 5: 8,700 tokens in history → 12,400 total input (+20%)
...
Iteration 10: 22,000 tokens in history → 25,700 total input
By iteration 10, the input size is 5× what it was at iteration 1 — even if the new information added in each iteration was minimal.
Cumulative cost of a 10-iteration debugging session:
def model_agentic_loop_cost(
    iterations: int,
    system_tokens: int = 800,
    tools_tokens: int = 1500,
    history_growth_per_iter: int = 800,  # avg tokens added to history each iter
    initial_context_tokens: int = 1000,
    output_per_iter: int = 420,
    input_price_per_million: float = 3.00,
    output_price_per_million: float = 15.00,
) -> dict:
    total_input = 0
    total_output = 0
    fixed = system_tokens + tools_tokens
    per_iter = []
    for i in range(1, iterations + 1):
        history_so_far = initial_context_tokens + (i - 1) * history_growth_per_iter
        input_tokens = fixed + history_so_far
        total_input += input_tokens
        total_output += output_per_iter
        per_iter.append({"iteration": i, "input": input_tokens, "cumulative_input": total_input})
    cost = (total_input / 1e6 * input_price_per_million
            + total_output / 1e6 * output_price_per_million)
    print(f"{'Iter':>5} | {'Input':>8} | {'Cumulative Input':>17}")
    print("-" * 38)
    for row in per_iter:
        print(f"{row['iteration']:>5} | {row['input']:>8,} | {row['cumulative_input']:>17,}")
    print(f"\nTotal input tokens: {total_input:,}")
    print(f"Total output tokens: {total_output:,}")
    print(f"Total cost: ${cost:.4f}")
    return {"total_input": total_input, "total_output": total_output, "cost": cost}

result = model_agentic_loop_cost(iterations=10)
Run with the defaults above, this model puts a 10-iteration session at roughly $0.27 on Claude Sonnet pricing ($3/M input, $15/M output), and that assumes only ~800 tokens of history growth per iteration. An unoptimized agent that accumulates all tool results verbatim can push the same 10 iterations to $0.50–$1.20, depending on how verbose the tool outputs are.
Tip: Before optimizing, instrument your existing agentic loops to log the token count for each iteration separately. Most teams are shocked to discover that iterations 7–10 of a typical coding agent are 4–6× more expensive than iteration 1 — not because the task is harder, but because the context has ballooned with accumulated history. The data makes the case for optimization more powerfully than any diagram.
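One minimal way to get that per-iteration visibility, assuming the Anthropic Python SDK (the call_and_log wrapper and usage_log list are illustrative names, not part of any framework):
import anthropic

client = anthropic.Anthropic()
usage_log: list[dict] = []

def call_and_log(iteration: int, messages: list[dict], **kwargs):
    """Make one agent call and record its token usage before returning the response."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=messages,
        **kwargs,
    )
    usage_log.append({
        "iteration": iteration,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    return response
Logging from response.usage rather than estimating from text length captures exactly what you were billed for, including tool definitions and any cached prefixes.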
The Four Multiplication Factors of Agentic Token Costs
Understanding exactly why agentic loops multiply costs helps you target the right fix. There are four distinct multiplication mechanisms:
1. History Re-transmission (Linear Growth)
The most common factor. Every prior turn — both model messages and tool results — is re-sent in full on every subsequent request. A 20-turn debugging session transmits turn 1 a total of 20 times, turn 2 a total of 19 times, and so on. The total transmission of turn-1 content is 20× its original token count.
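A quick back-of-the-envelope model makes the effect concrete (the per-turn token count below is an illustrative assumption, not measured data):
# Turn k (1-indexed) is re-sent on every request from turn k onward,
# so it is transmitted (N - k + 1) times over an N-turn session.
turn_tokens = [600] * 20  # assume ~600 tokens per turn across 20 turns
N = len(turn_tokens)
total_transmitted = sum(t * (N - k + 1) for k, t in enumerate(turn_tokens, start=1))
print(f"{sum(turn_tokens):,} tokens authored, {total_transmitted:,} tokens transmitted")
# 12,000 tokens authored, 126,000 tokens transmitted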
2. Verbose Tool Output Accumulation (Exponential Risk)
When tools return large outputs (file reads, API responses, search results), those outputs are added to history and re-sent verbatim on all subsequent iterations. A single read_file call that returns 3,000 tokens of file content will be re-transmitted on every subsequent iteration — potentially adding 30,000+ tokens of redundant data across a 10-iteration session.
# Naive: the full tool output enters history and is re-sent on EVERY future iteration
tool_result = {
    "tool": "read_file",
    "output": "[entire 3000-token file contents]"
}

# Compressed: a summary enters history; the raw content lives in a separate cache
tool_result = {
    "tool": "read_file",
    "output": "File read successfully. Key findings: function `validate_user()` at line 142 "
              "uses MD5 for password hashing (VULNERABILITY). "
              "Full contents cached in session_store['user_service.py']."
    # Token cost: ~50 tokens instead of 3,000, with the important info preserved
}
3. Re-planning Overhead (Multiplicative)
Many agentic frameworks re-generate a full plan at the start of each iteration: "Given what I know so far, my plan is: Step 1... Step 2... Step 3...". This planning output — which can be 200–500 tokens — is then stored in history and re-sent on all future iterations. A 10-iteration agent with 300 tokens of planning output per iteration accumulates 3,000 tokens of planning text that is re-transmitted cumulatively.
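One mitigation is to keep only the most recent plan in context and drop earlier ones before each request. A sketch, assuming your loop tags planning messages with a marker when it appends them (the [PLAN] marker and helper name are hypothetical):
def drop_stale_plans(messages: list[dict], plan_marker: str = "[PLAN]") -> list[dict]:
    """Keep only the most recent planning message; drop older plans before re-sending."""
    plan_indexes = [i for i, m in enumerate(messages)
                    if isinstance(m.get("content"), str) and m["content"].startswith(plan_marker)]
    stale = set(plan_indexes[:-1])  # every plan except the latest one
    return [m for i, m in enumerate(messages) if i not in stale]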
4. Error Recovery Loops (Exponential Waste)
When an agent fails a step (a tool call returns an error, an assertion fails, the model generates invalid JSON), the naive response is to re-try with full context. In poorly designed agents, a single failure can trigger 3–5 re-try iterations, each paying the full compounded context cost. An error at iteration 7 of a 10-iteration agent can double the total cost of the session.
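A sketch of one defensive pattern: cap retries per step and feed the model a short error summary instead of replaying the full failed exchange (the retry budget and helper are illustrative, not a standard API):
MAX_RETRIES_PER_STEP = 2

def retry_with_compact_error(messages: list[dict], error: Exception, attempt: int) -> list[dict] | None:
    """Return messages for a retry attempt, or None once the retry budget is spent."""
    if attempt >= MAX_RETRIES_PER_STEP:
        return None  # give up and let the exit controller escalate
    # Send only a one-line error summary, not the full failed tool exchange
    return messages + [{
        "role": "user",
        "content": f"The previous step failed: {type(error).__name__}: {str(error)[:200]}. "
                   "Adjust your approach and try a different action.",
    }]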
Tip: Profile each of these four factors independently in your agent. Build logging that tracks: (1) total history token growth per iteration, (2) tool output size per call, (3) re-planning token generation, (4) error recovery iteration count. This four-metric dashboard will immediately reveal where your agent's token budget is leaking.
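A minimal shape for that four-metric log, assuming you record one row per iteration (the class and field names are illustrative):
from dataclasses import dataclass, field

@dataclass
class IterationMetrics:
    """One dashboard row per iteration, one field per multiplication factor."""
    iteration: int
    history_tokens: int        # factor 1: history re-transmission
    tool_output_tokens: int    # factor 2: verbose tool output accumulation
    planning_tokens: int       # factor 3: re-planning overhead
    is_error_retry: bool       # factor 4: error recovery loops

@dataclass
class AgentRunMetrics:
    rows: list[IterationMetrics] = field(default_factory=list)

    def record(self, row: IterationMetrics) -> None:
        self.rows.append(row)

    def error_retry_count(self) -> int:
        return sum(1 for r in self.rows if r.is_error_retry)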
Tool Result Compression — Defusing the Biggest Multiplier
Tool output accumulation is typically the highest-leverage optimization target. The pattern is simple: instead of storing full tool output in the agent's working memory (conversation history), you store a compressed summary and the raw data in a separate store.
Implementing a tool result compressor:
import anthropic
from typing import Any
import hashlib
class ToolResultCompressor:
    """
    Compress large tool outputs before they enter the agent's conversation history.
    Stores full output in a separate cache for reference.
    """
    def __init__(self, max_inline_tokens: int = 300):
        self.max_inline_tokens = max_inline_tokens
        self._cache: dict[str, str] = {}
        self._client = anthropic.Anthropic()

    def _token_count(self, text: str) -> int:
        # Simple approximation: 4 chars ≈ 1 token
        return len(text) // 4

    def compress(self, tool_name: str, tool_output: Any) -> str:
        """
        If tool output is small, return as-is.
        If large, compress to a summary and cache the full output.
        """
        output_str = str(tool_output)
        token_count = self._token_count(output_str)
        if token_count <= self.max_inline_tokens:
            return output_str  # Small enough to include directly
        # Store full output in cache
        cache_key = hashlib.md5(output_str.encode()).hexdigest()[:8]
        self._cache[cache_key] = output_str
        # Generate a compact summary (send only the first 4,000 chars to the summarizer)
        summary_prompt = f"""Summarize this {tool_name} output in under 100 words.
Preserve: key values, identifiers, errors, important findings, and any numbers.
Discard: boilerplate, verbose descriptions, repeated information.

Output:
{output_str[:4000]}"""
        response = self._client.messages.create(
            model="claude-3-5-haiku-latest",  # Use cheaper model for compression
            max_tokens=200,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        summary = response.content[0].text
        return f"[Compressed — cache_key:{cache_key}]\n{summary}"

    def retrieve_full(self, cache_key: str) -> str | None:
        return self._cache.get(cache_key)
def run_agent_with_compression(task: str, tools: list) -> str:
    client = anthropic.Anthropic()
    compressor = ToolResultCompressor(max_inline_tokens=200)
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )
        if response.stop_reason != "tool_use":
            # end_turn (or any other stop reason): return the final text
            return response.content[0].text
        # Process tool calls
        tool_use_blocks = [b for b in response.content if b.type == "tool_use"]
        # Add assistant message (tool calls) to history
        messages.append({"role": "assistant", "content": response.content})
        # Execute tools and compress results before they enter history
        tool_results = []
        for tool_call in tool_use_blocks:
            raw_result = execute_tool(tool_call.name, tool_call.input)
            compressed = compressor.compress(tool_call.name, raw_result)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": compressed
            })
        messages.append({"role": "user", "content": tool_results})

def execute_tool(name: str, inputs: dict) -> Any:
    # Placeholder — replace with actual tool execution
    return f"Tool {name} executed with inputs: {inputs}"
Token savings from tool result compression:
Without compression (10 iterations, avg 2 tool calls with 1,500-token outputs):
History growth: ~3,200 tokens/iteration
Total input tokens across 10 iterations: ~198,000 tokens
With compression (same agent, tool outputs compressed to ~150 tokens):
History growth: ~350 tokens/iteration
Total input tokens across 10 iterations: ~48,000 tokens
Reduction: 75.8% fewer input tokens, roughly $0.59 → $0.14 of input cost per session at Claude Sonnet's $3/M input rate
Tip: Use a cheaper, faster model (Claude Haiku, GPT-4o mini) for tool result compression. The compression task is straightforward and does not require the full capability of the primary agent model. This keeps the compression overhead (latency and cost) minimal while delivering major savings on the primary model's context costs.
History Management Strategies — Preventing Unbounded Growth
History growth is the linear component of token multiplication, but it is persistent and guaranteed. Every agent session that lacks a history management strategy will eventually hit context limits or cost ceilings.
Strategy 1: Fixed-size sliding window
Keep only the last N turns. Simple, predictable, but loses earlier context.
def sliding_window_history(
    messages: list[dict],
    max_turns: int = 10
) -> list[dict]:
    """Keep last N complete turns (user+assistant pairs)."""
    # Always keep the first exchange (initial task context)
    if len(messages) <= 2:
        return messages
    first_pair = messages[:2]
    recent = messages[2:]
    # Keep last max_turns turns (each turn = 2 messages)
    max_messages = max_turns * 2
    if len(recent) > max_messages:
        recent = recent[-max_messages:]
    return first_pair + recent
Strategy 2: Milestone-based compression
Compress history into a milestone summary every N iterations.
def compress_history_at_milestone(
    messages: list[dict],
    milestone_every: int = 5,
    iteration: int = 0,
    client: anthropic.Anthropic | None = None
) -> list[dict]:
    """Every N iterations, replace old history with a compact summary."""
    if iteration % milestone_every != 0 or iteration == 0:
        return messages
    if len(messages) < 4:
        return messages
    # Compress everything except the last 2 turns
    to_compress = messages[:-2]
    keep = messages[-2:]
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content'] if isinstance(m['content'], str) else '[tool interaction]'}"
        for m in to_compress
    )
    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"Summarize this agent session history in under 150 words. "
                       f"Preserve: decisions made, files modified, errors encountered, "
                       f"current state, next steps planned.\n\n{history_text}"
        }]
    )
    summary = summary_response.content[0].text
    summary_message = {
        "role": "user",
        "content": f"[Session history summary — {len(to_compress)} turns compressed]\n{summary}"
    }
    return [summary_message] + keep
Strategy 3: Semantic deduplication
If the agent is reading the same files or calling the same tools repeatedly (common in debugging loops), deduplicate the results.
import hashlib

class DeduplicatingHistory:
    """Track what information has already been added to history."""
    def __init__(self):
        self._seen_content_hashes: set[str] = set()
        self.messages: list[dict] = []

    def add(self, message: dict) -> bool:
        """Add message to history. Returns False if content was already seen."""
        content = message.get("content", "")
        if not isinstance(content, str):
            content = str(content)
        content_hash = hashlib.md5(content.encode()).hexdigest()
        if content_hash in self._seen_content_hashes:
            # Replace with dedup note
            self.messages.append({
                "role": message["role"],
                "content": "[Deduplicated — identical content already in context]"
            })
            return False
        self._seen_content_hashes.add(content_hash)
        self.messages.append(message)
        return True
Tip: For product managers and QA engineers: the choice of history management strategy is a product decision, not just an engineering one. Sliding windows are cheap and simple but create "amnesiac agents" that forget early context. Compression preserves continuity but adds latency. Deduplication requires no extra LLM calls but only helps in certain agent patterns. Document which strategy your agent uses and validate in testing that the chosen strategy does not cause the agent to lose track of critical task state.
Sub-Agent Delegation — Scoping Costs Through Decomposition
The single most powerful architectural pattern for controlling agentic token costs is sub-agent delegation: breaking a large task into scoped sub-tasks, each handled by an independent agent with its own minimal context.
Why scoping matters for token costs:
Monolithic agent (single context):
System prompt: 800 tokens (shared)
Full codebase context: 45,000 tokens (all loaded upfront)
Task instructions: 500 tokens
History (20 turns): 12,000 tokens by the final turn (grows unboundedly)
────────────────────────────────────
Per-request input at turn 20: 58,300 tokens
Cost for 20-turn session:
Sum(800 + 45,000 + 500 + i*600 for i in 1..20) = ~1.05M input tokens
= $3.16 on Claude Sonnet
Sub-agent architecture (orchestrator + 4 specialized agents):
Orchestrator (minimal context): 2,000 tokens/call × 5 calls = 10,000
File analysis agent (targeted files): 8,000 tokens/call × 3 calls = 24,000
Test generator agent (focused spec): 5,000 tokens/call × 4 calls = 20,000
Code reviewer agent (single PR): 6,000 tokens/call × 2 calls = 12,000
Summary agent (compressed outputs): 3,000 tokens/call × 1 call = 3,000
────────────────────────────────────────────────────────────────────────────
Total input tokens: 69,000
= $0.21 on Claude Sonnet
Savings: ~$2.95 (93%) in input cost for equivalent functional output
Implementing a lightweight orchestrator pattern:
import anthropic
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    id: str
    description: str
    context: str          # Only the context THIS sub-agent needs
    expected_output: str  # What format/content we expect back

class OrchestratorAgent:
    def __init__(self, client: anthropic.Anthropic):
        self.client = client

    def decompose_task(self, task: str) -> list[SubTask]:
        """Use the orchestrator to break a task into minimal sub-tasks."""
        response = self.client.messages.create(
            model="claude-3-5-haiku-latest",  # Cheap model for decomposition
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""Break this task into independent sub-tasks.
For each sub-task, specify ONLY the context it actually needs.
Return a JSON array: [{{"id": "...", "description": "...", "required_context": "...", "expected_output": "..."}}]

Task: {task}"""
            }]
        )
        raw = json.loads(response.content[0].text)
        return [
            SubTask(
                id=item["id"],
                description=item["description"],
                context=item["required_context"],
                expected_output=item["expected_output"],
            )
            for item in raw
        ]

    def run_sub_agent(
        self,
        sub_task: SubTask,
        context_loader: Callable[[str], str],
        model: str = "claude-3-5-haiku-latest"
    ) -> str:
        """Run a sub-agent with minimal, scoped context."""
        # Resolve the sub-task's context requirement into actual content (e.g. file text)
        context = context_loader(sub_task.context)
        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Task: {sub_task.description}\n\n"
                           f"Context:\n{context}\n\n"
                           f"Return: {sub_task.expected_output}"
            }]
        )
        return response.content[0].text

    def aggregate_results(self, results: list[str], original_task: str) -> str:
        """Aggregate sub-agent results into a final answer."""
        results_text = "\n\n---\n\n".join(
            f"Sub-task result {i+1}:\n{r}" for i, r in enumerate(results)
        )
        response = self.client.messages.create(
            model="claude-3-5-haiku-latest",
            max_tokens=800,
            messages=[{
                "role": "user",
                "content": f"Original task: {original_task}\n\n"
                           f"Sub-results:\n{results_text}\n\n"
                           f"Synthesize these into a final, coherent answer."
            }]
        )
        return response.content[0].text
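A minimal usage sketch for the orchestrator above; the load_context_for helper is hypothetical and stands in for whatever file or document retrieval your system uses:
client = anthropic.Anthropic()
orchestrator = OrchestratorAgent(client)

def load_context_for(context_hint: str) -> str:
    # Hypothetical loader: fetch only the files/snippets named in the hint
    return f"[relevant content for: {context_hint}]"

task = "Find and fix the password-hashing vulnerability in user_service.py, then add a regression test."
sub_tasks = orchestrator.decompose_task(task)
results = [orchestrator.run_sub_agent(st, load_context_for) for st in sub_tasks]
final_answer = orchestrator.aggregate_results(results, task)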
Tip: When designing sub-agent boundaries, ask: "What is the minimum context this sub-agent needs to complete its specific task?" A sub-agent that analyzes one file should receive only that file plus relevant interface definitions — not the entire codebase. This scoping discipline is the fundamental driver of token efficiency in multi-agent systems and is covered in depth in Module 8.
Early Termination and Exit Conditions — Knowing When to Stop
Many agentic loops run longer than necessary because they lack effective exit conditions. An agent that is "almost done" at iteration 6 but runs to iteration 15 because it keeps second-guessing itself has wasted 9 iterations worth of compounded context costs.
Implementing robust exit conditions:
from enum import Enum

class ExitReason(Enum):
    TASK_COMPLETE = "task_complete"
    CONFIDENCE_THRESHOLD_MET = "confidence_threshold"
    MAX_ITERATIONS_REACHED = "max_iterations"
    TOKEN_BUDGET_EXHAUSTED = "token_budget"
    ERROR_LIMIT_REACHED = "error_limit"
    HUMAN_ESCALATION_REQUIRED = "human_escalation"

class AgentExitController:
    def __init__(
        self,
        max_iterations: int = 15,
        max_tokens: int = 100_000,
        max_consecutive_errors: int = 3,
        confidence_threshold: float = 0.85
    ):
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.max_consecutive_errors = max_consecutive_errors
        self.confidence_threshold = confidence_threshold
        self.iteration_count = 0
        self.total_tokens = 0
        self.consecutive_errors = 0
        self.task_complete = False
        self.confidence = 0.0
        self.requires_human = False

    def check_exit(self) -> tuple[bool, ExitReason | None]:
        """Returns (should_exit, reason). Call after each iteration."""
        if self.task_complete:
            return True, ExitReason.TASK_COMPLETE
        if self.confidence >= self.confidence_threshold:
            return True, ExitReason.CONFIDENCE_THRESHOLD_MET
        if self.requires_human:
            return True, ExitReason.HUMAN_ESCALATION_REQUIRED
        if self.iteration_count >= self.max_iterations:
            return True, ExitReason.MAX_ITERATIONS_REACHED
        if self.total_tokens >= self.max_tokens:
            return True, ExitReason.TOKEN_BUDGET_EXHAUSTED
        if self.consecutive_errors >= self.max_consecutive_errors:
            return True, ExitReason.ERROR_LIMIT_REACHED
        return False, None

    def record_iteration(
        self,
        tokens_used: int,
        had_error: bool,
        task_complete: bool,
        confidence: float = 0.0,
        requires_human: bool = False
    ) -> None:
        self.iteration_count += 1
        self.total_tokens += tokens_used
        self.task_complete = task_complete
        self.confidence = confidence
        self.requires_human = requires_human
        if had_error:
            self.consecutive_errors += 1
        else:
            self.consecutive_errors = 0  # Reset on success
controller = AgentExitController(max_iterations=10, max_tokens=80_000)
while True:
    # Run one iteration (run_one_iteration and escalate_to_human are placeholders
    # for your own agent step and escalation handler)
    result = run_one_iteration()
    controller.record_iteration(
        tokens_used=result.tokens,
        had_error=result.had_error,
        task_complete=result.is_complete,
        confidence=result.confidence_score,
        requires_human=result.needs_human_review
    )
    should_exit, reason = controller.check_exit()
    if should_exit:
        print(f"Agent stopped: {reason.value}")
        if reason == ExitReason.HUMAN_ESCALATION_REQUIRED:
            escalate_to_human(result)
        break
Tip: For QA engineers: test your agent's exit conditions as thoroughly as you test its happy path. Specifically test: (1) that it terminates correctly when the task is done, (2) that it terminates gracefully when it hits the iteration limit, and (3) that it does not enter an infinite loop when a tool repeatedly fails. These failure modes are not edge cases — they are the most common source of runaway token costs in production agentic systems.
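A sketch of what those three checks can look like as unit tests against the AgentExitController above (pytest-style, using the class's default thresholds unless noted):
def test_stops_when_task_complete():
    controller = AgentExitController()
    controller.record_iteration(tokens_used=5_000, had_error=False, task_complete=True)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.TASK_COMPLETE

def test_stops_at_iteration_limit():
    controller = AgentExitController(max_iterations=3)
    for _ in range(3):
        controller.record_iteration(tokens_used=5_000, had_error=False, task_complete=False)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.MAX_ITERATIONS_REACHED

def test_stops_after_repeated_tool_failures():
    controller = AgentExitController(max_consecutive_errors=3)
    for _ in range(3):
        controller.record_iteration(tokens_used=5_000, had_error=True, task_complete=False)
    should_exit, reason = controller.check_exit()
    assert should_exit and reason == ExitReason.ERROR_LIMIT_REACHED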