A token budget is not just a cost control mechanism — it is a quality control mechanism. When an agentic system has no explicit budget, it tends toward verbosity, redundancy, and over-generation. When a budget is thoughtfully set and enforced, it forces precision in both what you ask and what the model returns. This topic covers how to design, implement, and govern token budgets across the full spectrum of agentic workflows.
The Anatomy of an Agentic Token Bill
Every agentic task consumes tokens across multiple categories. To budget effectively, you must understand each category's contribution to the total bill before you can optimize it.
The six categories of token consumption in agentic work:
- System prompt tokens — The standing instructions to the agent. Paid on every request unless prompt caching is active.
- Conversation history tokens — All prior turns in the session. Grows linearly with conversation length.
- Tool definition tokens — The JSON schemas of every tool registered with the agent. Paid on every request.
- Retrieved context tokens — Documents, code snippets, database records, or any external content pulled into the context.
- Reasoning/thinking tokens — For models with explicit chain-of-thought (Claude's extended thinking, OpenAI's o-series reasoning tokens). These are often billed at a premium.
- Output tokens — The model's response, tool call arguments, and any structured output.
Token budget accounting example — a code review agent, single review:
System prompt: 650 tokens (12%)
Conversation history: 420 tokens (8%)
Tool definitions (×8): 1,200 tokens (22%)
Retrieved context (PR): 2,800 tokens (51%)
Output (review): 410 tokens (7%)
─────────────────────────────────────────
Total billed: 5,480 tokens
Cost (Claude Sonnet): $0.0214
At 200 reviews/month, this is $4.27/month, which is quite reasonable. But notice that 73% of the total tokens come from tool definitions and retrieved context, not from the actual user message or history. This pattern holds across most agentic workloads: the framing overhead dominates the conversational content.
Tip: Track token usage per category, not just per request. When you break down the bill by component, you discover the actual optimization targets. Most teams assume conversation history is the biggest cost driver; in practice it is usually tool definitions and retrieved context. Logging per-component breakdowns is the difference between informed optimization and guesswork.
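A per-category ledger needn't be elaborate. The sketch below is a minimal illustration: the category names and recorded values mirror the example above, and in practice you would feed it real counts from your API responses and retrieval pipeline.
from collections import defaultdict

class TokenLedger:
    """Accumulate token counts per category to expose optimization targets."""
    def __init__(self):
        self.counts = defaultdict(int)

    def record(self, category: str, tokens: int) -> None:
        self.counts[category] += tokens

    def report(self) -> None:
        total = sum(self.counts.values())
        for category, tokens in sorted(self.counts.items(), key=lambda kv: -kv[1]):
            print(f"{category:<22} {tokens:>7,} ({tokens / total * 100:4.0f}%)")

ledger = TokenLedger()
ledger.record("retrieved_context", 2_800)
ledger.record("tool_definitions", 1_200)
ledger.record("system_prompt", 650)
ledger.record("conversation_history", 420)
ledger.record("output", 410)
ledger.report()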
Input Tokens vs. Output Tokens — Asymmetric Economics
Input and output tokens are not interchangeable, and the price differential has profound design implications.
Why output tokens cost more:
- Input tokens are processed in a single, parallelized forward pass through the model.
- Output tokens are generated autoregressively: each token is generated one at a time, and each requires a full forward pass plus KV-cache lookup.
- Output generation is therefore sequential, latency-bound, and compute-intensive per token.
- Typical ratio: output tokens cost 3× to 5× as much as input tokens on the same model.
The design implication: It is almost always cheaper to provide a large, rich input and ask for a compact, structured output than to provide a lean input and let the model "think out loud" in an unstructured response.
Prompt: "Tell me about this bug."
Response: 800 tokens of free-form narrative analysis
Prompt: "Analyze this bug report. Return a JSON with these fields:
- root_cause (≤20 words)
- severity (P0/P1/P2/P3)
- affected_components (list)
- suggested_fix (≤30 words)
- reproduction_steps (list, ≤5 items)
Bug report: [full context]"
Response: 150 tokens of structured JSON
This swap reduces output token cost by ~81% while delivering more useful information in less time. The model is doing the same cognitive work, but the output is compact and machine-parseable.
Calculating the cost impact of output compression:
def compare_output_strategies(
input_tokens: int,
verbose_output_tokens: int,
structured_output_tokens: int,
model_input_price: float = 3.00, # $/M tokens (Claude Sonnet)
model_output_price: float = 15.00, # $/M tokens
requests_per_month: int = 500
) -> None:
def monthly_cost(out_tokens):
per_request = (input_tokens / 1e6 * model_input_price
+ out_tokens / 1e6 * model_output_price)
return per_request * requests_per_month
verbose = monthly_cost(verbose_output_tokens)
structured = monthly_cost(structured_output_tokens)
savings = verbose - structured
pct = savings / verbose * 100
print(f"Verbose output ({verbose_output_tokens} tokens/req): ${verbose:.2f}/month")
print(f"Structured output ({structured_output_tokens} tokens/req): ${structured:.2f}/month")
print(f"Monthly savings: ${savings:.2f} ({pct:.1f}% reduction)")
compare_output_strategies(
input_tokens=3000,
verbose_output_tokens=800,
structured_output_tokens=150
)
Tip: Add explicit output length constraints to every agentic prompt. Instead of letting the model decide how much to write, tell it: "Respond in under 200 words," or "Return only a JSON object with these keys," or "Limit your response to 3 bullet points." This single discipline — constraining output — is the highest-leverage prompt optimization available.
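One way to make that discipline systematic is to centralize the constraint wording and pair it with a hard max_tokens ceiling. A minimal sketch, assuming a hypothetical constrain helper (not a library API):
import anthropic

client = anthropic.Anthropic()

def constrain(prompt: str, max_words: int = 200) -> str:
    """Append an explicit length constraint instead of letting the model decide."""
    return f"{prompt}\n\nRespond in under {max_words} words. No preamble."

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,  # hard ceiling; the prompt constraint keeps output well below it
    messages=[{"role": "user", "content": constrain("Summarize this incident report: [...]")}]
)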
Designing a Token Budget Framework
A token budget framework is a set of policies that govern how many tokens may be used per task, per session, and per workflow. Without it, token usage grows unchecked as agents handle increasingly complex tasks.
Three budget tiers:
1. Per-request budget — Maximum tokens for a single LLM API call (input + output combined). Prevents single runaway requests.
2. Per-session budget — Maximum tokens across all requests in one agent session. Prevents long-running agents from accumulating unbounded cost.
3. Per-workflow budget — Maximum tokens for a complete business workflow (e.g., "analyze this sprint"). Sets expectations with stakeholders and enables ROI calculations.
Implementing a per-request budget enforcer:
from dataclasses import dataclass, field
import anthropic
@dataclass
class TokenBudget:
max_input_tokens: int = 50_000
max_output_tokens: int = 4_096
max_session_tokens: int = 200_000
session_tokens_used: int = field(default=0, init=False)
def check_request(self, estimated_input: int) -> None:
"""Raise if this request would exceed limits."""
if estimated_input > self.max_input_tokens:
raise ValueError(
f"Input exceeds budget: {estimated_input} > {self.max_input_tokens}"
)
projected_session = self.session_tokens_used + estimated_input + self.max_output_tokens
if projected_session > self.max_session_tokens:
raise ValueError(
f"Session budget exhausted: {projected_session} > {self.max_session_tokens}"
)
def record_usage(self, input_tokens: int, output_tokens: int) -> None:
self.session_tokens_used += input_tokens + output_tokens
@property
def session_remaining(self) -> int:
return self.max_session_tokens - self.session_tokens_used
class BudgetedAgent:
def __init__(self, budget: TokenBudget, system_prompt: str):
self.budget = budget
self.system = system_prompt
self.client = anthropic.Anthropic()
self.history = []
def send(self, user_message: str) -> str:
        import tiktoken  # pre-send estimate; cl100k_base is not Claude's tokenizer, but close enough for budgeting
        enc = tiktoken.get_encoding("cl100k_base")
# Estimate input size
history_tokens = sum(len(enc.encode(m["content"])) + 4 for m in self.history)
new_tokens = len(enc.encode(user_message)) + 4
system_tokens = len(enc.encode(self.system))
estimated_input = history_tokens + new_tokens + system_tokens
# Budget gate
self.budget.check_request(estimated_input)
self.history.append({"role": "user", "content": user_message})
response = self.client.messages.create(
model="claude-sonnet-4-5",
max_tokens=self.budget.max_output_tokens,
system=self.system,
messages=self.history
)
output = response.content[0].text
self.budget.record_usage(
response.usage.input_tokens,
response.usage.output_tokens
)
self.history.append({"role": "assistant", "content": output})
print(f"[Budget] Session: {self.budget.session_tokens_used:,} / "
f"{self.budget.max_session_tokens:,} tokens used "
f"({self.budget.session_tokens_used/self.budget.max_session_tokens*100:.1f}%)")
return output
budget = TokenBudget(
max_input_tokens=30_000,
max_output_tokens=2_048,
max_session_tokens=100_000
)
agent = BudgetedAgent(budget, "You are a QA engineer assistant. Be concise.")
response = agent.send("List the test cases needed for a login feature with MFA.")
Tip: For product managers, define token budgets as part of the agent's acceptance criteria, not as an afterthought. Specify in your user stories: "The code review agent must complete a full PR review in under 15,000 tokens." This creates a measurable, testable definition of efficiency and prevents scope creep in agent design.
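The third tier from the list above, the per-workflow budget, follows the same pattern one level up: a single counter shared by every agent session the workflow spawns. A minimal sketch (WorkflowBudget is hypothetical, not part of the TokenBudget class above):
from dataclasses import dataclass

@dataclass
class WorkflowBudget:
    """Tier 3: caps total tokens across all sessions in one business workflow."""
    max_workflow_tokens: int
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        if self.tokens_used + tokens > self.max_workflow_tokens:
            raise RuntimeError(
                f"Workflow budget exhausted: "
                f"{self.tokens_used + tokens:,} > {self.max_workflow_tokens:,}"
            )
        self.tokens_used += tokens

# "Analyze this sprint" might spawn several sessions that share one cap
workflow = WorkflowBudget(max_workflow_tokens=500_000)
workflow.charge(120_000)  # session 1: code review agent
workflow.charge(80_000)   # session 2: test generation agent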
Prompt Caching Economics — The Input Token Multiplier
Prompt caching is the single most impactful optimization for input token costs in multi-turn agentic workflows. Both Anthropic and OpenAI offer it, though the mechanics differ slightly.
How Anthropic prompt caching works:
Cache breakpoints are explicitly marked in the API request. Content before a cache_control: {"type": "ephemeral"} breakpoint is cached for five minutes by default (a longer one-hour TTL is also available at a higher write price). On a cache hit, you pay ~10% of the normal input price for those tokens; writing to the cache costs 125% of the normal input price.
import anthropic
client = anthropic.Anthropic()
system_prompt = """[2,000 tokens of detailed instructions,
tool usage guidelines, output formats, examples...]"""
tools = [
# 8 tools × ~150 tokens each = 1,200 tokens of tool schemas
]
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
system=[
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"} # Mark for caching
}
],
messages=[
{"role": "user", "content": "Analyze this test failure log: [...]"}
]
)
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}") # Paid at 10%
print(f"Cache creation: {usage.cache_creation_input_tokens}") # Paid at 125%
print(f"Output tokens: {usage.output_tokens}")
Caching ROI calculation:
def cache_roi(
stable_tokens: int, # tokens in the cached portion
requests: int, # total requests in the session/day
input_price: float = 3.00, # $/M tokens
cache_write_price: float = 3.75, # 125% of input
cache_read_price: float = 0.30, # 10% of input
) -> None:
# Without caching: pay full price every time
no_cache_cost = requests * stable_tokens / 1e6 * input_price
# With caching: pay write once, read at 10% for rest
cache_cost = (stable_tokens / 1e6 * cache_write_price # first request
+ (requests - 1) * stable_tokens / 1e6 * cache_read_price)
savings = no_cache_cost - cache_cost
print(f"Without caching: ${no_cache_cost:.4f}")
print(f"With caching: ${cache_cost:.4f}")
print(f"Savings: ${savings:.4f} ({savings/no_cache_cost*100:.1f}%)")
cache_roi(stable_tokens=3_200, requests=50)
An ~88% reduction in input costs for the stable portion is achievable with prompt caching. Over a month (50 req/day × 30 days = 1,500 requests, with one cache write per day), system prompt costs drop from $14.40 to about $1.77.
Tip: Always cache your system prompt and tool definitions in production agentic systems. Place cache breakpoints at the end of stable content sections. If your system prompt is updated infrequently (weekly or less), the amortized cache-write overhead is negligible, and the read savings compound dramatically at scale.
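Tool definitions can be cached the same way. In the Anthropic API, a cache_control marker on the last tool caches the entire run of tool schemas before it; the sketch below uses placeholder schemas and a stub system prompt.
import anthropic

client = anthropic.Anthropic()
system_prompt = "[stable instructions...]"

tools = [
    {"name": "read_file", "description": "Read a file from the repo",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "run_tests", "description": "Run the test suite",
     "input_schema": {"type": "object", "properties": {}},
     # A breakpoint on the last tool caches every tool definition before it
     "cache_control": {"type": "ephemeral"}},
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}  # second breakpoint: system prompt
    }],
    messages=[{"role": "user", "content": "Review this diff: [...]"}]
)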
Extended Thinking and Reasoning Tokens — The Premium Budget Item
Models with extended reasoning capabilities (Claude's "extended thinking" mode, OpenAI's o1/o3/o4 series) introduce a new token category: thinking tokens or reasoning tokens. These are tokens the model generates internally as part of its chain-of-thought process before producing the final visible response.
What makes reasoning tokens different:
- They are billed at the same or higher rate as output tokens.
- They are not visible in the response by default (in most implementations).
- They can be dramatically larger than the final output — it is common for a model to generate 10,000+ thinking tokens to produce a 200-token answer.
- They can be capped (for example via a thinking budget), but constraining them too aggressively degrades answer quality on hard problems.
When reasoning tokens are worth the cost:
- Complex multi-step problems with many constraints (system design, security analysis)
- Tasks where accuracy is more valuable than cost (financial analysis, production incident diagnosis)
- Problems where the naive approach gives wrong answers (algorithmic reasoning, logical deductions)
When to avoid reasoning tokens:
- Routine, templated tasks (generating test names, formatting output)
- High-volume, low-complexity tasks (summarizing short documents, classifying items)
- Any task where a non-reasoning model achieves acceptable quality
import anthropic
client = anthropic.Anthropic()
def run_with_thinking(prompt: str, budget_tokens: int = 8000) -> dict:
"""
Run with extended thinking. budget_tokens controls the thinking budget.
Lower budget = faster + cheaper but potentially lower quality on hard problems.
"""
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": budget_tokens # Cap thinking tokens
},
messages=[{"role": "user", "content": prompt}]
)
thinking_tokens = sum(
len(block.thinking) // 4 # approximate
for block in response.content
if block.type == "thinking"
)
    return {
        "response": next(b.text for b in response.content if b.type == "text"),
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,  # billed output includes thinking tokens
        "thinking_tokens_estimate": thinking_tokens,
        "thinking_budget": budget_tokens,
    }
result = run_with_thinking(
"Design a token-efficient caching strategy for an agent that processes 1,000 "
"customer support tickets per hour. Consider: cache invalidation, cost vs. "
"freshness trade-offs, and multi-tenant isolation.",
budget_tokens=5000 # reasonable for a complex design question
)
Tip: Set the thinking budget token limit (budget_tokens for Claude, the reasoning-effort setting for OpenAI's o-series) based on task complexity, not habit. A simple summarization task does not need 10,000 thinking tokens. A tier-1 production incident analysis might warrant it. Creating a two-tier routing policy — standard model for routine tasks, reasoning model with capped budget for complex tasks — can cut costs by 40–70% on mixed-complexity workloads.
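A sketch of that two-tier routing policy, with a deliberately naive classify_complexity heuristic standing in for whatever signal you trust (task metadata, input size, or a cheap classifier call):
import anthropic

client = anthropic.Anthropic()

def classify_complexity(task: str) -> str:
    # Naive stand-in; replace with task metadata or a cheap classifier
    return "complex" if len(task) > 2_000 or "design" in task.lower() else "routine"

def route(task: str) -> str:
    if classify_complexity(task) == "routine":
        response = client.messages.create(
            model="claude-haiku-3-5",  # standard model, no thinking tokens
            max_tokens=1_024,
            messages=[{"role": "user", "content": task}]
        )
    else:
        response = client.messages.create(
            model="claude-opus-4-5",  # reasoning model with a capped thinking budget
            max_tokens=16_000,
            thinking={"type": "enabled", "budget_tokens": 5_000},
            messages=[{"role": "user", "content": task}]
        )
    return next(b.text for b in response.content if b.type == "text")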
Token Budget Governance for Teams
Individual engineers optimizing their own agents is insufficient at scale. When a team of 10 engineers is building multiple agentic features, you need organizational token governance.
Key governance policies:
1. Budget-as-code — Token limits are defined in configuration, not hardcoded:
agents:
code_reviewer:
max_input_tokens: 40000
max_output_tokens: 2048
max_session_tokens: 150000
model: claude-sonnet-4-5
cache_system_prompt: true
test_generator:
max_input_tokens: 20000
max_output_tokens: 4096
max_session_tokens: 80000
model: claude-haiku-3-5
cache_system_prompt: true
architecture_advisor:
max_input_tokens: 60000
max_output_tokens: 8192
max_session_tokens: 300000
model: claude-opus-4-5
thinking_budget: 8000
cache_system_prompt: true
2. Token cost SLOs — Define acceptable cost per unit of work:
- Code review: ≤ $0.05 per PR
- Test case generation: ≤ $0.02 per user story
- Sprint retrospective summary: ≤ $0.10 per sprint
3. Cost attribution — Tag API calls with team, feature, and environment labels so cost can be attributed in billing dashboards.
4. Budget alerts — Configure alerts at 50%, 80%, and 100% of monthly budget thresholds.
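Policies 3 and 4 can be prototyped at the call site before billing dashboards exist. A minimal sketch, assuming hypothetical team/feature/env labels and a print-based alert hook:
from dataclasses import dataclass, field

@dataclass
class CostTracker:
    """Attributes cost to team/feature/env labels and fires threshold alerts."""
    monthly_budget_usd: float
    spent_usd: float = 0.0
    fired: set = field(default_factory=set)

    def record(self, team: str, feature: str, env: str, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        print(f"[cost] team={team} feature={feature} env={env} ${cost_usd:.4f}")
        for threshold in (0.5, 0.8, 1.0):
            if self.spent_usd >= threshold * self.monthly_budget_usd and threshold not in self.fired:
                self.fired.add(threshold)
                print(f"[alert] {threshold:.0%} of monthly budget reached (${self.spent_usd:.2f})")

tracker = CostTracker(monthly_budget_usd=50.0)
tracker.record(team="platform", feature="code_reviewer", env="prod", cost_usd=0.0214)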
Tip: For QA engineers, treat token budget adherence as a non-functional requirement and include token consumption assertions in your agent test suites. A test that verifies "this agent completes this task in under 8,000 tokens" prevents regression as prompts evolve. Use your token profiler (from Topic 1) as a test utility to run these assertions in CI.
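A budget-regression test might look like the sketch below; run_agent_task is a stub standing in for your own harness, which would replay a recorded fixture through the agent and return its usage stats.
# test_token_budgets.py
def run_agent_task(agent_name: str, fixture: str) -> dict:
    # Stub for illustration; replace with a call into your agent harness
    return {"input_tokens": 5_200, "output_tokens": 900}

def test_code_reviewer_stays_under_budget():
    usage = run_agent_task("code_reviewer", fixture="small_pr.json")
    total = usage["input_tokens"] + usage["output_tokens"]
    assert total < 8_000, f"Token budget regression: {total:,} tokens used"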