Token budgets are the operational controls that prevent agentic systems from becoming financial liabilities. Without them, a single runaway agent loop, a misconfigured prompt, or an unexpected spike in usage can generate costs that dwarf the entire planned monthly budget. With them, you have predictable costs, early warning systems, and hard stops that protect both your budget and your users' trust.
This topic covers how to design and implement token budgets at three granularities — per-task, per-session, and per-project — and how to build guardrails that enforce those budgets without degrading the user experience.
The Three Budget Granularities
Token budgets operate at different scopes, each serving a different purpose:
Per-task budgets cap the tokens consumed by a single agent invocation. They prevent infinite loops, excessively verbose outputs, and runaway tool call chains. This is your most granular and most operationally critical control.
Per-session budgets cap the tokens consumed across an interactive session (e.g., a user's conversation with a coding assistant). They prevent users from inadvertently consuming disproportionate resources, and they force good context hygiene.
Per-project budgets cap the total tokens consumed by a feature, team, or use case over a billing period (day/week/month). They are your organizational budget control — the mechanism that makes token spend a managed cost rather than an open liability.
These granularities stack: a per-task budget catches immediate runaway, a per-session budget catches patterns of overuse, and a per-project budget enforces organizational resource allocation.
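The stacking can be sketched as a single admission check that a request must clear at every scope before it runs (the function and its parameters are illustrative, not from any library):

```python
def request_allowed(
    task_tokens: int, task_limit: int,
    session_tokens: int, session_limit: int,
    project_tokens: int, project_limit: int,
) -> tuple:
    """A request proceeds only if all three budget scopes have headroom.
    Returns (allowed, scope_that_blocked_or_'ok')."""
    for used, limit, scope in [
        (task_tokens, task_limit, "task"),
        (session_tokens, session_limit, "session"),
        (project_tokens, project_limit, "project"),
    ]:
        if used >= limit:
            return False, scope
    return True, "ok"
```

The ordering matters operationally: checking the narrowest scope first means a runaway task is reported as a task problem, not misattributed to the project budget.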
Tip: Start by setting per-task budgets first. They're the easiest to reason about, immediately reduce catastrophic risk, and force you to think precisely about what each task actually needs. Once per-task budgets are in place, per-session and per-project limits become straightforward to calibrate from real data.
Designing Per-Task Token Budgets
A per-task token budget has two components: a soft limit (a warning threshold) and a hard limit (a stop condition). Setting them correctly requires understanding the task's normal token consumption profile.
Step 1: Baseline Your Tasks
Before setting budgets, measure actual consumption for each task type. Run 50–100 representative examples and record:
```
Task: "Generate unit tests for a Python function"

Sample measurements (input + output tokens):
  P50 (median):  2,800 tokens
  P75:           3,900 tokens
  P90:           5,200 tokens
  P95:           6,800 tokens
  P99:           9,400 tokens
  Max observed: 14,200 tokens (runaway verbosity)

Budget recommendation:
  Soft limit: P90 × 1.2 =  6,240 tokens (warn but continue)
  Hard limit: P99 × 1.5 = 14,100 tokens (stop with graceful message)
```
Setting soft limits at P90 × 1.2 gives a 20% buffer above typical peak consumption, flagging outliers without constant false positives. Setting hard limits at P99 × 1.5 stops genuinely runaway cases while letting legitimate edge cases through.
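Once you have logged per-task token totals, the recommendation can be computed directly from the sample. A minimal sketch using only the standard library (`recommend_limits` is a name invented here):

```python
import statistics


def recommend_limits(sample_token_totals: list) -> tuple:
    """Derive (soft_limit, hard_limit) from measured per-task token totals
    using the P90 × 1.2 / P99 × 1.5 rule described above."""
    # quantiles(n=100) returns the 99 percentile cut points P1..P99
    qs = statistics.quantiles(sample_token_totals, n=100)
    p90, p99 = qs[89], qs[98]
    return int(p90 * 1.2), int(p99 * 1.5)
```

Run it against your baseline measurements per task type; with fewer than ~50 samples the P99 estimate is noisy, so prefer the larger end of the 50–100 sample range.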
Step 2: Implement Budget Enforcement
```python
import anthropic
from dataclasses import dataclass, field


class BudgetExceededError(Exception):
    pass


@dataclass
class TaskBudget:
    soft_limit_tokens: int
    hard_limit_tokens: int
    tokens_used: int = field(default=0)
    soft_limit_hit: bool = field(default=False)
    hard_limit_hit: bool = field(default=False)

    def record_usage(self, input_tokens: int, output_tokens: int) -> None:
        self.tokens_used += input_tokens + output_tokens

        if self.tokens_used >= self.soft_limit_tokens and not self.soft_limit_hit:
            self.soft_limit_hit = True
            print(f"[BUDGET WARNING] Soft limit reached: "
                  f"{self.tokens_used}/{self.soft_limit_tokens}")

        if self.tokens_used >= self.hard_limit_tokens:
            self.hard_limit_hit = True
            raise BudgetExceededError(
                f"Hard limit exceeded: {self.tokens_used}/{self.hard_limit_tokens} tokens"
            )


TASK_BUDGETS = {
    "unit_test_generation": TaskBudget(soft_limit_tokens=6_240, hard_limit_tokens=14_100),
    "code_review": TaskBudget(soft_limit_tokens=4_800, hard_limit_tokens=10_000),
    "bug_analysis": TaskBudget(soft_limit_tokens=8_000, hard_limit_tokens=16_000),
    "documentation_generation": TaskBudget(soft_limit_tokens=5_000, hard_limit_tokens=12_000),
    "default": TaskBudget(soft_limit_tokens=5_000, hard_limit_tokens=15_000),
}


async def run_task_with_budget(
    task_type: str,
    prompt: str,
    model: str = "claude-3-5-haiku-20241022",
) -> str:
    # Copy the limits into a fresh TaskBudget so usage doesn't accumulate
    # across invocations through the shared template objects.
    template = TASK_BUDGETS.get(task_type, TASK_BUDGETS["default"])
    budget = TaskBudget(template.soft_limit_tokens, template.hard_limit_tokens)

    client = anthropic.AsyncAnthropic()
    messages = [{"role": "user", "content": prompt}]

    # Cap max_tokens so output alone can't blow the hard limit.
    # Rough input estimate: ~1.3 tokens per whitespace-separated word.
    estimated_input = int(len(prompt.split()) * 1.3)
    max_output_tokens = max(1, min(4096, budget.hard_limit_tokens - estimated_input))

    try:
        response = await client.messages.create(
            model=model,
            max_tokens=max_output_tokens,
            messages=messages,
        )
        budget.record_usage(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )
        return response.content[0].text
    except BudgetExceededError as e:
        return f"Task stopped: {e}. Partial results may be incomplete."
```
Step 3: The max_tokens Parameter as a Hard Stop
The most reliable per-task guardrail is the API-level max_tokens parameter. Setting this correctly prevents the model from generating beyond your budget regardless of application logic:
```python
# Reserve room for output under the hard limit
available_output_tokens = hard_limit - estimated_input_tokens

response = client.messages.create(
    model=model,
    max_tokens=1500,  # Hard stop at API level
    messages=messages,
)
```
Always set max_tokens — never rely on the model to self-limit.
Tip: When a hard limit is hit, return a structured partial response rather than an error message. For example: "Analysis complete for the first 3 functions. Budget limit reached before completing functions 4–7. To analyze those, start a new task." This preserves user trust and gives a clear path forward.
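A structured partial response like that can be assembled with a small helper. The function below is illustrative (not from any SDK); it assumes you track which work items completed before the budget fired:

```python
def partial_result_message(completed: list, remaining: list) -> str:
    """Turn a budget stop into an actionable message instead of a bare error."""
    msg = f"Analysis complete for {len(completed)} item(s): {', '.join(completed)}."
    if remaining:
        msg += (
            f" Budget limit reached before completing {len(remaining)} item(s): "
            f"{', '.join(remaining)}. Start a new task to analyze those."
        )
    return msg
```

Catching `BudgetExceededError` and returning this message, rather than re-raising, is what makes the limit feel like a checkpoint instead of a failure.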
Designing Per-Session Token Budgets
Sessions are interactive contexts where a user (or automated workflow) accumulates conversation history across multiple turns. Per-session budgets require tracking cumulative usage and implementing context management strategies before the limit is hit.
Session Budget Architecture
```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionState:
    session_id: str
    user_id: str
    started_at: float = field(default_factory=time.time)

    # Token tracking
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    turn_count: int = 0

    # Budget limits (rough cost at Claude 3.5 Haiku pricing; actual cost
    # depends on the input/output mix)
    soft_limit_tokens: int = 50_000    # ~$0.15 per session
    hard_limit_tokens: int = 100_000   # ~$0.30 per session

    # Context management
    conversation_history: List[Dict] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return self.total_input_tokens + self.total_output_tokens

    @property
    def remaining_budget(self) -> int:
        return self.hard_limit_tokens - self.total_tokens

    @property
    def budget_percentage_used(self) -> float:
        return (self.total_tokens / self.hard_limit_tokens) * 100

    def record_turn(self, input_tokens: int, output_tokens: int) -> None:
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.turn_count += 1

    def should_compress_context(self) -> bool:
        """Trigger context compression when approaching the soft limit."""
        return self.budget_percentage_used >= 60

    def is_near_hard_limit(self) -> bool:
        return self.budget_percentage_used >= 85

    def is_over_hard_limit(self) -> bool:
        return self.total_tokens >= self.hard_limit_tokens
```
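The thresholds above (compress at 60%, warn near 85%, stop at 100%) give each session a three-phase lifecycle. A minimal sketch of just the phase logic, using the same cut points:

```python
def session_phase(total_tokens: int, hard_limit_tokens: int = 100_000) -> str:
    """Map cumulative session usage onto the SessionState thresholds above."""
    if total_tokens >= hard_limit_tokens:
        return "over_limit"   # refuse further turns, prompt a new session
    pct = total_tokens / hard_limit_tokens * 100
    if pct >= 85:
        return "near_limit"   # warn the user, suggest wrapping up
    if pct >= 60:
        return "compress"     # summarize older turns to free budget
    return "ok"
```

Driving UI and context-management decisions off a single function like this keeps the thresholds consistent everywhere they're checked.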
Context Compression at Budget Thresholds
When a session approaches its soft limit, compress conversation history rather than terminating:
```python
import anthropic


def format_turns(turns: list) -> str:
    """Render conversation turns as plain text for the summarizer."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in turns)


async def compress_session_context(
    session: SessionState,
    client: anthropic.AsyncAnthropic,
) -> None:
    """
    Summarize older conversation turns to free up context budget.
    Keep the most recent N turns verbatim, summarize the rest.
    """
    RECENT_TURNS_TO_KEEP = 4

    if len(session.conversation_history) <= RECENT_TURNS_TO_KEEP:
        return  # Nothing to compress

    older_turns = session.conversation_history[:-RECENT_TURNS_TO_KEEP]
    recent_turns = session.conversation_history[-RECENT_TURNS_TO_KEEP:]

    # Summarize older turns
    summary_prompt = f"""Summarize the following conversation history concisely.
Focus on: key decisions made, important context established,
and any unresolved questions.
Keep the summary under 200 words.

Conversation:
{format_turns(older_turns)}"""

    summary_response = await client.messages.create(
        model="claude-3-haiku-20240307",  # Use a cheap model for summarization
        max_tokens=300,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    summary = summary_response.content[0].text

    # Replace history with summary + recent turns. The Messages API only
    # accepts "user" and "assistant" roles in the messages list, so the
    # summary goes in as a user turn.
    session.conversation_history = [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"}
    ] + recent_turns

    print(f"[SESSION] Compressed {len(older_turns)} turns into a summary. "
          f"Budget: {session.budget_percentage_used:.1f}% used")
```
Tip: Communicate session budget status to users proactively. A subtle indicator like "Session: 45% of budget used" helps users self-regulate and makes budget limits feel like a feature rather than a surprise wall. Users who understand their budget make better choices about when to start new sessions.
Designing Per-Project Token Budgets
Per-project (or per-team) budgets operate at the organizational level. They ensure that the sum of all sessions and tasks within a project stays within allocated limits over a billing period.
Budget Allocation Model
Monthly Token Budget Allocation Example:
```
Total monthly AI budget: $500

Project allocations:
  CI/CD automation agent:     $150/month (30%)
  Developer coding assistant: $200/month (40%)
  QA test generation:         $100/month (20%)
  PM requirements drafting:    $50/month (10%)

Per-project daily rolling limits (monthly ÷ 30 days):
  CI/CD:         $5.00/day
  Dev assistant: $6.67/day
  QA:            $3.33/day
  PM:            $1.67/day
```
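The daily figures are just each monthly allocation spread over 30 days, which is worth computing programmatically so the two never drift apart (the function name and project keys here are illustrative):

```python
def daily_limits(monthly_allocations: dict, days: int = 30) -> dict:
    """Spread each project's monthly budget over a rolling daily limit."""
    return {name: round(amount / days, 2)
            for name, amount in monthly_allocations.items()}


allocations = {
    "ci_cd_agent": 150.0,
    "dev_assistant": 200.0,
    "qa_test_gen": 100.0,
    "pm_assistant": 50.0,
}
limits = daily_limits(allocations)
```

Deriving daily limits from a single source of truth means a budget change is one edit, not four.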
Budget Enforcement with Redis-Backed Rate Limiting
```python
import redis
from datetime import datetime, timezone


class ProjectBudgetEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

        # Monthly and daily budget limits (in USD)
        self.project_limits = {
            "ci_cd_agent": {"monthly": 150.0, "daily": 5.0},
            "dev_assistant": {"monthly": 200.0, "daily": 6.67},
            "qa_test_gen": {"monthly": 100.0, "daily": 3.33},
            "pm_assistant": {"monthly": 50.0, "daily": 1.67},
        }

        # Cost per token (USD), Claude 3.5 Haiku pricing
        self.cost_per_input_token = 0.0000008
        self.cost_per_output_token = 0.000004

    def calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.cost_per_input_token +
                output_tokens * self.cost_per_output_token)

    def get_budget_key(self, project_id: str, period: str) -> str:
        now = datetime.now(timezone.utc)
        if period == "daily":
            return f"budget:{project_id}:daily:{now.strftime('%Y-%m-%d')}"
        if period == "monthly":
            return f"budget:{project_id}:monthly:{now.strftime('%Y-%m')}"
        raise ValueError(f"Unknown budget period: {period}")

    def check_and_record_spend(
        self,
        project_id: str,
        input_tokens: int,
        output_tokens: int,
    ) -> dict:
        cost = self.calculate_cost(input_tokens, output_tokens)
        limits = self.project_limits.get(project_id, {"monthly": 10.0, "daily": 0.33})

        daily_key = self.get_budget_key(project_id, "daily")
        monthly_key = self.get_budget_key(project_id, "monthly")

        # Increment both counters in one pipeline round trip; if a limit
        # was crossed, roll the increment back below.
        pipe = self.redis.pipeline()
        pipe.incrbyfloat(daily_key, cost)
        pipe.expire(daily_key, 86400 * 2)     # Expire after 2 days
        pipe.incrbyfloat(monthly_key, cost)
        pipe.expire(monthly_key, 86400 * 35)  # Expire after 35 days
        results = pipe.execute()

        daily_spend = float(results[0])
        monthly_spend = float(results[2])

        status = {
            "allowed": True,
            "cost": cost,
            "daily_spend": daily_spend,
            "monthly_spend": monthly_spend,
            "daily_limit": limits["daily"],
            "monthly_limit": limits["monthly"],
            "daily_percent": (daily_spend / limits["daily"]) * 100,
            "monthly_percent": (monthly_spend / limits["monthly"]) * 100,
        }

        # Hard limit: roll back the increment if a limit was exceeded
        if daily_spend > limits["daily"]:
            self.redis.incrbyfloat(daily_key, -cost)
            status["allowed"] = False
            status["reason"] = "daily_limit_exceeded"
        elif monthly_spend > limits["monthly"]:
            self.redis.incrbyfloat(monthly_key, -cost)
            status["allowed"] = False
            status["reason"] = "monthly_limit_exceeded"

        return status
```
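The increment-then-rollback pattern is easier to see without a live Redis server. This in-memory stand-in (illustrative only, single-process, daily limit only) mirrors the same control flow:

```python
class InMemoryDailyBudget:
    """Toy stand-in for the Redis daily counter: increment first,
    then roll back if the limit was crossed."""

    def __init__(self, daily_limit: float):
        self.daily_limit = daily_limit
        self.daily_spend = 0.0

    def check_and_record(self, cost: float) -> bool:
        self.daily_spend += cost
        if self.daily_spend > self.daily_limit:
            self.daily_spend -= cost  # mirrors incrbyfloat(key, -cost)
            return False
        return True
```

Note the tradeoff the real implementation shares: between the increment and the rollback, concurrent callers briefly see inflated spend. That can cause a spurious rejection near the limit, which is usually the safe direction to fail in.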
Tip: Set your monthly limits at 80% of your actual budget, reserving 20% as a buffer for unexpected spikes. Then treat the 80% limit as your "real" budget in all planning and alerting. This buffer has saved teams from going over budget more often than any other single practice.
Guardrail Patterns for Common Failure Modes
Infinite Loop Detection
Agentic loops can run indefinitely if termination conditions are poorly defined:
```python
class LoopGuardrailError(Exception):
    pass


class LoopGuardrail:
    def __init__(self, max_steps: int = 20, max_tokens: int = 50_000):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps_taken = 0
        self.tokens_used = 0

    def check(self, step_tokens: int) -> None:
        self.steps_taken += 1
        self.tokens_used += step_tokens

        if self.steps_taken >= self.max_steps:
            raise LoopGuardrailError(
                f"Max steps ({self.max_steps}) exceeded. "
                "Agent may be in an infinite loop."
            )
        if self.tokens_used >= self.max_tokens:
            raise LoopGuardrailError(
                f"Max tokens ({self.max_tokens}) exceeded "
                f"across {self.steps_taken} steps."
            )
```
Output Length Guardrails
Some tasks generate outputs that are longer than useful. Add output length validation:
```python
def validate_output_length(response_text: str, task_type: str) -> str:
    MAX_LENGTHS = {  # in tokens
        "code_review": 3000,
        "unit_tests": 5000,
        "summary": 500,
        "classification": 100,
    }
    max_len = MAX_LENGTHS.get(task_type, 2000)
    max_chars = max_len * 4  # rough heuristic: ~4 characters per token

    if len(response_text) > max_chars:
        # Truncate with notice
        truncated = response_text[:max_chars]
        return truncated + (f"\n\n[Output truncated at roughly {max_len} tokens. "
                            "Full analysis available on request.]")
    return response_text
```
Tip: Treat guardrail triggers as signals, not just stops. Log every guardrail activation with the task context and token count. A guardrail that fires more than 5% of the time is a signal that your budget or prompt design needs adjustment — either the budget is too tight, or the prompt is inducing unnecessarily verbose responses.
Budget Visualization and Reporting
Budgets are only useful if stakeholders can see them. Build simple reporting into your budget system:
```python
from datetime import datetime, timezone


def generate_budget_report(project_id: str, enforcer: ProjectBudgetEnforcer) -> str:
    daily_key = enforcer.get_budget_key(project_id, "daily")
    monthly_key = enforcer.get_budget_key(project_id, "monthly")

    daily_spend = float(enforcer.redis.get(daily_key) or 0)
    monthly_spend = float(enforcer.redis.get(monthly_key) or 0)
    limits = enforcer.project_limits[project_id]

    def status(spend: float, limit: float) -> str:
        if spend >= limit:
            return "EXCEEDED"
        return "WARNING" if spend >= limit * 0.8 else "OK"

    return f"""
Budget Report: {project_id}
Generated: {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}

Daily Budget:
  Spent:     ${daily_spend:.4f} of ${limits['daily']:.2f}
  Remaining: ${limits['daily'] - daily_spend:.4f}
  Used:      {(daily_spend / limits['daily'] * 100):.1f}%
  Status:    {status(daily_spend, limits['daily'])}

Monthly Budget:
  Spent:     ${monthly_spend:.4f} of ${limits['monthly']:.2f}
  Remaining: ${limits['monthly'] - monthly_spend:.4f}
  Used:      {(monthly_spend / limits['monthly'] * 100):.1f}%
  Status:    {status(monthly_spend, limits['monthly'])}
"""
```
Tip: Schedule daily budget summary emails or Slack messages to team leads, showing each project's spend-to-date vs. budget. Teams that see their budget consumption daily adjust behavior naturally and rarely hit hard limits. Teams that only see it on the invoice get surprised every month.
Summary
Effective token budgeting requires three layers: per-task hard limits enforced at the API level via max_tokens, per-session compression and soft limits that degrade gracefully, and per-project budget enforcement with Redis-backed counters that prevent organizational overspend. Guardrails against infinite loops and output verbosity catch the failure modes budgets alone cannot prevent. Visibility — through regular reporting and proactive user communication — transforms budget controls from emergency brakes into team-wide cost culture.