Token budgets are the operational controls that prevent agentic systems from becoming financial liabilities. Without them, a single runaway agent loop, a misconfigured prompt, or an unexpected spike in usage can generate costs that dwarf the entire planned monthly budget. With them, you have predictable costs, early warning systems, and hard stops that protect both your budget and your users' trust.
This topic covers how to design and implement token budgets at three granularities — per-task, per-session, and per-project — and how to build guardrails that enforce those budgets without degrading the user experience.
The Three Budget Granularities
Token budgets operate at different scopes, each serving a different purpose:
Per-task budgets cap the tokens consumed by a single agent invocation. They prevent infinite loops, excessively verbose outputs, and runaway tool call chains. This is your most granular and most operationally critical control.
Per-session budgets cap the tokens consumed across an interactive session (e.g., a user's conversation with a coding assistant). They prevent users from inadvertently consuming disproportionate resources, and they force good context hygiene.
Per-project budgets cap the total tokens consumed by a feature, team, or use case over a billing period (day/week/month). They are your organizational budget control — the mechanism that makes token spend a managed cost rather than an open liability.
These granularities stack: a per-task budget catches immediate runaway, a per-session budget catches patterns of overuse, and a per-project budget enforces organizational resource allocation.
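The stacking can be sketched as a single admission check that a request must clear at every scope before it runs (the function and its parameters are illustrative, not from any library):

```python
def request_allowed(
    task_tokens: int, task_limit: int,
    session_tokens: int, session_limit: int,
    project_tokens: int, project_limit: int,
) -> tuple:
    """A request proceeds only if all three budget scopes have headroom.
    Returns (allowed, scope_that_blocked_or_'ok')."""
    for used, limit, scope in [
        (task_tokens, task_limit, "task"),
        (session_tokens, session_limit, "session"),
        (project_tokens, project_limit, "project"),
    ]:
        if used >= limit:
            return False, scope
    return True, "ok"
```

The ordering matters operationally: checking the narrowest scope first means a runaway task is reported as a task problem, not misattributed to the project budget.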
Tip: Start by setting per-task budgets first. They're the easiest to reason about, immediately reduce catastrophic risk, and force you to think precisely about what each task actually needs. Once per-task budgets are in place, per-session and per-project limits become straightforward to calibrate from real data.
Designing Per-Task Token Budgets
A per-task token budget has two components: a soft limit (a warning threshold) and a hard limit (a stop condition). Setting them correctly requires understanding the task's normal token consumption profile.
Step 1: Baseline Your Tasks
Before setting budgets, measure actual consumption for each task type. Run 50–100 representative examples and record:
```
Task: "Generate unit tests for a Python function"

Sample measurements (input + output tokens):
  P50 (median):  2,800 tokens
  P75:           3,900 tokens
  P90:           5,200 tokens
  P95:           6,800 tokens
  P99:           9,400 tokens
  Max observed: 14,200 tokens (runaway verbosity)

Budget recommendation:
  Soft limit: P90 × 1.2 =  6,240 tokens (warn but continue)
  Hard limit: P99 × 1.5 = 14,100 tokens (stop with graceful message)
```
Setting soft limits at P90 × 1.2 gives a 20% buffer above typical peak consumption, flagging outliers without constant false positives. Setting hard limits at P99 × 1.5 stops genuinely runaway cases while letting legitimate edge cases through.
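Once you have logged per-task token totals, the recommendation can be computed directly from the sample. A minimal sketch using only the standard library (`recommend_limits` is a name invented here):

```python
import statistics


def recommend_limits(sample_token_totals: list) -> tuple:
    """Derive (soft_limit, hard_limit) from measured per-task token totals
    using the P90 × 1.2 / P99 × 1.5 rule described above."""
    # quantiles(n=100) returns the 99 percentile cut points P1..P99
    qs = statistics.quantiles(sample_token_totals, n=100)
    p90, p99 = qs[89], qs[98]
    return int(p90 * 1.2), int(p99 * 1.5)
```

Run it against your baseline measurements per task type; with fewer than ~50 samples the P99 estimate is noisy, so prefer the larger end of the 50–100 sample range.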
Step 2: Implement Budget Enforcement
```python
import anthropic
from dataclasses import dataclass, field


class BudgetExceededError(Exception):
    pass


@dataclass
class TaskBudget:
    soft_limit_tokens: int
    hard_limit_tokens: int
    tokens_used: int = field(default=0)
    soft_limit_hit: bool = field(default=False)
    hard_limit_hit: bool = field(default=False)

    def record_usage(self, input_tokens: int, output_tokens: int) -> None:
        self.tokens_used += input_tokens + output_tokens

        if self.tokens_used >= self.soft_limit_tokens and not self.soft_limit_hit:
            self.soft_limit_hit = True
            print(f"[BUDGET WARNING] Soft limit reached: "
                  f"{self.tokens_used}/{self.soft_limit_tokens}")

        if self.tokens_used >= self.hard_limit_tokens:
            self.hard_limit_hit = True
            raise BudgetExceededError(
                f"Hard limit exceeded: {self.tokens_used}/{self.hard_limit_tokens} tokens"
            )


TASK_BUDGETS = {
    "unit_test_generation": TaskBudget(soft_limit_tokens=6_240, hard_limit_tokens=14_100),
    "code_review": TaskBudget(soft_limit_tokens=4_800, hard_limit_tokens=10_000),
    "bug_analysis": TaskBudget(soft_limit_tokens=8_000, hard_limit_tokens=16_000),
    "documentation_generation": TaskBudget(soft_limit_tokens=5_000, hard_limit_tokens=12_000),
    "default": TaskBudget(soft_limit_tokens=5_000, hard_limit_tokens=15_000),
}


async def run_task_with_budget(
    task_type: str,
    prompt: str,
    model: str = "claude-3-5-haiku-20241022",
) -> str:
    # Copy the limits into a fresh TaskBudget so usage doesn't accumulate
    # across invocations through the shared template objects.
    template = TASK_BUDGETS.get(task_type, TASK_BUDGETS["default"])
    budget = TaskBudget(template.soft_limit_tokens, template.hard_limit_tokens)

    client = anthropic.AsyncAnthropic()
    messages = [{"role": "user", "content": prompt}]

    # Cap max_tokens so output alone can't blow the hard limit.
    # Rough input estimate: ~1.3 tokens per whitespace-separated word.
    estimated_input = int(len(prompt.split()) * 1.3)
    max_output_tokens = max(1, min(4096, budget.hard_limit_tokens - estimated_input))

    try:
        response = await client.messages.create(
            model=model,
            max_tokens=max_output_tokens,
            messages=messages,
        )
        budget.record_usage(
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )
        return response.content[0].text
    except BudgetExceededError as e:
        return f"Task stopped: {e}. Partial results may be incomplete."
```
Step 3: The max_tokens Parameter as a Hard Stop
The most reliable per-task guardrail is the API-level max_tokens parameter. Setting this correctly prevents the model from generating beyond your budget regardless of application logic:
```python
# Reserve room for output under the hard limit
available_output_tokens = hard_limit - estimated_input_tokens

response = client.messages.create(
    model=model,
    max_tokens=1500,  # Hard stop at API level
    messages=messages,
)
```
Always set max_tokens — never rely on the model to self-limit.
Tip: When a hard limit is hit, return a structured partial response rather than an error message. For example: "Analysis complete for the first 3 functions. Budget limit reached before completing functions 4–7. To analyze those, start a new task." This preserves user trust and gives a clear path forward.
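A structured partial response like that can be assembled with a small helper. The function below is illustrative (not from any SDK); it assumes you track which work items completed before the budget fired:

```python
def partial_result_message(completed: list, remaining: list) -> str:
    """Turn a budget stop into an actionable message instead of a bare error."""
    msg = f"Analysis complete for {len(completed)} item(s): {', '.join(completed)}."
    if remaining:
        msg += (
            f" Budget limit reached before completing {len(remaining)} item(s): "
            f"{', '.join(remaining)}. Start a new task to analyze those."
        )
    return msg
```

Catching `BudgetExceededError` and returning this message, rather than re-raising, is what makes the limit feel like a checkpoint instead of a failure.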
Designing Per-Session Token Budgets
Sessions are interactive contexts where a user (or automated workflow) accumulates conversation history across multiple turns. Per-session budgets require tracking cumulative usage and implementing context management strategies before the limit is hit.
Session Budget Architecture
```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionState:
    session_id: str
    user_id: str
    started_at: float = field(default_factory=time.time)

    # Token tracking
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    turn_count: int = 0

    # Budget limits (rough cost at Claude 3.5 Haiku pricing; actual cost
    # depends on the input/output mix)
    soft_limit_tokens: int = 50_000    # ~$0.15 per session
    hard_limit_tokens: int = 100_000   # ~$0.30 per session

    # Context management
    conversation_history: List[Dict] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return self.total_input_tokens + self.total_output_tokens

    @property
    def remaining_budget(self) -> int:
        return self.hard_limit_tokens - self.total_tokens

    @property
    def budget_percentage_used(self) -> float:
        return (self.total_tokens / self.hard_limit_tokens) * 100

    def record_turn(self, input_tokens: int, output_tokens: int) -> None:
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.turn_count += 1

    def should_compress_context(self) -> bool:
        """Trigger context compression when approaching the soft limit."""
        return self.budget_percentage_used >= 60

    def is_near_hard_limit(self) -> bool:
        return self.budget_percentage_used >= 85

    def is_over_hard_limit(self) -> bool:
        return self.total_tokens >= self.hard_limit_tokens
```
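The thresholds above (compress at 60%, warn near 85%, stop at 100%) give each session a three-phase lifecycle. A minimal sketch of just the phase logic, using the same cut points:

```python
def session_phase(total_tokens: int, hard_limit_tokens: int = 100_000) -> str:
    """Map cumulative session usage onto the SessionState thresholds above."""
    if total_tokens >= hard_limit_tokens:
        return "over_limit"   # refuse further turns, prompt a new session
    pct = total_tokens / hard_limit_tokens * 100
    if pct >= 85:
        return "near_limit"   # warn the user, suggest wrapping up
    if pct >= 60:
        return "compress"     # summarize older turns to free budget
    return "ok"
```

Driving UI and context-management decisions off a single function like this keeps the thresholds consistent everywhere they're checked.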
Context Compression at Budget Thresholds
When a session approaches its soft limit, compress conversation history rather than terminating:
```python
import anthropic


def format_turns(turns: list) -> str:
    """Render conversation turns as plain text for the summarizer."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in turns)


async def compress_session_context(
    session: SessionState,
    client: anthropic.AsyncAnthropic,
) -> None:
    """
    Summarize older conversation turns to free up context budget.
    Keep the most recent N turns verbatim, summarize the rest.
    """
    RECENT_TURNS_TO_KEEP = 4

    if len(session.conversation_history) <= RECENT_TURNS_TO_KEEP:
        return  # Nothing to compress

    older_turns = session.conversation_history[:-RECENT_TURNS_TO_KEEP]
    recent_turns = session.conversation_history[-RECENT_TURNS_TO_KEEP:]

    # Summarize older turns
    summary_prompt = f"""Summarize the following conversation history concisely.
Focus on: key decisions made, important context established,
and any unresolved questions.
Keep the summary under 200 words.

Conversation:
{format_turns(older_turns)}"""

    summary_response = await client.messages.create(
        model="claude-3-haiku-20240307",  # Use a cheap model for summarization
        max_tokens=300,
        messages=[{"role": "user", "content": summary_prompt}],
    )
    summary = summary_response.content[0].text

    # Replace history with summary + recent turns. The Messages API only
    # accepts "user" and "assistant" roles in the messages list, so the
    # summary goes in as a user turn.
    session.conversation_history = [
        {"role": "user", "content": f"[Earlier conversation summary: {summary}]"}
    ] + recent_turns

    print(f"[SESSION] Compressed {len(older_turns)} turns into a summary. "
          f"Budget: {session.budget_percentage_used:.1f}% used")
```
Tip: Communicate session budget status to users proactively. A subtle indicator like "Session: 45% of budget used" helps users self-regulate and makes budget limits feel like a feature rather than a surprise wall. Users who understand their budget make better choices about when to start new sessions.
Designing Per-Project Token Budgets
Per-project (or per-team) budgets operate at the organizational level. They ensure that the sum of all sessions and tasks within a project stays within allocated limits over a billing period.
Budget Allocation Model
Monthly Token Budget Allocation Example:
```
Total monthly AI budget: $500

Project allocations:
  CI/CD automation agent:     $150/month (30%)
  Developer coding assistant: $200/month (40%)
  QA test generation:         $100/month (20%)
  PM requirements drafting:    $50/month (10%)

Per-project daily rolling limits (monthly ÷ 30 days):
  CI/CD:         $5.00/day
  Dev assistant: $6.67/day
  QA:            $3.33/day
  PM:            $1.67/day
```
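The daily figures are just each monthly allocation spread over 30 days, which is worth computing programmatically so the two never drift apart (the function name and project keys here are illustrative):

```python
def daily_limits(monthly_allocations: dict, days: int = 30) -> dict:
    """Spread each project's monthly budget over a rolling daily limit."""
    return {name: round(amount / days, 2)
            for name, amount in monthly_allocations.items()}


allocations = {
    "ci_cd_agent": 150.0,
    "dev_assistant": 200.0,
    "qa_test_gen": 100.0,
    "pm_assistant": 50.0,
}
limits = daily_limits(allocations)
```

Deriving daily limits from a single source of truth means a budget change is one edit, not four.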
Budget Enforcement with Redis-Backed Rate Limiting
```python
import redis
from datetime import datetime, timezone


class ProjectBudgetEnforcer:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

        # Monthly and daily budget limits (in USD)
        self.project_limits = {
            "ci_cd_agent": {"monthly": 150.0, "daily": 5.0},
            "dev_assistant": {"monthly": 200.0, "daily": 6.67},
            "qa_test_gen": {"monthly": 100.0, "daily": 3.33},
            "pm_assistant": {"monthly": 50.0, "daily": 1.67},
        }

        # Cost per token (USD), Claude 3.5 Haiku pricing
        self.cost_per_input_token = 0.0000008
        self.cost_per_output_token = 0.000004

    def calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.cost_per_input_token +
                output_tokens * self.cost_per_output_token)

    def get_budget_key(self, project_id: str, period: str) -> str:
        now = datetime.now(timezone.utc)
        if period == "daily":
            return f"budget:{project_id}:daily:{now.strftime('%Y-%m-%d')}"
        if period == "monthly":
            return f"budget:{project_id}:monthly:{now.strftime('%Y-%m')}"
        raise ValueError(f"Unknown budget period: {period}")

    def check_and_record_spend(
        self,
        project_id: str,
        input_tokens: int,
        output_tokens: int,
    ) -> dict:
        cost = self.calculate_cost(input_tokens, output_tokens)
        limits = self.project_limits.get(project_id, {"monthly": 10.0, "daily": 0.33})

        daily_key = self.get_budget_key(project_id, "daily")
        monthly_key = self.get_budget_key(project_id, "monthly")

        # Increment both counters in one pipeline round trip; if a limit
        # was crossed, roll the increment back below.
        pipe = self.redis.pipeline()
        pipe.incrbyfloat(daily_key, cost)
        pipe.expire(daily_key, 86400 * 2)     # Expire after 2 days
        pipe.incrbyfloat(monthly_key, cost)
        pipe.expire(monthly_key, 86400 * 35)  # Expire after 35 days
        results = pipe.execute()

        daily_spend = float(results[0])
        monthly_spend = float(results[2])

        status = {
            "allowed": True,
            "cost": cost,
            "daily_spend": daily_spend,
            "monthly_spend": monthly_spend,
            "daily_limit": limits["daily"],
            "monthly_limit": limits["monthly"],
            "daily_percent": (daily_spend / limits["daily"]) * 100,
            "monthly_percent": (monthly_spend / limits["monthly"]) * 100,
        }

        # Hard limit: roll back the increment if a limit was exceeded
        if daily_spend > limits["daily"]:
            self.redis.incrbyfloat(daily_key, -cost)
            status["allowed"] = False
            status["reason"] = "daily_limit_exceeded"
        elif monthly_spend > limits["monthly"]:
            self.redis.incrbyfloat(monthly_key, -cost)
            status["allowed"] = False
            status["reason"] = "monthly_limit_exceeded"

        return status
```
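The increment-then-rollback pattern is easier to see without a live Redis server. This in-memory stand-in (illustrative only, single-process, daily limit only) mirrors the same control flow:

```python
class InMemoryDailyBudget:
    """Toy stand-in for the Redis daily counter: increment first,
    then roll back if the limit was crossed."""

    def __init__(self, daily_limit: float):
        self.daily_limit = daily_limit
        self.daily_spend = 0.0

    def check_and_record(self, cost: float) -> bool:
        self.daily_spend += cost
        if self.daily_spend > self.daily_limit:
            self.daily_spend -= cost  # mirrors incrbyfloat(key, -cost)
            return False
        return True
```

Note the tradeoff the real implementation shares: between the increment and the rollback, concurrent callers briefly see inflated spend. That can cause a spurious rejection near the limit, which is usually the safe direction to fail in.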
Tip: Set your monthly limits at 80% of your actual budget, reserving 20% as a buffer for unexpected spikes. Then treat the 80% limit as your "real" budget in all planning and alerting. This buffer has saved teams from going over budget more often than any other single practice.
Guardrail Patterns for Common Failure Modes
Infinite Loop Detection
Agentic loops can run indefinitely if termination conditions are poorly defined:
```python
class LoopGuardrailError(Exception):
    pass


class LoopGuardrail:
    def __init__(self, max_steps: int = 20, max_tokens: int = 50_000):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps_taken = 0
        self.tokens_used = 0

    def check(self, step_tokens: int) -> None:
        self.steps_taken += 1
        self.tokens_used += step_tokens

        if self.steps_taken >= self.max_steps:
            raise LoopGuardrailError(
                f"Max steps ({self.max_steps}) exceeded. "
                "Agent may be in an infinite loop."
            )
        if self.tokens_used >= self.max_tokens:
            raise LoopGuardrailError(
                f"Max tokens ({self.max_tokens}) exceeded "
                f"across {self.steps_taken} steps."
            )
```
Output Length Guardrails
Some tasks generate outputs that are longer than useful. Add output length validation:
```python
def validate_output_length(response_text: str, task_type: str) -> str:
    MAX_LENGTHS = {  # in tokens
        "code_review": 3000,
        "unit_tests": 5000,
        "summary": 500,
        "classification": 100,
    }
    max_len = MAX_LENGTHS.get(task_type, 2000)
    max_chars = max_len * 4  # rough heuristic: ~4 characters per token

    if len(response_text) > max_chars:
        # Truncate with notice
        truncated = response_text[:max_chars]
        return truncated + (f"\n\n[Output truncated at roughly {max_len} tokens. "
                            "Full analysis available on request.]")
    return response_text
```
Tip: Treat guardrail triggers as signals, not just stops. Log every guardrail activation with the task context and token count. A guardrail that fires more than 5% of the time is a signal that your budget or prompt design needs adjustment — either the budget is too tight, or the prompt is inducing unnecessarily verbose responses.
Budget Visualization and Reporting
Budgets are only useful if stakeholders can see them. Build simple reporting into your budget system:
```python
from datetime import datetime, timezone


def generate_budget_report(project_id: str, enforcer: ProjectBudgetEnforcer) -> str:
    daily_key = enforcer.get_budget_key(project_id, "daily")
    monthly_key = enforcer.get_budget_key(project_id, "monthly")

    daily_spend = float(enforcer.redis.get(daily_key) or 0)
    monthly_spend = float(enforcer.redis.get(monthly_key) or 0)
    limits = enforcer.project_limits[project_id]

    def status(spend: float, limit: float) -> str:
        if spend >= limit:
            return "EXCEEDED"
        return "WARNING" if spend >= limit * 0.8 else "OK"

    return f"""
Budget Report: {project_id}
Generated: {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}

Daily Budget:
  Spent:     ${daily_spend:.4f} of ${limits['daily']:.2f}
  Remaining: ${limits['daily'] - daily_spend:.4f}
  Used:      {(daily_spend / limits['daily'] * 100):.1f}%
  Status:    {status(daily_spend, limits['daily'])}

Monthly Budget:
  Spent:     ${monthly_spend:.4f} of ${limits['monthly']:.2f}
  Remaining: ${limits['monthly'] - monthly_spend:.4f}
  Used:      {(monthly_spend / limits['monthly'] * 100):.1f}%
  Status:    {status(monthly_spend, limits['monthly'])}
"""
```
Tip: Schedule daily budget summary emails or Slack messages to team leads, showing each project's spend-to-date vs. budget. Teams that see their budget consumption daily adjust behavior naturally and rarely hit hard limits. Teams that only see it on the invoice get surprised every month.
Summary
Effective token budgeting requires three layers: per-task hard limits enforced at the API level via max_tokens, per-session compression and soft limits that degrade gracefully, and per-project budget enforcement with Redis-backed counters that prevent organizational overspend. Guardrails against infinite loops and output verbosity catch the failure modes budgets alone cannot prevent. Visibility — through regular reporting and proactive user communication — transforms budget controls from emergency brakes into team-wide cost culture.