This is a full hands-on lab. You will design, implement, and validate a complete token budget framework for a fictional but realistic software development team. By the end, you will have a working budget system that covers model selection routing, per-task and per-project limits, real-time cost monitoring, alerting, and an ROI dashboard — all integrated into a cohesive framework you can adapt for your own team.
The scenario: Acme Engineering is a 12-person product team (6 engineers, 3 QA engineers, 3 PMs) that runs the following agentic workflows:
- CI/CD pipeline code review agent (automated, runs on every PR)
- Developer coding assistant (interactive, on-demand)
- QA test generation agent (runs on new feature tickets)
- PM requirements drafting assistant (interactive, on-demand)
Their goal: spend no more than $800/month total on AI tokens while maximizing productivity for all three roles.
Phase 1: Map Your Workflows and Estimate Baseline Costs
Before writing any code, document every agentic workflow and estimate its token consumption.
Step 1: Workflow Inventory
Create a workflow inventory document (this is your starting artifact):
ACME ENGINEERING — AI WORKFLOW INVENTORY
1. CI/CD Code Review Agent
Trigger: Every PR opened or pushed
Frequency: ~40 PRs/week, ~160/month
Task: Review code changes for bugs, style, security issues
Avg code diff size: 300 lines ≈ 3,500 input tokens
System prompt: 800 tokens
Tool definitions: 400 tokens
Expected output: 500-800 tokens (review comments)
Model candidate: claude-3-5-haiku-20241022
2. Developer Coding Assistant
Trigger: On-demand, all 6 engineers
Frequency: 12 sessions/engineer/week × 6 = 72 sessions/week = 288/month
Avg session: 8 turns, 1,200 tokens per turn (in+out combined)
System prompt: 600 tokens (stable, cacheable)
Model candidate: claude-3-5-haiku-20241022 (routed to Sonnet for complex)
3. QA Test Generation Agent
Trigger: New feature ticket in Jira
Frequency: 8 features/sprint × 2 sprints/month = 16/month
Task: Generate test plan + test cases from requirements
Avg input: 2,000 tokens (requirements doc + existing tests)
Avg output: 3,000-4,000 tokens (test plan + cases)
Model candidate: claude-3-5-sonnet-20241022
4. PM Requirements Drafting Assistant
Trigger: On-demand, all 3 PMs
Frequency: 4 sessions/PM/week × 3 = 12/week = 48/month
Avg session: 6 turns, 800 tokens per turn
System prompt: 500 tokens
Model candidate: claude-3-5-haiku-20241022
Step 2: Calculate Baseline Cost Estimate
Translate the inventory into a monthly cost estimate using published per-million-token pricing (USD per 1M tokens):
PRICING = {
"claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
"claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
}
workflows = {
"ci_cd_review": {
"model": "claude-3-5-haiku-20241022",
"monthly_tasks": 160,
"avg_input_tokens": 4700, # system + tools + diff
"avg_output_tokens": 650,
"label": "CI/CD Code Review"
},
"dev_assistant": {
"model": "claude-3-5-haiku-20241022",
"monthly_tasks": 288, # sessions
"avg_turns": 8,
"avg_input_per_turn": 750, # grows with history, avg mid-session
"avg_output_per_turn": 450,
"label": "Developer Coding Assistant"
},
"qa_test_gen": {
"model": "claude-3-5-sonnet-20241022",
"monthly_tasks": 16,
"avg_input_tokens": 2000,
"avg_output_tokens": 3500,
"label": "QA Test Generation"
},
"pm_assistant": {
"model": "claude-3-5-haiku-20241022",
"monthly_tasks": 48,
"avg_turns": 6,
"avg_input_per_turn": 550,
"avg_output_per_turn": 250,
"label": "PM Requirements Drafting"
}
}
print("=" * 60)
print("ACME ENGINEERING — Monthly Token Cost Estimate")
print("=" * 60)
total_monthly = 0.0
for workflow_id, wf in workflows.items():
pricing = PRICING[wf["model"]]
if "avg_turns" in wf:
# Session-based workflow
total_input = (wf["monthly_tasks"] * wf["avg_turns"] *
wf["avg_input_per_turn"])
total_output = (wf["monthly_tasks"] * wf["avg_turns"] *
wf["avg_output_per_turn"])
else:
# Single-task workflow
total_input = wf["monthly_tasks"] * wf["avg_input_tokens"]
total_output = wf["monthly_tasks"] * wf["avg_output_tokens"]
input_cost = (total_input / 1_000_000) * pricing["input"]
output_cost = (total_output / 1_000_000) * pricing["output"]
workflow_cost = input_cost + output_cost
total_monthly += workflow_cost
print(f"\n{wf['label']}")
print(f" Model: {wf['model']}")
print(f" Input tokens: {total_input:,}/month")
print(f" Output tokens: {total_output:,}/month")
print(f" Input cost: ${input_cost:.2f}")
print(f" Output cost: ${output_cost:.2f}")
print(f" TOTAL: ${workflow_cost:.2f}/month")
print(f"\n{'=' * 60}")
print(f"TOTAL ESTIMATED MONTHLY COST: ${total_monthly:.2f}")
print(f"Budget: $800.00")
print(f"Headroom: ${800 - total_monthly:.2f}")
print(f"Budget utilization: {(total_monthly/800)*100:.1f}%")
Running this produces:
============================================================
ACME ENGINEERING — Monthly Token Cost Estimate
============================================================
CI/CD Code Review
Model: claude-3-5-haiku-20241022
Input tokens: 752,000/month
Output tokens: 104,000/month
Input cost: $0.60
Output cost: $0.42
TOTAL: $1.02/month
Developer Coding Assistant
Model: claude-3-5-haiku-20241022
Input tokens: 1,728,000/month
Output tokens: 1,036,800/month
Input cost: $1.38
Output cost: $4.15
TOTAL: $5.53/month
QA Test Generation
Model: claude-3-5-sonnet-20241022
Input tokens: 32,000/month
Output tokens: 56,000/month
Input cost: $0.10
Output cost: $0.84
TOTAL: $0.94/month
PM Requirements Drafting
Model: claude-3-5-haiku-20241022
Input tokens: 158,400/month
Output tokens: 72,000/month
Input cost: $0.13
Output cost: $0.29
TOTAL: $0.42/month
============================================================
TOTAL ESTIMATED MONTHLY COST: $7.91
Budget: $800.00
Headroom: $792.09
Budget utilization: 1.0%
The estimate is well within budget. This is common for teams just starting agentic workflows: the bottleneck is rarely cost; it's adoption. Budgeting $800/month leaves room for 10x-100x growth in usage as the team adopts these tools.
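To see how much adoption headroom that budget actually provides, project the baseline at a few growth multipliers (a quick sketch using the $7.91/month estimate computed above):

```python
# Project monthly cost at several adoption-growth multipliers,
# using the $7.91/month baseline estimated above.
BASELINE_MONTHLY = 7.91
BUDGET = 800.00

for multiplier in (1, 10, 50, 100):
    projected = BASELINE_MONTHLY * multiplier
    utilization = projected / BUDGET * 100
    print(f"{multiplier:>4}x adoption: ${projected:>8.2f}/month ({utilization:5.1f}% of budget)")
```

Even at 100x the baseline ($791/month), Acme stays just under the $800 cap, which is exactly the headroom described above.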
Tip: Always run your baseline estimate before your first production deployment. The numbers almost always surprise teams — either costs are much lower than feared (enabling faster adoption) or much higher than expected (triggering immediate optimization). Both outcomes are useful.
Phase 2: Implement the Budget Framework
Now build the actual framework. We'll implement this as a Python package that all four workflows can import.
Project Structure
acme_ai_budget/
├── __init__.py
├── config.py # Budget configuration
├── client.py # Instrumented LLM client
├── budget.py # Budget enforcement
├── router.py # Model selection router
├── monitor.py # Real-time monitoring
├── alerts.py # Alerting system
└── report.py # ROI reporting
config.py — Budget Configuration
from dataclasses import dataclass, field
from typing import Dict
@dataclass
class WorkflowBudget:
"""Budget configuration for a single workflow."""
monthly_usd: float
daily_usd: float
per_task_soft_tokens: int
per_task_hard_tokens: int
per_session_soft_tokens: int
per_session_hard_tokens: int
max_loop_steps: int = 20
alert_email: str = "[email protected]"
TOTAL_MONTHLY_BUDGET_USD = 800.00
BUDGET_RESERVE_PCT = 0.20 # Reserve 20% as buffer
WORKFLOW_BUDGETS: Dict[str, WorkflowBudget] = {
"ci_cd_review": WorkflowBudget(
monthly_usd=200.00, # 25% of budget — high frequency, high value
daily_usd=7.00,
per_task_soft_tokens=6_000,
per_task_hard_tokens=12_000,
per_session_soft_tokens=6_000, # Single-turn, reuse task limits
per_session_hard_tokens=12_000,
max_loop_steps=5, # Code review should be quick
),
"dev_assistant": WorkflowBudget(
monthly_usd=350.00, # 44% of budget — highest engagement
daily_usd=12.00,
per_task_soft_tokens=5_000, # Per turn
per_task_hard_tokens=10_000,
per_session_soft_tokens=60_000, # Session accumulation
per_session_hard_tokens=120_000,
max_loop_steps=30, # Coding sessions can be long
),
"qa_test_gen": WorkflowBudget(
monthly_usd=150.00,
daily_usd=8.00,
per_task_soft_tokens=8_000,
per_task_hard_tokens=16_000,
per_session_soft_tokens=8_000,
per_session_hard_tokens=16_000,
max_loop_steps=10,
),
"pm_assistant": WorkflowBudget(
monthly_usd=100.00,
daily_usd=4.00,
per_task_soft_tokens=4_000,
per_task_hard_tokens=8_000,
per_session_soft_tokens=40_000,
per_session_hard_tokens=80_000,
max_loop_steps=15,
),
}
MODEL_ROUTING = {
"ci_cd_review": {
"default": "claude-3-5-haiku-20241022",
"complex": "claude-3-5-sonnet-20241022", # Large diffs or security issues
"threshold_tokens": 6_000, # Escalate if input > threshold
},
"dev_assistant": {
"default": "claude-3-5-haiku-20241022",
"complex": "claude-3-5-sonnet-20241022",
"complexity_keywords": ["architect", "design", "system", "performance", "security"],
},
"qa_test_gen": {
"default": "claude-3-5-sonnet-20241022", # Always use Sonnet for test quality
"simple": "claude-3-5-haiku-20241022", # Simple CRUD test generation
},
"pm_assistant": {
"default": "claude-3-5-haiku-20241022",
"complex": "claude-3-5-sonnet-20241022",
"complexity_keywords": ["strategy", "architecture", "roadmap", "competitive"],
},
}
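One sanity check worth running on this configuration (a standalone sketch, with the allocations copied from WORKFLOW_BUDGETS above): the per-workflow monthly limits sum to exactly the $800 org total, so BUDGET_RESERVE_PCT only creates real headroom if it is actually applied at enforcement time.

```python
# Sanity-check the allocation: per-workflow monthly limits vs. org budget.
# Figures mirror WORKFLOW_BUDGETS and the constants defined above.
TOTAL_MONTHLY_BUDGET_USD = 800.00
BUDGET_RESERVE_PCT = 0.20

allocations = {
    "ci_cd_review": 200.00,
    "dev_assistant": 350.00,
    "qa_test_gen": 150.00,
    "pm_assistant": 100.00,
}

allocated = sum(allocations.values())                             # 800.00
spendable = TOTAL_MONTHLY_BUDGET_USD * (1 - BUDGET_RESERVE_PCT)   # 640.00

print(f"Allocated: ${allocated:.2f}, reserve-adjusted cap: ${spendable:.2f}")
if allocated > spendable:
    print("Warning: allocations consume the reserve; "
          "scale the limits down or enforce the buffer centrally.")
```

Running a check like this in CI whenever config.py changes catches over-allocation before it reaches production.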
budget.py — Core Budget Enforcement
import time
import redis
from typing import Optional
from .config import WORKFLOW_BUDGETS, TOTAL_MONTHLY_BUDGET_USD
COST_TABLE = {
"claude-3-5-sonnet-20241022": {"input": 3.00/1e6, "output": 15.00/1e6},
"claude-3-5-haiku-20241022": {"input": 0.80/1e6, "output": 4.00/1e6},
"claude-3-haiku-20240307": {"input": 0.25/1e6, "output": 1.25/1e6},
}
class BudgetEnforcer:
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis = redis.from_url(redis_url)
def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Unknown model: fall back to Sonnet rates so estimates err high, not low.
        costs = COST_TABLE.get(model, {"input": 3.00/1e6, "output": 15.00/1e6})
return input_tokens * costs["input"] + output_tokens * costs["output"]
def _key(self, workflow_id: str, period: str) -> str:
from datetime import datetime
if period == "daily":
date = datetime.utcnow().strftime("%Y-%m-%d")
return f"acme:budget:{workflow_id}:daily:{date}"
elif period == "monthly":
month = datetime.utcnow().strftime("%Y-%m")
return f"acme:budget:{workflow_id}:monthly:{month}"
        elif period == "total_monthly":
            month = datetime.utcnow().strftime("%Y-%m")
            return f"acme:budget:total:monthly:{month}"
        raise ValueError(f"Unknown budget period: {period!r}")
def check_and_record(
self,
workflow_id: str,
model: str,
input_tokens: int,
output_tokens: int,
) -> dict:
"""
Check budget, record spend if within limits.
Returns dict with allowed status and current spend info.
"""
cost = self.calculate_cost(model, input_tokens, output_tokens)
budget = WORKFLOW_BUDGETS.get(workflow_id)
if not budget:
return {"allowed": True, "cost": cost, "reason": "no_budget_configured"}
        # Per-workflow limits. Apply BUDGET_RESERVE_PCT here if you want the
        # 20% reserve from config.py enforced per workflow rather than centrally.
        effective_monthly = budget.monthly_usd
        effective_daily = budget.daily_usd
daily_key = self._key(workflow_id, "daily")
monthly_key = self._key(workflow_id, "monthly")
total_key = self._key(workflow_id, "total_monthly")
# Read current values
current_daily = float(self.redis.get(daily_key) or 0)
current_monthly = float(self.redis.get(monthly_key) or 0)
total_monthly = float(self.redis.get(total_key) or 0)
# Check total organization budget first
if total_monthly + cost > TOTAL_MONTHLY_BUDGET_USD:
return {
"allowed": False,
"reason": "total_org_budget_exceeded",
"total_monthly": total_monthly,
"org_limit": TOTAL_MONTHLY_BUDGET_USD
}
# Check daily limit
if current_daily + cost > effective_daily:
return {
"allowed": False,
"reason": "daily_limit_exceeded",
"workflow": workflow_id,
"daily_spend": current_daily,
"daily_limit": effective_daily
}
# Check monthly limit
if current_monthly + cost > effective_monthly:
return {
"allowed": False,
"reason": "monthly_limit_exceeded",
"workflow": workflow_id,
"monthly_spend": current_monthly,
"monthly_limit": effective_monthly
}
# Record the spend
pipe = self.redis.pipeline()
pipe.incrbyfloat(daily_key, cost)
pipe.expire(daily_key, 86400 * 2)
pipe.incrbyfloat(monthly_key, cost)
pipe.expire(monthly_key, 86400 * 35)
pipe.incrbyfloat(total_key, cost)
pipe.expire(total_key, 86400 * 35)
pipe.execute()
return {
"allowed": True,
"cost": cost,
"daily_spend_after": current_daily + cost,
"monthly_spend_after": current_monthly + cost,
"daily_pct": ((current_daily + cost) / effective_daily) * 100,
"monthly_pct": ((current_monthly + cost) / effective_monthly) * 100,
}
def get_all_status(self) -> dict:
"""Return current budget status for all workflows."""
from datetime import datetime
month = datetime.utcnow().strftime("%Y-%m")
date = datetime.utcnow().strftime("%Y-%m-%d")
status = {}
for workflow_id, budget in WORKFLOW_BUDGETS.items():
daily_key = self._key(workflow_id, "daily")
monthly_key = self._key(workflow_id, "monthly")
daily_spend = float(self.redis.get(daily_key) or 0)
monthly_spend = float(self.redis.get(monthly_key) or 0)
status[workflow_id] = {
"daily_spend": daily_spend,
"daily_limit": budget.daily_usd,
"daily_pct": (daily_spend / budget.daily_usd) * 100,
"monthly_spend": monthly_spend,
"monthly_limit": budget.monthly_usd,
"monthly_pct": (monthly_spend / budget.monthly_usd) * 100,
"health": (
"CRITICAL" if daily_spend >= budget.daily_usd else
"WARNING" if daily_spend >= budget.daily_usd * 0.8 else
"OK"
)
}
return status
Tip: Keep the three spend increments in a single Redis pipeline so they are recorded together. Note, though, that a pipeline batches commands without making the read-then-increment pattern above fully atomic: under heavy concurrency, two simultaneous requests can both pass the budget check before either records its spend, putting you over limit. For strict enforcement, move the check-and-increment into a server-side Lua script (EVAL) or a WATCH/MULTI transaction.
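Whatever mechanism you use server-side, the semantics you want are "check and increment as one unit." An in-memory sketch of those semantics using a lock (in Redis, a Lua script or WATCH/MULTI transaction gives you the same guarantee; AtomicBudget here is purely illustrative):

```python
import threading

class AtomicBudget:
    """In-memory stand-in for the spend ledger: check and increment under one lock."""
    def __init__(self, limit: float):
        self.limit = limit
        self.spent = 0.0
        self._lock = threading.Lock()

    def try_spend(self, cost: float) -> bool:
        with self._lock:  # the check and the increment happen as one unit
            if self.spent + cost > self.limit:
                return False
            self.spent += cost
            return True

budget = AtomicBudget(limit=1.00)
results = []
workers = [
    threading.Thread(target=lambda: results.append(budget.try_spend(0.30)))
    for _ in range(5)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
# Only 3 spends of $0.30 fit under $1.00, no matter how the threads interleave.
```

Without the lock, two threads could both observe spent=0.90, both pass the check, and push the total to $1.50: the same race the Redis ledger faces.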
Phase 3: Implement the Model Router
router.py — Model Selection Router
from .config import MODEL_ROUTING
class TaskRouter:
"""Routes tasks to appropriate models based on complexity signals."""
def select_model(self, workflow_id: str, prompt: str, context: dict = None) -> str:
routing = MODEL_ROUTING.get(workflow_id, {})
default_model = routing.get("default", "claude-3-5-haiku-20241022")
complex_model = routing.get("complex", "claude-3-5-sonnet-20241022")
# Check token threshold (large input = complex task)
threshold = routing.get("threshold_tokens")
if threshold:
estimated_tokens = len(prompt.split()) * 1.3
if estimated_tokens > threshold:
return complex_model
# Check complexity keywords
keywords = routing.get("complexity_keywords", [])
prompt_lower = prompt.lower()
if any(kw in prompt_lower for kw in keywords):
return complex_model
# Check explicit context signals
if context:
if context.get("is_complex"):
return complex_model
if context.get("is_simple"):
return routing.get("simple", default_model)
return default_model
def get_max_tokens(self, workflow_id: str, model: str) -> int:
"""Return appropriate max_tokens for workflow + model combination."""
MAX_TOKENS = {
"ci_cd_review": {"haiku": 1024, "sonnet": 2048},
"dev_assistant": {"haiku": 2048, "sonnet": 4096},
"qa_test_gen": {"haiku": 3000, "sonnet": 6000},
"pm_assistant": {"haiku": 1500, "sonnet": 3000},
}
model_tier = "sonnet" if "sonnet" in model else "haiku"
return MAX_TOKENS.get(workflow_id, {}).get(model_tier, 2048)
Phase 4: Wire Up the Main Client
client.py — Instrumented LLM Client
import anthropic
import time
import uuid
from datetime import datetime, timezone
from typing import Optional, List, Dict
from .budget import BudgetEnforcer
from .router import TaskRouter
from .monitor import UsageLogger
from .alerts import AlertManager
class AcmeBudgetedClient:
"""
Main entry point for all Acme AI interactions.
Handles model routing, budget enforcement, and usage logging.
"""
def __init__(
self,
redis_url: str = "redis://localhost:6379",
slack_webhook: Optional[str] = None,
):
self.anthropic = anthropic.Anthropic()
self.enforcer = BudgetEnforcer(redis_url)
self.router = TaskRouter()
self.logger = UsageLogger(redis_url)
self.alerts = AlertManager(slack_webhook) if slack_webhook else None
def complete(
self,
workflow_id: str,
messages: List[Dict],
system: str = "",
context: dict = None,
session_id: Optional[str] = None,
user_id: Optional[str] = None,
loop_step: Optional[int] = None,
) -> dict:
"""
Main completion method with full budget management.
Returns dict with content, model, usage, and budget status.
"""
# Select model
last_user_msg = next(
(m["content"] for m in reversed(messages) if m["role"] == "user"),
""
)
model = self.router.select_model(workflow_id, last_user_msg, context)
max_tokens = self.router.get_max_tokens(workflow_id, model)
        # Pre-flight: estimate cost and check budget.
        # Rough heuristic: ~4 characters per token.
        est_input = sum(len(str(m.get("content", ""))) for m in messages) // 4
        est_input += len(system) // 4
pre_check = self.enforcer.check_and_record(
workflow_id, model, est_input, max_tokens // 4
)
if not pre_check["allowed"]:
return {
"content": (
f"I'm unable to process this request — the {workflow_id} workflow "
f"has reached its budget limit ({pre_check['reason']}). "
"Please contact your team lead or try again tomorrow."
),
"model": model,
"budget_blocked": True,
"budget_reason": pre_check["reason"]
}
# Make the API call
start_time = time.time()
kwargs = {
"model": model,
"max_tokens": max_tokens,
"messages": messages,
}
if system:
kwargs["system"] = system
response = self.anthropic.messages.create(**kwargs)
latency_ms = int((time.time() - start_time) * 1000)
# Reconcile actual cost vs. estimate
actual_input = response.usage.input_tokens
actual_output = response.usage.output_tokens
        # Correct the ledger with the shortfall between actual and estimated
        # usage. Negative corrections (over-estimates) are clamped to zero,
        # so the ledger errs on the conservative (high) side.
        estimate_input = est_input
        estimate_output = max_tokens // 4
        correction_input = actual_input - estimate_input
        correction_output = actual_output - estimate_output
        if correction_input > 0 or correction_output > 0:
            self.enforcer.check_and_record(
                workflow_id, model,
                max(0, correction_input),
                max(0, correction_output)
            )
# Log the event
self.logger.log(
workflow_id=workflow_id,
session_id=session_id or str(uuid.uuid4()),
user_id=user_id or "unknown",
model=model,
input_tokens=actual_input,
output_tokens=actual_output,
latency_ms=latency_ms,
loop_step=loop_step,
)
# Check alert thresholds
if self.alerts:
status = self.enforcer.get_all_status()
wf_status = status.get(workflow_id, {})
if wf_status.get("daily_pct", 0) >= 80:
self.alerts.warn(
title=f"Budget Warning: {workflow_id}",
message=f"Daily budget {wf_status['daily_pct']:.1f}% consumed",
context=wf_status
)
        return {
            "content": response.content[0].text,
            "model": model,
            "input_tokens": actual_input,
            "output_tokens": actual_output,
            "latency_ms": latency_ms,
            "budget_blocked": False,
            "daily_budget_pct": pre_check.get("daily_pct"),
            "monthly_budget_pct": pre_check.get("monthly_pct"),
        }
Tip: Return budget status information in every API response, even when the request succeeds. Downstream code can use this to display budget indicators in developer tools, log budget health metrics, or trigger graceful context compression before hitting hard limits.
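The pre-flight check above leans on a rough characters-divided-by-4 heuristic. A standalone version of that estimator (the 4:1 ratio is an assumption that holds loosely for English prose; code and non-English text tokenize differently, and the authoritative counts are the usage fields the API returns):

```python
def estimate_tokens(text: str) -> int:
    """Rough pre-flight estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# A 400-character message estimates to ~100 tokens.
print(estimate_tokens("x" * 400))
```

Because the estimate feeds a budget check, a systematic under-estimate is the dangerous direction; the post-call reconciliation step in complete() exists precisely to patch up the difference.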
Phase 5: Build the Monthly Budget Report
report.py — ROI Reporting
from .budget import BudgetEnforcer
from .config import WORKFLOW_BUDGETS, TOTAL_MONTHLY_BUDGET_USD
from datetime import datetime
def generate_monthly_report(enforcer: BudgetEnforcer) -> str:
status = enforcer.get_all_status()
month = datetime.utcnow().strftime("%B %Y")
total_spend = sum(s["monthly_spend"] for s in status.values())
total_budget = TOTAL_MONTHLY_BUDGET_USD
report_lines = [
f"ACME ENGINEERING — AI Budget Report ({month})",
"=" * 60,
f"",
f"Organization Summary:",
f" Total spend: ${total_spend:.2f} of ${total_budget:.2f}",
f" Budget used: {(total_spend/total_budget)*100:.1f}%",
f" Remaining: ${total_budget - total_spend:.2f}",
f"",
f"Workflow Breakdown:",
]
for workflow_id, wf_status in status.items():
health_icon = {"OK": "[OK]", "WARNING": "[WARN]", "CRITICAL": "[CRIT]"}[
wf_status["health"]
]
report_lines.extend([
f"",
f" {health_icon} {workflow_id}",
f" Monthly: ${wf_status['monthly_spend']:.4f} / ${wf_status['monthly_limit']:.2f}"
f" ({wf_status['monthly_pct']:.1f}%)",
f" Daily: ${wf_status['daily_spend']:.4f} / ${wf_status['daily_limit']:.2f}"
f" ({wf_status['daily_pct']:.1f}%)",
])
# Estimated ROI summary
report_lines.extend([
f"",
"=" * 60,
f"ROI Summary (estimated):",
f" Engineer time saved: ~180 hrs/month × $90/hr = $16,200",
f" QA time saved: ~90 hrs/month × $70/hr = $6,300",
f" PM time saved: ~40 hrs/month × $100/hr = $4,000",
f" Total value created: $26,500/month",
f" Total AI investment: ${total_spend:.2f}/month",
f" Net ROI: {((26500 - total_spend) / max(total_spend, 0.01)) * 100:.0f}%",
])
return "\n".join(report_lines)
if __name__ == "__main__":
enforcer = BudgetEnforcer()
print(generate_monthly_report(enforcer))
Tip: Schedule this report to run on the 1st of each month (via cron or GitHub Actions) and post it to your team Slack channel automatically. The act of publishing the report publicly — even just to your team — creates accountability and naturally drives optimization behavior without any top-down mandate.
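One way to wire the Slack step, as a sketch: post_report_to_slack and build_slack_payload are hypothetical helpers (not part of the package above), the webhook URL is a standard Slack incoming webhook, and the report text comes from generate_monthly_report.

```python
import json
import urllib.request

def build_slack_payload(report_text: str) -> bytes:
    """Slack incoming webhooks accept a JSON body with a 'text' field."""
    return json.dumps({"text": report_text}).encode("utf-8")

def post_report_to_slack(webhook_url: str, report_text: str) -> int:
    """POST the monthly report to a Slack incoming webhook; returns HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(report_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Point whatever scheduler you use (cron, GitHub Actions) at a small script that calls generate_monthly_report and then post_report_to_slack with the webhook URL from your secrets store.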
Phase 6: Integration Test Your Framework
Before deploying to production, run integration tests against all four workflows:
import pytest
from unittest.mock import MagicMock, patch
from acme_ai_budget.budget import BudgetEnforcer
from acme_ai_budget.router import TaskRouter
from acme_ai_budget.config import WORKFLOW_BUDGETS
def test_router_selects_haiku_for_simple_task():
router = TaskRouter()
model = router.select_model("dev_assistant", "extract the function name from this code")
assert "haiku" in model
def test_router_escalates_to_sonnet_for_architecture_task():
router = TaskRouter()
model = router.select_model("dev_assistant", "design the architecture for our microservices")
assert "sonnet" in model
@pytest.fixture
def mock_redis():
    """Patch BudgetEnforcer's Redis client so budget tests run without a server."""
    with patch("acme_ai_budget.budget.redis") as redis_module:
        client = MagicMock()
        client.get.return_value = None  # no prior spend recorded
        redis_module.from_url.return_value = client
        yield client

def test_budget_allows_request_within_limits(mock_redis):
enforcer = BudgetEnforcer()
result = enforcer.check_and_record(
"ci_cd_review",
"claude-3-5-haiku-20241022",
input_tokens=1000,
output_tokens=500
)
assert result["allowed"] is True
def test_budget_blocks_request_when_daily_limit_exceeded(mock_redis):
enforcer = BudgetEnforcer()
budget = WORKFLOW_BUDGETS["ci_cd_review"]
# Simulate that daily spend is already at limit
mock_redis.get.return_value = str(budget.daily_usd).encode()
result = enforcer.check_and_record(
"ci_cd_review",
"claude-3-5-haiku-20241022",
input_tokens=1000,
output_tokens=500
)
assert result["allowed"] is False
assert result["reason"] == "daily_limit_exceeded"
def test_full_workflow_integration():
    """End-to-end test with mocked Anthropic API (requires a local Redis at :6379)."""
with patch("anthropic.Anthropic") as mock_anthropic:
mock_response = MagicMock()
mock_response.content[0].text = "Test output"
mock_response.usage.input_tokens = 500
mock_response.usage.output_tokens = 200
mock_response.stop_reason = "end_turn"
mock_anthropic.return_value.messages.create.return_value = mock_response
from acme_ai_budget.client import AcmeBudgetedClient
client = AcmeBudgetedClient(redis_url="redis://localhost:6379")
result = client.complete(
workflow_id="ci_cd_review",
messages=[{"role": "user", "content": "Review this code: def foo(): pass"}],
system="You are a code reviewer.",
user_id="engineer_001"
)
assert result["budget_blocked"] is False
assert result["content"] == "Test output"
assert result["input_tokens"] == 500
Tip: Add these integration tests to your CI pipeline and run them on every code change to the budget framework. A broken budget guardrail that silently allows unlimited spending is worse than no guardrail — the broken state gives false confidence. CI tests prevent regressions in your financial controls.
Phase 7: Deploy and Validate
Once the framework is implemented and tested:
Week 1: Deploy to one workflow (start with CI/CD review — automated, easy to measure)
- Enable all logging
- Set alerts to INFO level (no paging yet)
- Compare actual costs to estimate
Week 2: Deploy to all workflows
- Enable WARNING alerts to Slack
- Run baseline ROI measurement
Week 3: Enable CRITICAL alerts with PagerDuty
- Adjust budget limits based on real data
- Tune model routing thresholds
Week 4: Generate first monthly report
- Share ROI analysis with team lead
- Identify top optimization opportunities
Month 2+: Run the monthly ROI review
- Adjust budgets based on actual consumption
- Expand to new workflows as team adopts AI tooling
Tip: When you deploy the budget framework, announce it to your team with context: "This framework gives us visibility into how we're using our AI budget and makes sure no one accidentally runs up a big bill. It's not here to limit your use of AI — it's here to help us use it more." The framing matters enormously for adoption.
Summary
This hands-on lab walked through building a production-grade token budget framework from inventory through implementation to deployment. The complete framework includes baseline cost estimation, Redis-backed budget enforcement at daily and monthly granularities, model routing that balances quality and cost across all four workflow types, integration tests that protect your financial controls from regressions, and automated monthly reporting that keeps ROI visible to the whole team. The framework is intentionally modular — you can adopt any one component independently and add the others as your team matures. The goal is not a perfect system on day one, but a system that gets measurably better each month.