This capstone brings together every technique from the course into a single, end-to-end token optimization audit. You will work through a realistic scenario: a mid-sized software engineering team running an AI-assisted PR review pipeline that has grown organically over eight months and now costs significantly more than it should. You will audit it, diagnose the problems, implement optimizations, validate results, and establish ongoing monitoring — exactly as you would in production.
This walkthrough is designed to be followed hands-on. Every section includes executable code, concrete decisions, and realistic trade-offs.
The Scenario: The PR Review Pipeline
Context: You are the platform engineering lead at a 60-person software company. Eight months ago, your team shipped an AI-assisted PR review system. Engineers submit PRs and the system automatically generates a code review covering security, performance, logic, and style.
Current state of the pipeline:
- 400 PRs reviewed per day across 4 engineering squads
- Average cost: $0.38 per PR review
- Monthly cost: approximately $4,560
- Initial target: $0.12 per PR review (established when the system was first specced)
- Quality score: 4.1/5.0 (measured by quarterly engineer surveys)
Your mandate: Reduce cost to ≤ $0.15 per PR review without dropping quality below 3.8/5.0.
The pipeline architecture:
PR Submitted → Context Builder → System Prompt Builder → LLM (GPT-4o) → Review Formatter → Engineer
                     ↑                      ↑                   ↑
              [RAG: codebase      [Static guidelines +    [Multi-turn for
               conventions]        team-specific rules]    clarification]
Phase 1: Establishing the Baseline
Before touching anything, you need a complete picture of current token consumption. This is non-negotiable — optimization without a baseline is navigation without a map.
Step 1.1: Deploy Instrumentation
The pipeline was built eight months ago and has minimal observability. Your first task is to add structured telemetry without changing any prompts or logic.
import time
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
@dataclass
class PRReviewTrace:
trace_id: str
timestamp_utc: str
pr_id: str
squad: str
pr_size_category: str # small (<50 lines), medium (50-300), large (>300)
# Context breakdown
system_prompt_tokens: int
rag_context_tokens: int
pr_diff_tokens: int
conversation_history_tokens: int
total_input_tokens: int
# Output
output_tokens: int
total_tokens: int
# Turn tracking
turn_count: int
# Cost
cost_usd: float
model: str
# Performance
latency_ms: int
# Downstream
    quality_score: float | None = None  # Populated later from survey data
def wrap_pr_review_with_tracing(pr_id: str, squad: str, pr_diff: str):
trace_id = str(uuid.uuid4())
start_time = time.time()
# Build context (existing code, just instrument it)
system_prompt = build_system_prompt()
rag_context = retrieve_codebase_context(pr_diff)
conversation_history = get_conversation_history(pr_id)
# Count tokens before LLM call
sp_tokens = count_tokens(system_prompt)
rag_tokens = count_tokens(rag_context)
diff_tokens = count_tokens(pr_diff)
history_tokens = count_tokens(conversation_history)
# Make LLM call (existing code)
response = llm_client.chat.completions.create(
model="gpt-4o",
messages=build_messages(system_prompt, rag_context, pr_diff, conversation_history)
)
latency_ms = int((time.time() - start_time) * 1000)
cost = calculate_cost(response.usage, "gpt-4o")
# Write trace
trace = PRReviewTrace(
trace_id=trace_id,
timestamp_utc=datetime.now(timezone.utc).isoformat(),
pr_id=pr_id,
squad=squad,
pr_size_category=categorize_pr_size(pr_diff),
system_prompt_tokens=sp_tokens,
rag_context_tokens=rag_tokens,
pr_diff_tokens=diff_tokens,
conversation_history_tokens=history_tokens,
total_input_tokens=response.usage.prompt_tokens,
output_tokens=response.usage.completion_tokens,
total_tokens=response.usage.total_tokens,
turn_count=count_turns(conversation_history),
cost_usd=cost,
model="gpt-4o",
latency_ms=latency_ms
)
write_to_analytics_store(asdict(trace))
return response.choices[0].message.content
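The wrapper assumes two helpers the original pipeline never defined cleanly: count_tokens and calculate_cost. A minimal sketch of both, using tiktoken for tokenization and an illustrative pricing table (the prices are placeholders, not current list prices; check your provider's price sheet):

import tiktoken

# Placeholder prices in USD per million tokens — NOT current list prices.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

_encoder = tiktoken.get_encoding("o200k_base")  # encoding family used by GPT-4o models

def count_tokens(text: str) -> int:
    """Approximate token count, good enough for per-component attribution."""
    return len(_encoder.encode(text or ""))

def calculate_cost(usage, model: str) -> float:
    """Convert an API usage object to USD using the table above."""
    prices = PRICING[model]
    return (usage.prompt_tokens * prices["input"]
            + usage.completion_tokens * prices["output"]) / 1_000_000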
Run this instrumentation for two weeks without any changes. You need a clean baseline.
Step 1.2: Analyze the Baseline Data
After two weeks, run the baseline analysis:
import pandas as pd
import numpy as np
def generate_baseline_report(df: pd.DataFrame) -> dict:
"""
Comprehensive baseline report for the PR review pipeline.
"""
# Overall statistics
overall = {
"total_reviews": len(df),
"daily_average": len(df) / 14,
"avg_total_tokens": df["total_tokens"].mean(),
"median_total_tokens": df["total_tokens"].median(),
"p95_total_tokens": df["total_tokens"].quantile(0.95),
"avg_cost_usd": df["cost_usd"].mean(),
"total_cost_two_weeks": df["cost_usd"].sum(),
"monthly_projected_cost": df["cost_usd"].sum() * 2.17
}
# Token breakdown
token_breakdown = {
"system_prompt": {
"avg": df["system_prompt_tokens"].mean(),
"pct_of_input": df["system_prompt_tokens"].mean() / df["total_input_tokens"].mean()
},
"rag_context": {
"avg": df["rag_context_tokens"].mean(),
"pct_of_input": df["rag_context_tokens"].mean() / df["total_input_tokens"].mean()
},
"pr_diff": {
"avg": df["pr_diff_tokens"].mean(),
"pct_of_input": df["pr_diff_tokens"].mean() / df["total_input_tokens"].mean()
},
"conversation_history": {
"avg": df["conversation_history_tokens"].mean(),
"pct_of_input": df["conversation_history_tokens"].mean() / df["total_input_tokens"].mean()
}
}
# By PR size
by_size = df.groupby("pr_size_category").agg({
"total_tokens": ["mean", "count"],
"cost_usd": "mean",
"turn_count": "mean"
}).round(2)
# By squad
by_squad = df.groupby("squad").agg({
"total_tokens": "mean",
"cost_usd": ["mean", "sum"],
"turn_count": "mean"
}).round(2)
return {
"overall": overall,
"token_breakdown": token_breakdown,
"by_size": by_size.to_dict(),
"by_squad": by_squad.to_dict()
}
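Generating the report is then a two-liner once the traces are loaded into a DataFrame. A sketch assuming the analytics store can export the window as JSON lines (the filename is hypothetical):

import json
import pandas as pd

df = pd.read_json("traces_baseline_14d.jsonl", lines=True)  # hypothetical export
report = generate_baseline_report(df)
print(json.dumps(report["overall"], indent=2, default=str))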
Baseline findings (realistic for this scenario):
| Component | Average Tokens | % of Total Input |
|---|---|---|
| System prompt | 1,847 | 19.3% |
| RAG context | 3,420 | 35.7% |
| PR diff | 3,890 | 40.6% |
| Conversation history | 418 | 4.4% |
| Total input | 9,575 | 100% |
| Output | 1,840 | — |
| Grand total | 11,415 | — |
Key observations from the baseline:
- The system prompt is 1,847 tokens — nearly 3x the expected size (the original design called for ~650 tokens)
- RAG context averages 3,420 tokens, pulled from 8 chunks of up to 500 tokens each — the pipeline never had a retrieval quality filter
- Conversation history is present even on first-turn reviews (a bug: stale sessions are not being cleared)
- The output is averaging 1,840 tokens — the review format is not constraining response length
- Multi-turn usage: 23% of reviews involve 2+ turns; these average 18,500 total tokens vs. 10,100 for single-turn
Tip: Always break down token consumption by component before hypothesizing fixes. In this scenario, a developer who "just looks at the system prompt" would miss the RAG inefficiency and the conversation history bug — together worth more savings than the system prompt compression.
Phase 2: Identifying and Prioritizing Optimization Opportunities
Step 2.1: Build the Opportunity Map
From the baseline analysis, we can quantify the opportunity for each component:
OPTIMIZATION_OPPORTUNITIES = [
{
"id": "OPT-001",
"name": "Fix conversation history bug",
"component": "conversation_history",
"type": "bug_fix",
"description": "Stale session history is being injected even on first-turn reviews",
"baseline_avg_tokens": 418,
"expected_post_fix_tokens": 0, # First-turn reviews should have zero history
"affected_pct": 0.77, # 77% of reviews are single-turn
"implementation_effort": "low",
"quality_risk": "none",
"weekly_token_savings": 418 * 0.77 * 400 * 7, # ~900K tokens/week
"weekly_cost_savings": None # Calculate below
},
{
"id": "OPT-002",
"name": "Compress system prompt",
"component": "system_prompt",
"type": "optimization",
"description": "System prompt grew from 650 to 1,847 tokens over 8 months via feature additions. "
"Restructuring as compact numbered lists should reach ~700 tokens.",
"baseline_avg_tokens": 1847,
"expected_post_opt_tokens": 700,
"implementation_effort": "medium",
"quality_risk": "low",
"ab_test_required": True
},
{
"id": "OPT-003",
"name": "Enable prompt caching for system prompt",
"component": "system_prompt",
"type": "infrastructure",
"description": "System prompt is static. Enabling Anthropic/OpenAI prompt caching "
"would reduce effective cost of system prompt tokens by 90%.",
"implementation_effort": "low",
"quality_risk": "none",
"depends_on": "OPT-002" # More valuable after compression
},
{
"id": "OPT-004",
"name": "Add RAG relevance filtering",
"component": "rag_context",
"type": "optimization",
"description": "Currently retrieves top-8 chunks with no relevance threshold. "
"Filtering to cosine similarity > 0.65 and top-5 would cut RAG tokens by ~40%.",
"baseline_avg_tokens": 3420,
"expected_post_opt_tokens": 2050,
"implementation_effort": "medium",
"quality_risk": "medium",
"ab_test_required": True
},
{
"id": "OPT-005",
"name": "Add output format constraints",
"component": "output",
"type": "optimization",
"description": "Reviews are currently unstructured prose. Adding explicit format "
"(JSON schema or compact template) and word limits would reduce output tokens.",
"baseline_avg_tokens": 1840,
"expected_post_opt_tokens": 950,
"implementation_effort": "low",
"quality_risk": "medium",
"ab_test_required": True
},
{
"id": "OPT-006",
"name": "Route small PRs to GPT-4o-mini",
"component": "model",
"type": "model_routing",
"description": "Small PRs (<50 lines, ~30% of volume) may be adequately reviewed "
"by GPT-4o-mini at 10x lower cost.",
"affected_pct": 0.30,
"implementation_effort": "medium",
"quality_risk": "medium",
"ab_test_required": True
}
]
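To fill the weekly_cost_savings placeholders, convert token deltas to dollars. A sketch of the arithmetic; the per-million price is left as a parameter because blended rates vary by provider and by cached/uncached mix:

def estimate_weekly_cost_savings(
    tokens_saved_per_review: float,
    affected_pct: float,
    price_per_million_tokens: float,  # your blended input-token rate
    reviews_per_day: int = 400,
) -> float:
    """Weekly dollar savings implied by an input-token reduction."""
    weekly_tokens = tokens_saved_per_review * affected_pct * reviews_per_day * 7
    return weekly_tokens * price_per_million_tokens / 1_000_000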
Step 2.2: Prioritize by Impact × Risk
| ID | Opportunity | Weekly Savings (est.) | Risk | Priority |
|---|---|---|---|---|
| OPT-001 | Fix history bug | $280 | None | P0 — Ship immediately |
| OPT-003 | Enable prompt caching | $180 | None | P0 — Ship immediately |
| OPT-002 | Compress system prompt | $340 | Low | P1 — A/B test first |
| OPT-005 | Output format constraints | $360 | Medium | P1 — A/B test first |
| OPT-004 | RAG relevance filtering | $220 | Medium | P2 — A/B test first |
| OPT-006 | Route small PRs to mini model | $210 | Medium | P2 — A/B test first |
The line items sum to roughly $1,590/week, but treat that as a ceiling rather than a forecast: the estimates overlap (caching savings shrink once the prompt is compressed, and routing savings overlap with output-format savings). The de-duplicated projection is closer to $800/week (about $3,450/month).
This would bring cost per review from $0.38 to approximately $0.095 — well under the $0.15 target.
Tip: Always ship no-risk optimizations (bug fixes, prompt caching) first, before running A/B tests. They establish a new, lower baseline, so each subsequent experiment measures its own effect cleanly instead of having known bugs and free wins tangled into its results.
Phase 3: Implementing the Optimizations
Step 3.1: Ship OPT-001 — Fix the History Bug
Investigation reveals that the session management code was reusing a single session object across all PR submissions.
Before (buggy): a singleton that ignores pr_id entirely.
class PRReviewSession:
    _instance = None

    def __init__(self):
        self.history = []

    @classmethod
    def get_session(cls, pr_id: str):
        if cls._instance is None:
            cls._instance = PRReviewSession()
        return cls._instance  # BUG: the same instance (and history) is shared across all PRs!
After (fixed): one session per PR, with explicit cleanup.
class PRReviewSession:
    _sessions = {}

    def __init__(self):
        self.history = []

    @classmethod
    def get_session(cls, pr_id: str):
        if pr_id not in cls._sessions:
            cls._sessions[pr_id] = PRReviewSession()
        return cls._sessions[pr_id]

    @classmethod
    def clear_session(cls, pr_id: str):
        cls._sessions.pop(pr_id, None)
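A regression test pins the fix in place. A minimal sketch in pytest style (test name and PR ids are illustrative):

def test_sessions_are_isolated():
    a = PRReviewSession.get_session("pr-101")
    b = PRReviewSession.get_session("pr-102")
    assert a is not b, "each PR must get its own session"
    a.history.append({"role": "user", "content": "turn 1"})
    assert b.history == [], "history must not leak across PRs"
    PRReviewSession.clear_session("pr-101")
    assert PRReviewSession.get_session("pr-101") is not a, "cleared sessions are rebuilt"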
Deploy, monitor for 48 hours, confirm conversation history tokens drop to ~0 for first-turn reviews.
Measured result after 48 hours: Average conversation history tokens: 22 (near-zero, down from 418). Cost per review: $0.32 (down from $0.38).
Step 3.2: Enable Prompt Caching (OPT-003)
def build_messages_with_caching(system_prompt: str, user_content: str) -> list:
    # OpenAI caches automatically when the prompt prefix matches a recent request
    # (the prefix must be at least 1,024 tokens), so keep the static system prompt first.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
def build_anthropic_messages_with_caching(system_prompt: str, user_content: str) -> dict:
return {
"system": [
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"} # Enable prompt caching
}
],
"messages": [
{"role": "user", "content": user_content}
]
}
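Verify the cache is actually engaging before booking the savings. On the OpenAI side, the usage object reports cached prompt tokens directly via prompt_tokens_details; a sketch:

def cache_hit_rate(response) -> float:
    """Fraction of prompt tokens served from the provider's prompt cache."""
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0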
Measured result after 72 hours: System prompt cache hit rate: 94%. Effective cost of system prompt tokens reduced by ~85%. Cost per review: $0.27 (down from $0.32).
Step 3.3: Run A/B Test for System Prompt Compression (OPT-002)
Before (1,847 tokens): The current system prompt includes role preamble, extensive context-setting paragraphs, detailed review guidelines in prose form, a security checklist in paragraph format, output format description, and closing caveats.
After (714 tokens): The compressed version removes the preamble, converts prose to numbered lists, and uses a compact security checklist. Both are shown below: the original opening first, then the full compressed prompt.
You are an expert senior software engineer with deep experience in code review.
Your role is to provide comprehensive, actionable feedback on code changes submitted
as pull requests. You have extensive knowledge of software security, performance
optimization, and code quality best practices. When reviewing code, you approach
it as a mentor who wants to help the team improve their craft while ensuring the
codebase remains maintainable and secure.
When conducting your review, please carefully consider the following aspects:
**Correctness and Logic**: Examine the code for logical errors, off-by-one errors,
null pointer dereferences, race conditions, and other correctness issues. Consider
edge cases that the developer may have missed...
[continues for 1,847 tokens total]
---
Senior software engineer. Review PR diffs for:
1. **Correctness**: Logic errors, edge cases, null handling, race conditions
2. **Security**: Injection (SQL/XSS/cmd), auth/authz flaws, secrets exposure,
insecure deps, input validation
3. **Performance**: N+1 queries, missing indexes, inefficient algorithms,
memory leaks, blocking I/O
4. **Maintainability**: Naming clarity, function size, duplication, test coverage
5. **Style**: Consistency with existing patterns
**Output format** (strict):
- **Summary** (2-3 sentences): Overall assessment
- **Critical** (block merge): Numbered list. Each item: file:line — issue — fix
- **Suggestions** (non-blocking): Numbered list. Same format
- **Positives** (1-3 items): What was done well
Max 400 words total. Be specific. Reference exact line numbers.
A/B Test Setup:
- Traffic split: 50/50
- Run duration: 10 days
- Required samples: 500 per variant (calculated via power analysis; see the sketch below)
- Primary metric: total tokens per review
- Guard metric: quality score from LLM-as-judge (using GPT-4o-mini as evaluator)
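The 500-per-variant figure comes from a standard two-sample power calculation. A sketch using statsmodels; the coefficient of variation (sd/mean ≈ 0.9) is an assumption made for this sketch, not a number from the baseline tables:

from statsmodels.stats.power import TTestIndPower

# To detect a ~15% reduction in tokens per review when sd/mean ≈ 0.9 (assumed),
# the standardized effect size is d ≈ 0.15 / 0.9 ≈ 0.17.
effect_size = 0.15 / 0.9
n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_variant))  # ≈ 565 — the same order as the 500-per-variant requirement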
def run_system_prompt_ab_test(pr_id: str, pr_diff: str, squad: str) -> dict:
variant = assign_variant(pr_id, "system_prompt_compression_v1")
if variant == "control":
system_prompt = load_prompt("control_system_prompt_v1847")
else:
system_prompt = load_prompt("treatment_system_prompt_v714")
response = call_gpt4o(system_prompt, pr_diff)
    # LLM-as-judge quality evaluation (in production this runs asynchronously and is
    # joined to the experiment log once scored; shown inline here for readability)
quality_score = evaluate_review_quality_async(
review=response.content,
pr_diff=pr_diff,
criteria=["specificity", "actionability", "coverage", "correctness"]
)
log_experiment_result(
experiment="system_prompt_compression_v1",
variant=variant,
pr_id=pr_id,
input_tokens=response.usage.prompt_tokens,
output_tokens=response.usage.completion_tokens,
quality_score=quality_score
)
return {"review": response.content, "variant": variant}
A/B Test Results (after 10 days, 1,247 reviews per variant):
| Metric | Control | Treatment | Change |
|---|---|---|---|
| Avg system prompt tokens | 1,847 | 714 | -61.3% |
| Avg total input tokens | 8,982 | 7,849 | -12.6% |
| Avg output tokens | 1,840 | 1,120 | -39.1% |
| Avg total tokens | 10,822 | 8,969 | -17.1% |
| Avg quality score | 4.10 | 4.05 | -0.05 |
| P-value, total tokens (Mann-Whitney U) | — | — | 0.0003 |
Decision: SHIP treatment. Token reduction is statistically significant and meaningful (-17.1%). Quality delta is -0.05, well within the acceptable range (floor is 3.8; treatment is 4.05).
Step 3.4: A/B Test Output Format Constraints (OPT-005)
The treatment system prompt already includes "Max 400 words total" but this alone may not be sufficient. Test an explicit output schema:
OUTPUT_FORMAT_TREATMENT = """
Respond ONLY with valid JSON in this exact structure. No prose outside the JSON.
{
"summary": "string (max 50 words)",
"critical_issues": [
{"file": "string", "line": "integer or range", "issue": "string", "fix": "string"}
],
"suggestions": [
{"file": "string", "line": "integer or range", "issue": "string", "fix": "string"}
],
"positives": ["string", "string"]
}
"""
A/B Test Results for output format:
| Metric | Prose format | JSON format | Change |
|---|---|---|---|
| Avg output tokens | 1,120 | 680 | -39.3% |
| Quality (LLM judge) | 4.05 | 4.08 | +0.03 |
| Parsing failure rate | 0% | 1.2% | +1.2% |
Decision: SHIP JSON format. Output token reduction is substantial. Small quality improvement. Parsing failures (1.2%) are acceptable and can be handled with a fallback parser.
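The fallback parser can be a simple salvage pass: strip fences or surrounding prose, retry, and degrade to raw text rather than dropping the review. A sketch:

import json
import re

def parse_review_json(raw: str) -> dict:
    """Parse the model's JSON review, salvaging the common failure modes."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Failure mode 1: JSON wrapped in markdown fences or prose — extract the outermost object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Failure mode 2: unrecoverable — deliver the raw text instead of failing the review.
    return {"summary": raw.strip(), "critical_issues": [], "suggestions": [], "positives": []}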
Step 3.5: A/B Test RAG Relevance Filtering (OPT-004)
def retrieve_with_relevance_filter(
    query: str,
    variant: str
) -> tuple[list, int]:
    if variant == "control":
        # Current behavior: top-8 chunks, no relevance filter
        chunks = vector_store.similarity_search(query, k=8)
    else:
        # Treatment: over-fetch, then keep only chunks above the similarity threshold.
        # NOTE: this assumes the store returns cosine similarity (higher = better);
        # some backends return a distance (lower = better), which inverts the test.
        results = vector_store.similarity_search_with_score(query, k=10)
        chunks = [doc for doc, score in results if score >= 0.65][:5]  # max 5 chunks after filtering
    return chunks, count_tokens(" ".join(c.page_content for c in chunks))
A/B Test Results for RAG filtering:
| Metric | Control (k=8, no filter) | Treatment (k=5, threshold=0.65) | Change |
|---|---|---|---|
| Avg RAG context tokens | 2,890 | 1,720 | -40.5% |
| Avg total tokens | 8,969 | 7,799 | -13.1% |
| Quality (LLM judge) | 4.05 | 4.01 | -0.04 |
| Retrieval recall (key issues found) | 91% | 88% | -3% |
Decision: SHIP with monitoring. The 3% reduction in retrieval recall is a concern. Monitor engineer satisfaction scores for 30 days post-ship. If satisfaction drops, investigate and potentially tune the threshold to 0.60.
Step 3.6: A/B Test Model Routing for Small PRs (OPT-006)
def route_model(pr_diff: str, pr_size_category: str, variant: str) -> str:
if variant == "treatment" and pr_size_category == "small":
return "gpt-4o-mini" # 10x cheaper for small PRs
return "gpt-4o" # Default for all sizes in control, and large/medium in treatment
# Per-model eval figures for small PRs:
eval_results = {
    "gpt-4o":      {"avg_quality": 4.08, "p5_quality": 3.60, "avg_tokens": 7800, "avg_cost": 0.0195},
    "gpt-4o-mini": {"avg_quality": 3.81, "p5_quality": 3.10, "avg_tokens": 6200, "avg_cost": 0.0022},
}
A/B Test Results for model routing:
| Metric | Control (GPT-4o) | Treatment (mini for small PRs) | Change |
|---|---|---|---|
| Avg quality (small PRs) | 4.08 | 3.81 | -0.27 |
| P5 quality (small PRs) | 3.60 | 3.10 | -0.50 |
| Avg cost (small PRs) | $0.0195 | $0.0022 | -88.7% |
Decision: REJECT for blanket routing. P5 quality of 3.10 is below the 3.80 floor. However, test a tiered approach: route only "trivial" small PRs (documentation changes, config updates, single-line fixes) to GPT-4o-mini, gated by a cheap (~$0.001) PR classification step.
def classify_pr_complexity(pr_diff: str) -> str:
"""Quick classification using GPT-4o-mini to route PR to appropriate model."""
response = gpt4o_mini_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Classify this PR diff as 'trivial' (docs, config, single-line) "
f"or 'substantive' (logic changes, new features, bug fixes). "
f"Respond with only the word.\n\n{pr_diff[:1000]}"
}],
max_tokens=5
)
return response.choices[0].message.content.strip().lower()
def smart_model_router(pr_diff: str, pr_size_category: str) -> str:
if pr_size_category == "small":
complexity = classify_pr_complexity(pr_diff) # Costs ~$0.001
if complexity == "trivial":
return "gpt-4o-mini" # Save $0.018 per trivial small PR
return "gpt-4o"
Re-test result: Trivial PRs (12% of total volume) reviewed by GPT-4o-mini achieve 4.01 average quality — above the 3.80 floor. Ship.
Phase 4: Measuring the Cumulative Impact
After all optimizations are shipped (staggered over 6 weeks), run the final comparison:
Cumulative Results Dashboard
def generate_optimization_results_report(
baseline_df: pd.DataFrame,
post_optimization_df: pd.DataFrame
) -> dict:
    # Both frames are assumed to cover 14-day collection windows at ~400 reviews/day
    baseline = {
        "avg_total_tokens": baseline_df["total_tokens"].mean(),
        "avg_cost_usd": baseline_df["cost_usd"].mean(),
        "daily_volume": len(baseline_df) / 14,
        "monthly_projected_cost": baseline_df["cost_usd"].mean() * 400 * 30
    }
    post_opt = {
        "avg_total_tokens": post_optimization_df["total_tokens"].mean(),
        "avg_cost_usd": post_optimization_df["cost_usd"].mean(),
        "daily_volume": len(post_optimization_df) / 14,
        "monthly_projected_cost": post_optimization_df["cost_usd"].mean() * 400 * 30
    }
    return {
        "baseline": baseline,
        "post_optimization": post_opt,
        "deltas": {
            # Signed so that positive values mean a reduction
            "token_reduction_pct": (baseline["avg_total_tokens"] - post_opt["avg_total_tokens"]) / baseline["avg_total_tokens"],
            "cost_reduction_pct": (baseline["avg_cost_usd"] - post_opt["avg_cost_usd"]) / baseline["avg_cost_usd"],
            "monthly_savings": baseline["monthly_projected_cost"] - post_opt["monthly_projected_cost"]
        }
    }
Final Results:
| Metric | Baseline | Post-Optimization | Change |
|---|---|---|---|
| Avg system prompt tokens | 1,847 | 714 | -61.3% |
| Avg RAG context tokens | 3,420 | 1,720 | -49.7% |
| Avg history tokens | 418 | 22 | -94.7% |
| Avg output tokens | 1,840 | 680 | -63.0% |
| Avg total tokens | 11,415 | 4,812 | -57.8% |
| Avg cost per review | $0.38 | $0.092 | -75.8% |
| Monthly cost | $4,560 | $1,104 | -$3,456 |
| Quality score | 4.10 | 3.98 | -0.12 |
Target achieved: Cost of $0.092 is well below the $0.15 target. Quality score of 3.98 is above the 3.80 floor.
Token breakdown of optimizations:
| Optimization | Token Savings | % of Total Savings |
|---|---|---|
| OPT-001: History bug fix | 396/review | 6.0% |
| OPT-003: Prompt caching | ~1,570 tokens/review billed at the cached rate (cost effect, not a token reduction) | — |
| OPT-002: System prompt compression | 1,133/review | 17.2% |
| OPT-005: Output format constraints | 1,160/review | 17.6% |
| OPT-004: RAG relevance filtering | 1,700/review | 25.7% |
| OPT-006: Trivial PR routing | ~$0.018 saved per trivial PR (cost effect) | — |
| Total measured reduction | 6,603/review | 57.8% of baseline |
Tip: When presenting optimization results to stakeholders, always show both the token reduction and the dollar impact. Engineering leaders respond to token percentages; business stakeholders respond to dollars. "$3,456 per month saved from a 6-week optimization effort" is more compelling than "57.9% token reduction" for executive reporting — even though they represent the same achievement.
Phase 5: Establishing Ongoing Monitoring
The audit is complete but the work is not. A token optimization without ongoing monitoring will regress. Set up the monitoring infrastructure that ensures these gains are preserved.
Monitoring Configuration
ALERT_RULES = [
{
"name": "cost_per_review_regression",
"condition": "avg_cost_usd_24h > 0.15",
"severity": "critical",
"message": "PR review cost exceeded $0.15 threshold (target: $0.092)",
"action": "page_oncall"
},
{
"name": "system_prompt_token_spike",
"condition": "avg_system_prompt_tokens_24h > 900",
"severity": "warning",
"message": "System prompt tokens exceeded 900 (baseline: 714). Check for prompt changes.",
"action": "slack_alert"
},
{
"name": "rag_context_regression",
"condition": "avg_rag_context_tokens_24h > 2500",
"severity": "warning",
"message": "RAG context tokens up significantly. Check relevance filtering.",
"action": "slack_alert"
},
{
"name": "quality_floor_breach",
"condition": "avg_quality_score_24h < 3.80",
"severity": "critical",
"message": "Review quality dropped below 3.80 floor. Investigate immediately.",
"action": "page_oncall"
},
{
"name": "conversation_history_leak",
"condition": "avg_conversation_history_tokens_24h > 100",
"severity": "warning",
"message": "Conversation history tokens rising. Session bug may have recurred.",
"action": "slack_alert"
}
]
def check_all_alerts():
stats = get_24h_stats("pr_review_pipeline")
triggered = []
for rule in ALERT_RULES:
if evaluate_condition(rule["condition"], stats):
triggered.append(rule)
fire_alert(rule)
return triggered
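evaluate_condition can stay trivial as long as rules keep the metric-operator-threshold shape used above. A minimal sketch, assuming stats is a dict of 24-hour aggregates keyed by metric name:

import operator

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def evaluate_condition(condition: str, stats: dict) -> bool:
    """Evaluate rules like 'avg_cost_usd_24h > 0.15' against aggregate stats."""
    metric, op_symbol, threshold = condition.split()
    value = stats.get(metric)
    return value is not None and _OPS[op_symbol](value, float(threshold))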
Monthly Benchmark Review
## Monthly Token Optimization Review — PR Review Pipeline
### Benchmark Targets (set post-optimization)
| Metric | Green | Yellow | Red |
|--------|-------|--------|-----|
| Avg cost per review | ≤ $0.12 | $0.12–$0.15 | > $0.15 |
| Avg total tokens | ≤ 5,500 | 5,500–7,000 | > 7,000 |
| System prompt tokens | ≤ 800 | 800–1,000 | > 1,000 |
| RAG context tokens | ≤ 2,000 | 2,000–2,500 | > 2,500 |
| Quality score | ≥ 4.00 | 3.80–4.00 | < 3.80 |
### Review Process (Monthly)
1. Pull 30-day stats from analytics store
2. Compare against benchmark targets
3. Identify any metrics in Yellow or Red
4. For each Yellow/Red metric: create hypothesis ticket
5. Review experiment backlog: what is ready to ship?
6. Update the "What Works" registry with any new learnings
Phase 6: Capstone Retrospective and Reusable Artifacts
Lessons Learned
Finding 1: Bugs beat optimizations. The conversation history bug (OPT-001) was the fastest fix and among the most impactful. Always run a "correctness audit" before starting optimization work. Look for tokens being consumed by behavior that is simply wrong.
Finding 2: Organic prompt growth is the most common problem. The system prompt grew from 650 to 1,847 tokens over 8 months with no single large change — just accumulation. This is the most common cause of token cost inflation in production systems. A periodic "prompt hygiene review" would have caught this at 900 tokens rather than 1,847.
Finding 3: Output constraints outperformed input constraints. Reducing output tokens from 1,840 to 680 (a 63% reduction) had a larger cost impact than compressing the system prompt (a 61% reduction from a nearly identical starting count), because output tokens are priced 4-5x higher than input tokens.
Finding 4: Model routing needs a pre-classifier. Naively routing small PRs to a cheaper model failed the quality floor. A lightweight pre-classification step (~$0.001 per PR) unlocked the routing savings by distinguishing trivial from substantive small PRs.
Finding 5: A/B testing duration matters. The system prompt A/B test would have been inconclusive after 3 days. It took 7 days to reach statistical significance. Never call a test early.
Reusable Audit Template
Use this process for any agentic workflow:
## Token Optimization Audit Template
### 1. Instrumentation (Week 1)
- [ ] Add structured telemetry to every LLM call
- [ ] Log token breakdown by component (system prompt, context, history, diff/input)
- [ ] Log output tokens separately
- [ ] Tag all calls with agent, workflow, task type, input complexity
### 2. Baseline Collection (Weeks 1-2)
- [ ] Collect minimum 500 representative runs
- [ ] Compute mean, median, P75, P95 for each metric
- [ ] Break down token consumption by component
- [ ] Identify the top 3 cost drivers
### 3. Opportunity Identification (Week 2)
- [ ] Check for correctness bugs (session leaks, duplicate context, stale history)
- [ ] Check for prompt caching eligibility (static content > 200 tokens)
- [ ] Measure system prompt token count vs. original design intent
- [ ] Evaluate RAG retrieval relevance (are all chunks used? quality threshold?)
- [ ] Measure output verbosity (output tokens vs. task requirements)
- [ ] Evaluate model routing opportunities (complexity distribution)
### 4. Prioritization
- [ ] Score each opportunity: impact (estimated savings) × 1/risk
- [ ] Ship no-risk items immediately (bugs, caching)
- [ ] Design A/B tests for medium-risk items
- [ ] Defer or reject high-risk items unless savings are critical
### 5. Execution (Weeks 3-6)
- [ ] Ship no-risk items with 48h monitoring
- [ ] Run A/B tests sequentially (avoid simultaneous experiments)
- [ ] Log all results in experiment tracker
- [ ] Validate each shipped change against quality floor
### 6. Monitoring Setup
- [ ] Configure alerts for cost regression, component token spikes, quality floor breach
- [ ] Set monthly benchmark review on team calendar
- [ ] Add prompt linting to CI pipeline
- [ ] Schedule quarterly "prompt hygiene review"
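The prompt-linting item can start as a single token-budget test in CI. A sketch assuming prompts live as files under prompts/ with budgets set just above the post-optimization baselines (the file names and budgets here are hypothetical):

import pathlib
import tiktoken

# Hypothetical file names and budgets — tune to your own post-optimization baselines.
TOKEN_BUDGETS = {"system_prompt.txt": 800, "output_format.txt": 300}
_enc = tiktoken.get_encoding("o200k_base")

def test_prompts_within_token_budget():
    for filename, budget in TOKEN_BUDGETS.items():
        text = (pathlib.Path("prompts") / filename).read_text()
        tokens = len(_enc.encode(text))
        assert tokens <= budget, f"{filename}: {tokens} tokens exceeds budget {budget}"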
Tip: Run this audit every six months on your most expensive workflows. Systems change organically — new requirements, new context sources, prompt additions from well-meaning developers. A six-month audit cadence catches drift before it becomes a crisis, and the cumulative savings from two audits per year typically justify the investment in the monitoring infrastructure that makes them fast to run.
Summary
This capstone demonstrated a complete token optimization audit workflow on a realistic production system:
- Baseline instrumentation revealed that roughly 58% of token consumption was avoidable, spread across five distinct components
- Bug-first investigation found a conversation history leak that was the fastest and most certain win
- Prioritized A/B testing validated system prompt compression, output format constraints, and RAG filtering — all statistically significant, all within quality bounds
- Model routing required an additional pre-classification layer to maintain quality, demonstrating that optimization hypotheses sometimes require iteration
- Cumulative result: $0.38 → $0.092 per review (75.8% reduction), $3,456/month savings, quality maintained at 3.98 vs. 3.80 floor
- Ongoing monitoring with specific alert thresholds ensures gains are preserved through code and prompt changes
The techniques in this capstone — instrumentation, baseline analysis, opportunity mapping, prioritized A/B testing, and monitoring — apply to any agentic workflow: QA automation, sprint planning assistants, documentation generation, incident response agents, and beyond. The numbers will differ; the methodology will not.
Token optimization is not a project with an end date. It is a discipline, a feedback loop, and a team practice that compounds over time. The teams that treat it that way will build AI-powered systems that are sustainably economical — not just at launch, but at scale.