
This capstone brings together every technique from the course into a single, end-to-end token optimization audit. You will work through a realistic scenario: a mid-sized software engineering team running an AI-assisted PR review pipeline that has grown organically over eight months and now costs significantly more than it should. You will audit it, diagnose the problems, implement optimizations, validate results, and establish ongoing monitoring — exactly as you would in production.

This walkthrough is designed to be followed hands-on. Every section includes executable code, concrete decisions, and realistic trade-offs.


The Scenario: The PR Review Pipeline

Context: You are the platform engineering lead at a 60-person software company. Eight months ago, your team shipped an AI-assisted PR review system. Engineers submit PRs and the system automatically generates a code review covering security, performance, logic, and style.

Current state of the pipeline:
- 400 PRs reviewed per day across 4 engineering squads
- Average cost: $0.38 per PR review
- Monthly cost: approximately $4,560
- Initial target: $0.12 per PR review (established when the system was first specced)
- Quality score: 4.1/5.0 (measured by quarterly engineer surveys)

Your mandate: Reduce cost to ≤ $0.15 per PR review without dropping quality below 3.8/5.0.

The pipeline architecture:

PR Submitted → Context Builder → System Prompt Builder → LLM (GPT-4o) → Review Formatter → Engineer
                    ↑                       ↑                  ↑
              [RAG: codebase     [Static guidelines +    [Multi-turn for
               conventions]       team-specific rules]    clarification]

Phase 1: Establishing the Baseline

Before touching anything, you need a complete picture of current token consumption. This is non-negotiable — optimization without a baseline is navigation without a map.

Step 1.1: Deploy Instrumentation

The pipeline was built eight months ago and has minimal observability. Your first task is to add structured telemetry without changing any prompts or logic.

import time
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class PRReviewTrace:
    trace_id: str
    timestamp_utc: str
    pr_id: str
    squad: str
    pr_size_category: str  # small (<50 lines), medium (50-300), large (>300)

    # Context breakdown
    system_prompt_tokens: int
    rag_context_tokens: int
    pr_diff_tokens: int
    conversation_history_tokens: int
    total_input_tokens: int

    # Output
    output_tokens: int
    total_tokens: int

    # Turn tracking
    turn_count: int

    # Cost
    cost_usd: float
    model: str

    # Performance
    latency_ms: int

    # Downstream
    quality_score: float | None = None  # Populated later from survey data

def wrap_pr_review_with_tracing(pr_id: str, squad: str, pr_diff: str):
    trace_id = str(uuid.uuid4())
    start_time = time.time()

    # Build context (existing code, just instrument it)
    system_prompt = build_system_prompt()
    rag_context = retrieve_codebase_context(pr_diff)
    conversation_history = get_conversation_history(pr_id)

    # Count tokens before LLM call
    sp_tokens = count_tokens(system_prompt)
    rag_tokens = count_tokens(rag_context)
    diff_tokens = count_tokens(pr_diff)
    history_tokens = count_tokens(conversation_history)

    # Make LLM call (existing code)
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(system_prompt, rag_context, pr_diff, conversation_history)
    )

    latency_ms = int((time.time() - start_time) * 1000)
    cost = calculate_cost(response.usage, "gpt-4o")

    # Write trace
    trace = PRReviewTrace(
        trace_id=trace_id,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        pr_id=pr_id,
        squad=squad,
        pr_size_category=categorize_pr_size(pr_diff),
        system_prompt_tokens=sp_tokens,
        rag_context_tokens=rag_tokens,
        pr_diff_tokens=diff_tokens,
        conversation_history_tokens=history_tokens,
        total_input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        turn_count=count_turns(conversation_history),
        cost_usd=cost,
        model="gpt-4o",
        latency_ms=latency_ms
    )

    write_to_analytics_store(asdict(trace))
    return response.choices[0].message.content

Run this instrumentation for two weeks without any changes. You need a clean baseline.

Step 1.2: Analyze the Baseline Data

After two weeks, run the baseline analysis:

import pandas as pd
import numpy as np

def generate_baseline_report(df: pd.DataFrame) -> dict:
    """
    Comprehensive baseline report for the PR review pipeline.
    """
    # Overall statistics
    overall = {
        "total_reviews": len(df),
        "daily_average": len(df) / 14,
        "avg_total_tokens": df["total_tokens"].mean(),
        "median_total_tokens": df["total_tokens"].median(),
        "p95_total_tokens": df["total_tokens"].quantile(0.95),
        "avg_cost_usd": df["cost_usd"].mean(),
        "total_cost_two_weeks": df["cost_usd"].sum(),
        "monthly_projected_cost": df["cost_usd"].sum() * 2.17  # ~30.4 days / 14-day window
    }

    # Token breakdown
    token_breakdown = {
        "system_prompt": {
            "avg": df["system_prompt_tokens"].mean(),
            "pct_of_input": df["system_prompt_tokens"].mean() / df["total_input_tokens"].mean()
        },
        "rag_context": {
            "avg": df["rag_context_tokens"].mean(),
            "pct_of_input": df["rag_context_tokens"].mean() / df["total_input_tokens"].mean()
        },
        "pr_diff": {
            "avg": df["pr_diff_tokens"].mean(),
            "pct_of_input": df["pr_diff_tokens"].mean() / df["total_input_tokens"].mean()
        },
        "conversation_history": {
            "avg": df["conversation_history_tokens"].mean(),
            "pct_of_input": df["conversation_history_tokens"].mean() / df["total_input_tokens"].mean()
        }
    }

    # By PR size
    by_size = df.groupby("pr_size_category").agg({
        "total_tokens": ["mean", "count"],
        "cost_usd": "mean",
        "turn_count": "mean"
    }).round(2)

    # By squad
    by_squad = df.groupby("squad").agg({
        "total_tokens": "mean",
        "cost_usd": ["mean", "sum"],
        "turn_count": "mean"
    }).round(2)

    return {
        "overall": overall,
        "token_breakdown": token_breakdown,
        "by_size": by_size.to_dict(),
        "by_squad": by_squad.to_dict()
    }

Baseline findings (realistic for this scenario):

| Component | Average Tokens | % of Total Input |
|-----------|---------------:|-----------------:|
| System prompt | 1,847 | 19.3% |
| RAG context | 3,420 | 35.7% |
| PR diff | 3,890 | 40.6% |
| Conversation history | 418 | 4.4% |
| Total input | 9,575 | 100% |
| Output | 1,840 | |
| Grand total | 11,415 | |

Key observations from the baseline:

  1. The system prompt is 1,847 tokens — nearly 3x the expected size (the original design called for ~650 tokens)
  2. RAG context is 3,420 tokens, pulling from 8 chunks of 500 tokens each — the pipeline never had a retrieval quality filter
  3. Conversation history is present even on first-turn reviews (a bug: stale sessions are not being cleared)
  4. The output is averaging 1,840 tokens — the review format is not constraining response length
  5. Multi-turn usage: 23% of reviews involve 2+ turns; these average 18,500 total tokens vs. 10,100 for single-turn

Tip: Always break down token consumption by component before hypothesizing fixes. In this scenario, a developer who "just looks at the system prompt" would miss the RAG inefficiency and the conversation history bug — together worth more savings than the system prompt compression.


Phase 2: Identifying and Prioritizing Optimization Opportunities

Step 2.1: Build the Opportunity Map

From the baseline analysis, we can quantify the opportunity for each component:

OPTIMIZATION_OPPORTUNITIES = [
    {
        "id": "OPT-001",
        "name": "Fix conversation history bug",
        "component": "conversation_history",
        "type": "bug_fix",
        "description": "Stale session history is being injected even on first-turn reviews",
        "baseline_avg_tokens": 418,
        "expected_post_fix_tokens": 0,  # First-turn reviews should have zero history
        "affected_pct": 0.77,  # 77% of reviews are single-turn
        "implementation_effort": "low",
        "quality_risk": "none",
        "weekly_token_savings": 418 * 0.77 * 400 * 7,  # ~900K tokens/week
        "weekly_cost_savings": None  # Calculate below
    },
    {
        "id": "OPT-002", 
        "name": "Compress system prompt",
        "component": "system_prompt",
        "type": "optimization",
        "description": "System prompt grew from 650 to 1,847 tokens over 8 months via feature additions. "
                      "Restructuring as compact numbered lists should reach ~700 tokens.",
        "baseline_avg_tokens": 1847,
        "expected_post_opt_tokens": 700,
        "implementation_effort": "medium",
        "quality_risk": "low",
        "ab_test_required": True
    },
    {
        "id": "OPT-003",
        "name": "Enable prompt caching for system prompt",
        "component": "system_prompt",
        "type": "infrastructure",
        "description": "System prompt is static. Enabling Anthropic/OpenAI prompt caching "
                      "would reduce effective cost of system prompt tokens by 90%.",
        "implementation_effort": "low",
        "quality_risk": "none",
        "depends_on": "OPT-002"  # More valuable after compression
    },
    {
        "id": "OPT-004",
        "name": "Add RAG relevance filtering",
        "component": "rag_context",
        "type": "optimization",
        "description": "Currently retrieves top-8 chunks with no relevance threshold. "
                      "Filtering to cosine similarity > 0.65 and top-5 would cut RAG tokens by ~40%.",
        "baseline_avg_tokens": 3420,
        "expected_post_opt_tokens": 2050,
        "implementation_effort": "medium",
        "quality_risk": "medium",
        "ab_test_required": True
    },
    {
        "id": "OPT-005",
        "name": "Add output format constraints",
        "component": "output",
        "type": "optimization",
        "description": "Reviews are currently unstructured prose. Adding explicit format "
                      "(JSON schema or compact template) and word limits would reduce output tokens.",
        "baseline_avg_tokens": 1840,
        "expected_post_opt_tokens": 950,
        "implementation_effort": "low",
        "quality_risk": "medium",
        "ab_test_required": True
    },
    {
        "id": "OPT-006",
        "name": "Route small PRs to GPT-4o-mini",
        "component": "model",
        "type": "model_routing",
        "description": "Small PRs (<50 lines, ~30% of volume) may be adequately reviewed "
                      "by GPT-4o-mini at 10x lower cost.",
        "affected_pct": 0.30,
        "implementation_effort": "medium",
        "quality_risk": "medium",
        "ab_test_required": True
    }
]

Step 2.2: Prioritize by Impact × Risk

| ID | Opportunity | Weekly Savings (est.) | Risk | Priority |
|----|-------------|----------------------:|------|----------|
| OPT-001 | Fix history bug | $140 | None | P0 — Ship immediately |
| OPT-003 | Enable prompt caching | $90 | None | P0 — Ship immediately |
| OPT-002 | Compress system prompt | $170 | Low | P1 — A/B test first |
| OPT-005 | Output format constraints | $180 | Medium | P1 — A/B test first |
| OPT-004 | RAG relevance filtering | $110 | Medium | P2 — A/B test first |
| OPT-006 | Route small PRs to mini model | $105 | Medium | P2 — A/B test first |

Total estimated weekly savings if all shipped: $795/week (about $3,450/month). At 2,800 reviews per week, that is roughly $0.28 per review.

This would bring cost per review from $0.38 to approximately $0.095 — well under the $0.15 target.

Tip: Always ship no-risk optimizations (bug fixes, prompt caching) first, before running A/B tests. They establish a new, lower baseline, so your A/B tests start from an already-improved state — and the same absolute savings from a subsequent optimization represents a larger relative effect against that smaller baseline, which makes it easier to detect.
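One way to turn the impact and risk columns into a sortable score is to discount estimated savings by risk and effort. The weights below are illustrative assumptions, not values from the audit:

```python
# Illustrative weights (assumptions): higher risk and effort discount raw savings.
RISK_WEIGHT = {"none": 1.0, "low": 0.8, "medium": 0.5, "high": 0.2}
EFFORT_WEIGHT = {"low": 1.0, "medium": 0.7, "high": 0.4}

def priority_score(weekly_savings_usd: float, quality_risk: str, effort: str) -> float:
    """Risk- and effort-discounted weekly savings; sort descending to prioritize."""
    return weekly_savings_usd * RISK_WEIGHT[quality_risk] * EFFORT_WEIGHT[effort]
```

Sorting the opportunity list by this score reproduces the P0/P1/P2 ordering above: no-risk, low-effort items float to the top even when their raw savings are smaller.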


Phase 3: Implementing the Optimizations

Step 3.1: Ship OPT-001 — Fix the History Bug

Investigation reveals that the session management code was reusing session objects across PR submissions:

# Before (buggy): a singleton shared across every PR
class PRReviewSession:
    _instance = None

    def __init__(self):
        self.history = []

    @classmethod
    def get_session(cls, pr_id: str):
        if cls._instance is None:
            cls._instance = PRReviewSession()
        return cls._instance  # BUG: same instance (and history) for all PRs!

# After (fixed): one session per PR, cleared when the review closes
class PRReviewSession:
    _sessions = {}

    def __init__(self):
        self.history = []

    @classmethod
    def get_session(cls, pr_id: str):
        if pr_id not in cls._sessions:
            cls._sessions[pr_id] = PRReviewSession()
        return cls._sessions[pr_id]

    @classmethod
    def clear_session(cls, pr_id: str):
        cls._sessions.pop(pr_id, None)
Deploy, monitor for 48 hours, confirm conversation history tokens drop to ~0 for first-turn reviews.

Measured result after 48 hours: Average conversation history tokens: 22 (near-zero, down from 418). Cost per review: $0.32 (down from $0.38).

Step 3.2: Enable Prompt Caching (OPT-003)

def build_messages_with_caching(system_prompt: str, user_content: str) -> list:
    # OpenAI applies prompt caching automatically to long static prefixes
    # (currently >= 1,024 tokens), so keeping the system prompt byte-identical
    # across calls is what produces cache hits.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]

def build_anthropic_messages_with_caching(system_prompt: str, user_content: str) -> dict:
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}  # Enable prompt caching
            }
        ],
        "messages": [
            {"role": "user", "content": user_content}
        ]
    }

Measured result after 72 hours: System prompt cache hit rate: 94%. Effective cost of system prompt tokens reduced by ~85%. Cost per review: $0.27 (down from $0.32).

Step 3.3: Run A/B Test for System Prompt Compression (OPT-002)

Before (1,847 tokens): The current system prompt includes role preamble, extensive context-setting paragraphs, detailed review guidelines in prose form, a security checklist in paragraph format, output format description, and closing caveats.

After (714 tokens): Compressed version removes the preamble, converts prose to numbered lists, and uses a compact security checklist.

The original 1,847-token prompt opened like this:

You are an expert senior software engineer with deep experience in code review. 
Your role is to provide comprehensive, actionable feedback on code changes submitted 
as pull requests. You have extensive knowledge of software security, performance 
optimization, and code quality best practices. When reviewing code, you approach 
it as a mentor who wants to help the team improve their craft while ensuring the 
codebase remains maintainable and secure.

When conducting your review, please carefully consider the following aspects:

**Correctness and Logic**: Examine the code for logical errors, off-by-one errors, 
null pointer dereferences, race conditions, and other correctness issues. Consider 
edge cases that the developer may have missed...

[continues for 1,847 tokens total]

---

The compressed replacement:

Senior software engineer. Review PR diffs for:

1. **Correctness**: Logic errors, edge cases, null handling, race conditions
2. **Security**: Injection (SQL/XSS/cmd), auth/authz flaws, secrets exposure, 
   insecure deps, input validation
3. **Performance**: N+1 queries, missing indexes, inefficient algorithms, 
   memory leaks, blocking I/O
4. **Maintainability**: Naming clarity, function size, duplication, test coverage
5. **Style**: Consistency with existing patterns

**Output format** (strict):
- **Summary** (2-3 sentences): Overall assessment
- **Critical** (block merge): Numbered list. Each item: file:line — issue — fix
- **Suggestions** (non-blocking): Numbered list. Same format
- **Positives** (1-3 items): What was done well

Max 400 words total. Be specific. Reference exact line numbers.

A/B Test Setup:
- Traffic split: 50/50
- Run duration: 10 days
- Required samples: 500 per variant (calculated via power analysis)
- Primary metric: total tokens per review
- Guard metric: quality score from LLM-as-judge (using GPT-4o-mini as evaluator)

def run_system_prompt_ab_test(pr_id: str, pr_diff: str, squad: str) -> dict:
    variant = assign_variant(pr_id, "system_prompt_compression_v1")

    if variant == "control":
        system_prompt = load_prompt("control_system_prompt_v1847")
    else:
        system_prompt = load_prompt("treatment_system_prompt_v714")

    response = call_gpt4o(system_prompt, pr_diff)

    # LLM-as-judge quality evaluation. Despite the _async suffix, the score is
    # collected here for logging; in production, queue the evaluation and join
    # it to the experiment log later so the review response is never blocked.
    quality_score = evaluate_review_quality_async(
        review=response.content,
        pr_diff=pr_diff,
        criteria=["specificity", "actionability", "coverage", "correctness"]
    )

    log_experiment_result(
        experiment="system_prompt_compression_v1",
        variant=variant,
        pr_id=pr_id,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        quality_score=quality_score
    )

    return {"review": response.content, "variant": variant}
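The `assign_variant` call above is assumed. A common deterministic implementation hashes the unit ID together with the experiment name, so the same PR always lands in the same bucket with no assignment state to store (a sketch):

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic bucketing: the same (experiment, unit) pair always maps
    to the same variant."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Salting the hash with the experiment name matters: it decorrelates bucket assignments across experiments, so a PR that was "treatment" in one test is not systematically "treatment" in the next.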

A/B Test Results (after 10 days, 1,247 reviews per variant):

| Metric | Control | Treatment | Change |
|--------|--------:|----------:|-------:|
| Avg system prompt tokens | 1,847 | 714 | -61.3% |
| Avg total input tokens | 8,982 | 7,849 | -12.6% |
| Avg output tokens | 1,840 | 1,120 | -39.1% |
| Avg total tokens | 10,822 | 8,969 | -17.1% |
| Avg quality score | 4.10 | 4.05 | -0.05 |
| P-value (Mann-Whitney U) | | | 0.0003 |

Decision: SHIP treatment. Token reduction is statistically significant and meaningful (-17.1%). Quality delta is -0.05, well within the acceptable range (floor is 3.8; treatment is 4.05).

Step 3.4: A/B Test Output Format Constraints (OPT-005)

The treatment system prompt already includes "Max 400 words total" but this alone may not be sufficient. Test an explicit output schema:

OUTPUT_FORMAT_TREATMENT = """
Respond ONLY with valid JSON in this exact structure. No prose outside the JSON.

{
  "summary": "string (max 50 words)",
  "critical_issues": [
    {"file": "string", "line": "integer or range", "issue": "string", "fix": "string"}
  ],
  "suggestions": [
    {"file": "string", "line": "integer or range", "issue": "string", "fix": "string"}
  ],
  "positives": ["string", "string"]
}
"""

A/B Test Results for output format:

| Metric | Prose format | JSON format | Change |
|--------|-------------:|------------:|-------:|
| Avg output tokens | 1,120 | 680 | -39.3% |
| Quality (LLM judge) | 4.05 | 4.08 | +0.03 |
| Parsing failure rate | 0% | 1.2% | +1.2% |

Decision: SHIP JSON format. Output token reduction is substantial. Small quality improvement. Parsing failures (1.2%) are acceptable and can be handled with a fallback parser.
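The fallback parser mentioned in the decision can be as simple as extracting the first JSON object from the raw output; the empty-review fallback shape at the end is an assumption, matching the schema above:

```python
import json
import re

def parse_review(raw: str) -> dict:
    """Parse a review as JSON; fall back to extracting the first {...} span,
    then to wrapping the raw text, so downstream code never crashes."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # handles prose/fenced wrappers
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"summary": raw.strip()[:400], "critical_issues": [],
            "suggestions": [], "positives": []}
```

In practice most of the 1.2% failures are the model wrapping valid JSON in a markdown fence or a sentence of preamble, which the regex fallback recovers.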

Step 3.5: A/B Test RAG Relevance Filtering (OPT-004)

def retrieve_with_relevance_filter(
    query: str,
    variant: str
) -> tuple[list, int]:  # (chunks, total context tokens)

    if variant == "control":
        # Current: top-8, no relevance filter
        chunks = vector_store.similarity_search(query, k=8)
    else:
        # Treatment: retrieve top-10 with scores, keep those above the relevance
        # threshold, cap at 5. Assumes the score is a similarity (higher is
        # better); some vector stores return a distance, where lower is better.
        results = vector_store.similarity_search_with_score(query, k=10)
        chunks = [doc for doc, score in results if score >= 0.65][:5]

    return chunks, count_tokens(" ".join(c.page_content for c in chunks))

A/B Test Results for RAG filtering:

| Metric | Control (k=8, no filter) | Treatment (k=5, threshold=0.65) | Change |
|--------|-------------------------:|--------------------------------:|-------:|
| Avg RAG context tokens | 2,890 | 1,720 | -40.5% |
| Avg total tokens | 8,969 | 7,799 | -13.1% |
| Quality (LLM judge) | 4.05 | 4.01 | -0.04 |
| Retrieval recall (key issues found) | 91% | 88% | -3% |

Decision: SHIP with monitoring. The 3% reduction in retrieval recall is a concern. Monitor engineer satisfaction scores for 30 days post-ship. If satisfaction drops, investigate and potentially tune the threshold to 0.60.

Step 3.6: A/B Test Model Routing for Small PRs (OPT-006)

def route_model(pr_diff: str, pr_size_category: str, variant: str) -> str:
    if variant == "treatment" and pr_size_category == "small":
        return "gpt-4o-mini"  # 10x cheaper for small PRs
    return "gpt-4o"  # Default for all sizes in control, and large/medium in treatment

# Offline evaluation on held-out small PRs (per-review averages):
eval_results = {
    "gpt-4o": {"avg_quality": 4.08, "p5_quality": 3.60, "avg_tokens": 7800, "avg_cost": 0.0195},
    "gpt-4o-mini": {"avg_quality": 3.81, "p5_quality": 3.10, "avg_tokens": 6200, "avg_cost": 0.0022}
}

A/B Test Results for model routing:

| Metric | Control (GPT-4o) | Treatment (mini for small PRs) | Change |
|--------|-----------------:|-------------------------------:|-------:|
| Avg quality (small PRs) | 4.08 | 3.81 | -0.27 |
| P5 quality (small PRs) | 3.60 | 3.10 | -0.50 |
| Avg cost (small PRs) | $0.0195 | $0.0022 | -88.7% |

Decision: REJECT for general routing. P5 quality of 3.10 is below the 3.80 floor. However, test a tiered approach: route only "trivial" small PRs (documentation changes, config updates, single-line fixes) to GPT-4o-mini, with a PR classification step that costs ~$0.001 to route.

def classify_pr_complexity(pr_diff: str) -> str:
    """Quick classification using GPT-4o-mini to route PR to appropriate model."""
    response = gpt4o_mini_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify this PR diff as 'trivial' (docs, config, single-line) "
                      f"or 'substantive' (logic changes, new features, bug fixes). "
                      f"Respond with only the word.\n\n{pr_diff[:1000]}"
        }],
        max_tokens=5
    )
    return response.choices[0].message.content.strip().lower()

def smart_model_router(pr_diff: str, pr_size_category: str) -> str:
    if pr_size_category == "small":
        complexity = classify_pr_complexity(pr_diff)  # Costs ~$0.001
        if complexity == "trivial":
            return "gpt-4o-mini"  # Save $0.018 per trivial small PR
    return "gpt-4o"

Re-test result: Trivial PRs (12% of total volume) reviewed by GPT-4o-mini achieve 4.01 average quality — above the 3.80 floor. Ship.


Phase 4: Measuring the Cumulative Impact

After all optimizations are shipped (staggered over 6 weeks), run the final comparison:

Cumulative Results Dashboard

def generate_optimization_results_report(
    baseline_df: pd.DataFrame,
    post_optimization_df: pd.DataFrame
) -> dict:

    baseline = {
        "avg_total_tokens": baseline_df["total_tokens"].mean(),
        "avg_cost_usd": baseline_df["cost_usd"].mean(),
        "daily_volume": len(baseline_df) / 14,
        "monthly_projected_cost": baseline_df["cost_usd"].mean() * 400 * 30
    }

    post_opt = {
        "avg_total_tokens": post_optimization_df["total_tokens"].mean(),
        "avg_cost_usd": post_optimization_df["cost_usd"].mean(),
        "daily_volume": len(post_optimization_df) / 14,
        "monthly_projected_cost": post_optimization_df["cost_usd"].mean() * 400 * 30
    }

    return {
        "baseline": baseline,
        "post_optimization": post_opt,
        "deltas": {
            # Positive values mean a reduction, matching the field names
            "token_reduction_pct": (baseline["avg_total_tokens"] - post_opt["avg_total_tokens"]) / baseline["avg_total_tokens"],
            "cost_reduction_pct": (baseline["avg_cost_usd"] - post_opt["avg_cost_usd"]) / baseline["avg_cost_usd"],
            "monthly_savings": baseline["monthly_projected_cost"] - post_opt["monthly_projected_cost"]
        }
    }

Final Results:

| Metric | Baseline | Post-Optimization | Change |
|--------|---------:|------------------:|-------:|
| Avg system prompt tokens | 1,847 | 714 | -61.3% |
| Avg RAG context tokens | 3,420 | 1,720 | -49.7% |
| Avg history tokens | 418 | 22 | -94.7% |
| Avg output tokens | 1,840 | 680 | -63.0% |
| Avg total tokens | 11,415 | 7,026 | -38.4% |
| Avg cost per review | $0.38 | $0.092 | -75.8% |
| Monthly cost | $4,560 | $1,104 | -$3,456 |
| Quality score | 4.10 | 3.98 | -0.12 |

(The PR diff itself, 3,890 tokens on average, is untouched by the optimizations, which is why the cost reduction — driven by caching discounts, output pricing, and routing — exceeds the raw token reduction.)

Target achieved: Cost of $0.092 is well below the $0.15 target. Quality score of 3.98 is above the 3.80 floor.

Token breakdown of optimizations:

| Optimization | Token Savings | % of Total Savings |
|--------------|---------------|-------------------:|
| OPT-001: History bug fix | 396/review | 9.0% |
| OPT-002: System prompt compression | 1,133/review | 25.8% |
| OPT-004: RAG relevance filtering | 1,700/review | 38.7% |
| OPT-005: Output format constraints | 1,160/review | 26.4% |
| OPT-003: Prompt caching | ~1,570/review effective cost reduction (cache discount, not a token reduction) | |
| OPT-006: Trivial PR routing | ~$0.018 saved per trivial PR (cost, not tokens) | |
| Total direct token reduction | 4,389/review (38.4% of baseline) | 100% |

Tip: When presenting optimization results to stakeholders, always show both the token reduction and the dollar impact. Engineering leaders respond to token percentages; business stakeholders respond to dollars. "$3,456 per month saved from a 6-week optimization effort" is more compelling than "57.9% token reduction" for executive reporting — even though they represent the same achievement.


Phase 5: Establishing Ongoing Monitoring

The audit is complete but the work is not. A token optimization without ongoing monitoring will regress. Set up the monitoring infrastructure that ensures these gains are preserved.

Monitoring Configuration

ALERT_RULES = [
    {
        "name": "cost_per_review_regression",
        "condition": "avg_cost_usd_24h > 0.15",
        "severity": "critical",
        "message": "PR review cost exceeded $0.15 threshold (target: $0.092)",
        "action": "page_oncall"
    },
    {
        "name": "system_prompt_token_spike",
        "condition": "avg_system_prompt_tokens_24h > 900",
        "severity": "warning",
        "message": "System prompt tokens exceeded 900 (baseline: 714). Check for prompt changes.",
        "action": "slack_alert"
    },
    {
        "name": "rag_context_regression",
        "condition": "avg_rag_context_tokens_24h > 2500",
        "severity": "warning",
        "message": "RAG context tokens up significantly. Check relevance filtering.",
        "action": "slack_alert"
    },
    {
        "name": "quality_floor_breach",
        "condition": "avg_quality_score_24h < 3.80",
        "severity": "critical",
        "message": "Review quality dropped below 3.80 floor. Investigate immediately.",
        "action": "page_oncall"
    },
    {
        "name": "conversation_history_leak",
        "condition": "avg_conversation_history_tokens_24h > 100",
        "severity": "warning",
        "message": "Conversation history tokens rising. Session bug may have recurred.",
        "action": "slack_alert"
    }
]

def check_all_alerts():
    stats = get_24h_stats("pr_review_pipeline")

    triggered = []
    for rule in ALERT_RULES:
        if evaluate_condition(rule["condition"], stats):
            triggered.append(rule)
            fire_alert(rule)

    return triggered
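`evaluate_condition` is assumed above. Since every rule follows the same `metric <op> threshold` shape, a small parser avoids reaching for `eval` (a sketch):

```python
import operator
import re

# Two-character operators must precede their one-character prefixes.
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def evaluate_condition(condition: str, stats: dict) -> bool:
    """Evaluate a 'metric <op> threshold' rule string against a stats dict."""
    match = re.fullmatch(r"\s*(\w+)\s*(>=|<=|>|<)\s*([\d.]+)\s*", condition)
    if match is None:
        raise ValueError(f"Unparseable alert condition: {condition!r}")
    metric, op, threshold = match.groups()
    return _OPS[op](stats[metric], float(threshold))
```

Raising on an unparseable rule (rather than silently returning False) matters here: a typo in an alert condition should fail loudly, not disable the alert.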

Monthly Benchmark Review

## Monthly Token Optimization Review — PR Review Pipeline

### Benchmark Targets (set post-optimization)
| Metric | Green | Yellow | Red |
|--------|-------|--------|-----|
| Avg cost per review | ≤ $0.12 | $0.12–$0.15 | > $0.15 |
| Avg total tokens | ≤ 5,500 | 5,500–7,000 | > 7,000 |
| System prompt tokens | ≤ 800 | 800–1,000 | > 1,000 |
| RAG context tokens | ≤ 2,000 | 2,000–2,500 | > 2,500 |
| Quality score | ≥ 4.00 | 3.80–4.00 | < 3.80 |

### Review Process (Monthly)
1. Pull 30-day stats from analytics store
2. Compare against benchmark targets
3. Identify any metrics in Yellow or Red
4. For each Yellow/Red metric: create hypothesis ticket
5. Review experiment backlog: what is ready to ship?
6. Update the "What Works" registry with any new learnings

Phase 6: Capstone Retrospective and Reusable Artifacts

Lessons Learned

Finding 1: Bugs beat optimizations. The conversation history bug (OPT-001) was the fastest fix and among the most impactful. Always run a "correctness audit" before starting optimization work. Look for tokens being consumed by behavior that is simply wrong.

Finding 2: Organic prompt growth is the most common problem. The system prompt grew from 650 to 1,847 tokens over 8 months with no single large change — just accumulation. This is the most common cause of token cost inflation in production systems. A periodic "prompt hygiene review" would have caught this at 900 tokens rather than 1,847.

Finding 3: Output constraints outperformed input constraints. Reducing output tokens from 1,840 to 680 (a 63% reduction) had a larger dollar impact than the system prompt compression (1,847 to 714 tokens), even though the two reductions were similar in absolute token count, because output tokens are priced 4–5x higher than input tokens.
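At illustrative GPT-4o list prices ($2.50 per 1M input tokens, $10 per 1M output; verify current pricing), the asymmetry is easy to see:

```python
INPUT_PRICE, OUTPUT_PRICE = 2.50, 10.00  # illustrative USD per 1M tokens

# Per-review savings from the two comparably sized reductions:
output_saving = (1840 - 680) * OUTPUT_PRICE / 1_000_000   # 1,160 output tokens
prompt_saving = (1847 - 714) * INPUT_PRICE / 1_000_000    # 1,133 input tokens

# output_saving is roughly 4x prompt_saving despite similar token counts
```

Prompt caching widens the gap further, since cached input tokens are discounted while output tokens never are.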

Finding 4: Model routing needs a pre-classifier. Naively routing small PRs to a cheaper model failed the quality floor. A lightweight pre-classification step ($0.001) unlocked the routing saving by distinguishing trivial from substantive small PRs.

Finding 5: A/B testing duration matters. The system prompt A/B test would have been inconclusive after 3 days. It took 7 days to reach statistical significance. Never call a test early.

Reusable Audit Template

Use this process for any agentic workflow:

## Token Optimization Audit Template

### 1. Instrumentation (Week 1)
- [ ] Add structured telemetry to every LLM call
- [ ] Log token breakdown by component (system prompt, context, history, diff/input)
- [ ] Log output tokens separately
- [ ] Tag all calls with agent, workflow, task type, input complexity

### 2. Baseline Collection (Weeks 1-2)
- [ ] Collect minimum 500 representative runs
- [ ] Compute mean, median, P75, P95 for each metric
- [ ] Break down token consumption by component
- [ ] Identify the top 3 cost drivers

### 3. Opportunity Identification (Week 2)
- [ ] Check for correctness bugs (session leaks, duplicate context, stale history)
- [ ] Check for prompt caching eligibility (static content > 200 tokens)
- [ ] Measure system prompt token count vs. original design intent
- [ ] Evaluate RAG retrieval relevance (are all chunks used? quality threshold?)
- [ ] Measure output verbosity (output tokens vs. task requirements)
- [ ] Evaluate model routing opportunities (complexity distribution)

### 4. Prioritization
- [ ] Score each opportunity: impact (estimated savings) × 1/risk
- [ ] Ship no-risk items immediately (bugs, caching)
- [ ] Design A/B tests for medium-risk items
- [ ] Defer or reject high-risk items unless savings are critical

### 5. Execution (Weeks 3-6)
- [ ] Ship no-risk items with 48h monitoring
- [ ] Run A/B tests sequentially (avoid simultaneous experiments)
- [ ] Log all results in experiment tracker
- [ ] Validate each shipped change against quality floor

### 6. Monitoring Setup
- [ ] Configure alerts for cost regression, component token spikes, quality floor breach
- [ ] Set monthly benchmark review on team calendar
- [ ] Add prompt linting to CI pipeline
- [ ] Schedule quarterly "prompt hygiene review"

Tip: Run this audit every six months on your most expensive workflows. Systems change organically — new requirements, new context sources, prompt additions from well-meaning developers. A six-month audit cadence catches drift before it becomes a crisis, and the cumulative savings from two audits per year typically justify the investment in the monitoring infrastructure that makes them fast to run.


Summary

This capstone demonstrated a complete token optimization audit workflow on a realistic production system:

  1. Baseline instrumentation revealed that 38.4% of token consumption, spread across five distinct components, was avoidable
  2. Bug-first investigation found a conversation history leak that was the fastest and most certain win
  3. Prioritized A/B testing validated system prompt compression, output format constraints, and RAG filtering — all statistically significant, all within quality bounds
  4. Model routing required an additional pre-classification layer to maintain quality, demonstrating that optimization hypotheses sometimes require iteration
  5. Cumulative result: $0.38 → $0.092 per review (75.8% reduction), $3,456/month savings, quality maintained at 3.98 vs. 3.80 floor
  6. Ongoing monitoring with specific alert thresholds ensures gains are preserved through code and prompt changes

The techniques in this capstone — instrumentation, baseline analysis, opportunity mapping, prioritized A/B testing, and monitoring — apply to any agentic workflow: QA automation, sprint planning assistants, documentation generation, incident response agents, and beyond. The numbers will differ; the methodology will not.

Token optimization is not a project with an end date. It is a discipline, a feedback loop, and a team practice that compounds over time. The teams that treat it that way will build AI-powered systems that are sustainably economical — not just at launch, but at scale.