
One-time optimizations decay. Prompts drift as systems evolve, new tool definitions accumulate, context windows get populated with new data sources, and team members add "just one more instruction" to system prompts. Without a feedback loop, the token savings you achieved in Month 1 will be eroded by Month 3. A token optimization feedback loop is the organizational and technical infrastructure that makes improvement self-sustaining.

This topic covers the full cycle: how to detect when optimization is needed, how to generate actionable hypotheses, how to run focused experiments, and how to institutionalize learnings so the whole team benefits.


The Four Phases of the Token Optimization Loop

The optimization loop operates on a regular cadence and consists of four phases:

Phase 1 — Measure: Collect and analyze current token consumption data. Surface anomalies, inefficiencies, and regressions.

Phase 2 — Hypothesize: Convert observations from measurement into testable hypotheses about specific changes that could improve efficiency.

Phase 3 — Test: Run controlled experiments (A/B tests or staged rollouts) to validate or invalidate each hypothesis.

Phase 4 — Improve: Ship winning changes, document learnings, update team standards, and retire failed hypotheses.

The loop then restarts. The cadence is typically: daily monitoring, weekly hypothesis review, biweekly experiment completion, monthly standards update.
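To keep the cadence from living only in someone's head, it can help to encode it as a small piece of configuration that dashboards and reminder jobs read from. A minimal sketch; the phase names and fields are illustrative, not a required schema:

# Illustrative cadence configuration for the optimization loop.
OPTIMIZATION_LOOP_CADENCE = {
    "measure":     {"frequency": "daily",    "output": "usage digest and anomaly triage"},
    "hypothesize": {"frequency": "weekly",   "output": "hypothesis backlog review"},
    "test":        {"frequency": "biweekly", "output": "experiment completion and decision"},
    "improve":     {"frequency": "monthly",  "output": "standards and registry update"},
}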

Tip: Assign a "token optimization owner" role that rotates through the team every two weeks. This person is responsible for monitoring the dashboards, triaging anomalies, and presenting findings at the weekly review. Rotating the responsibility prevents it from becoming invisible infrastructure that nobody owns.


Phase 1: Systematic Measurement and Signal Detection

Measurement in the feedback loop goes beyond the analytics dashboards covered in Topic 1. Here, the focus is on detecting signals that trigger the next phase of the loop.

The Three Signal Types

Regression signals: Token consumption increased significantly compared to the previous period with no corresponding increase in task complexity or volume.

def detect_token_regressions(
    workflow_name: str, 
    lookback_days: int = 7,
    regression_threshold: float = 1.15  # 15% increase
) -> list[dict]:
    """
    Detect workflows where recent token usage has regressed
    compared to the trailing 30-day baseline.
    """
    # 30-day baseline window that ends 7 days ago (assumes get_token_stats
    # supports an end offset via days_forward)
    baseline_window = get_token_stats(workflow_name, days_back=37, days_forward=-7)
    recent_window = get_token_stats(workflow_name, days_back=lookback_days)

    regressions = []
    for agent in recent_window:
        if agent not in baseline_window:
            continue  # new agent with no baseline yet
        baseline_median = baseline_window[agent]["median_tokens"]
        recent_median = recent_window[agent]["median_tokens"]

        if recent_median > baseline_median * regression_threshold:
            regressions.append({
                "agent": agent,
                "baseline_median": baseline_median,
                "recent_median": recent_median,
                "regression_factor": recent_median / baseline_median,
                "estimated_weekly_cost_increase": calculate_cost_delta(
                    baseline_median, recent_median, recent_window[agent]["weekly_volume"]
                )
            })

    return sorted(regressions, key=lambda x: x["estimated_weekly_cost_increase"], reverse=True)

Efficiency gap signals: Some agents are significantly less efficient than peers performing similar tasks. This suggests a prompt or context strategy that can be improved.

import numpy as np

def detect_efficiency_gaps(task_type: str) -> list[dict]:
    """
    Among all agents performing the same task type, identify those
    that use significantly more tokens than the median performer.
    """
    agents = get_agents_by_task_type(task_type)

    if len(agents) < 2:
        return []  # Need multiple agents to compare

    median_efficiency = np.median([a["tokens_per_task"] for a in agents])

    gaps = []
    for agent in agents:
        if agent["tokens_per_task"] > median_efficiency * 1.3:  # 30% above median
            gaps.append({
                "agent": agent["name"],
                "tokens_per_task": agent["tokens_per_task"],
                "median_tokens_per_task": median_efficiency,
                "gap_factor": agent["tokens_per_task"] / median_efficiency,
                "weekly_excess_cost": calculate_excess_cost(agent, median_efficiency)
            })

    return gaps

Opportunity signals: Areas where recent model capability improvements or new optimization techniques could yield gains not previously available.

def detect_opportunities() -> list[dict]:
    """
    Identify optimization opportunities based on current usage patterns
    and known optimization techniques.
    """
    opportunities = []

    # Check for agents not using prompt caching
    agents_without_caching = get_agents_where(
        condition="system_prompt_tokens > 500 AND prompt_caching_enabled = false"
    )
    for agent in agents_without_caching:
        # Assumes ~90% of system-prompt reads hit the cache; CACHE_DISCOUNT_RATE is a
        # constant you define to convert saved tokens into dollars at your provider's
        # cached-read discount.
        potential_savings = agent["weekly_system_prompt_tokens"] * 0.90 * CACHE_DISCOUNT_RATE
        opportunities.append({
            "type": "prompt_caching",
            "agent": agent["name"],
            "estimated_weekly_savings": potential_savings,
            "effort": "low",
            "action": "Enable prompt caching for static system prompt"
        })

    # Check for high output/input ratio (verbosity problem)
    verbose_agents = get_agents_where(
        condition="avg_output_tokens > avg_input_tokens * 0.5 AND task_type = 'analysis'"
    )
    for agent in verbose_agents:
        opportunities.append({
            "type": "output_verbosity",
            "agent": agent["name"],
            "current_output_ratio": agent["output_ratio"],
            "target_output_ratio": 0.15,
            "effort": "medium",
            "action": "Add output length constraints and structured format instructions"
        })

    return opportunities

Tip: Set up a weekly Slack notification (or email digest) that automatically summarizes the top three regressions, top three efficiency gaps, and top three opportunities. Keep it short — three items per category, with estimated dollar impact. Teams that see a "$340 regression this week in the PR review agent" act on it. Teams that have to dig through dashboards to find problems usually don't.


Phase 2: Generating and Prioritizing Hypotheses

Not every signal deserves equal attention. Hypothesis generation is a prioritization exercise as much as a creative one.

The Hypothesis Backlog

Maintain a hypothesis backlog — a structured list of optimization ideas with enough context to act on them. Each item should include:

## Hypothesis: Compress PR Review System Prompt

**Signal Type**: Regression  
**Detection Date**: 2026-05-03  
**Agent**: pr_review_agent  
**Observation**: System prompt grew from 650 tokens to 1,240 tokens after the 
March security guidelines update. Weekly cost increased by $280.

**Hypothesis**: Removing the redundant "context-setting" paragraphs and 
converting the security checklist from prose to a compact bulleted format 
will reduce system prompt tokens from 1,240 to ~600 without degrading 
review quality (current quality score: 4.3/5.0).

**Expected Impact**:
- Token reduction: ~50% of system prompt (640 tokens per call saved)
- Weekly calls: ~8,500
- Weekly savings: ~$300 at current pricing (slightly more than the $280/week increase, since the reduction exceeds the original growth)

**Quality Risk**: Low — the security guidance will still be present, 
just more compact. Engineers will still receive the same coverage points.

**Test Design**: A/B test with 400 samples per variant, 7-day run.  
**Quality Guard**: Quality score ≥ 4.0/5.0 (current: 4.3/5.0)  

**Priority Score**: High (high savings, low risk, easy to implement)

Prioritization Matrix

Score each hypothesis on two dimensions: Expected Impact (token/cost savings) and Implementation Risk (probability of quality degradation).

| Hypothesis | Impact Score (1-5) | Risk Score (1-5, lower=safer) | Priority |
|---|---|---|---|
| Compress system prompt | 4 | 1 | High |
| Enable prompt caching | 5 | 1 | Critical |
| Reduce RAG chunk count from 8 to 5 | 3 | 3 | Medium |
| Switch from GPT-4o to Claude Haiku for classification | 5 | 4 | Medium |
| Remove few-shot examples from code gen prompt | 3 | 4 | Low |
| Add output length limits to summarization prompt | 4 | 2 | High |

Work the top of the prioritization matrix first. High-impact, low-risk hypotheses (like enabling prompt caching on static system prompts) should be shipped immediately without A/B testing — they are near-zero risk and high reward.
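If you want the matrix applied consistently across reviewers, the bucketing can be expressed as a small helper. The thresholds below are one reasonable reading of the table above, not a fixed rule:

def priority_bucket(impact_score: int, risk_score: int) -> str:
    """Map the two matrix dimensions (1-5 each, lower risk = safer) to a priority label."""
    if impact_score >= 4 and risk_score <= 1:
        return "Critical" if impact_score == 5 else "High"
    if impact_score >= 4 and risk_score <= 2:
        return "High"
    if risk_score >= 4 and impact_score <= 3:
        return "Low"
    return "Medium"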

Tip: Separate "no-brainer" optimizations from "needs testing" optimizations. Enabling prompt caching, removing duplicate context, and fixing obvious prompt verbosity issues should be shipped as direct improvements. Save your A/B testing budget for hypotheses where the quality impact is genuinely uncertain.


Phase 3: Running the Optimization Experiment

The experiment phase follows the A/B testing methodology from Topic 2, but within the feedback loop, speed and cadence matter as much as rigor.

Experiment Velocity vs. Rigor Trade-offs

| Scenario | Recommended Approach |
|---|---|
| High-confidence, low-risk change | Ship directly, monitor for 48h, roll back if regression |
| Medium-confidence, medium-risk | 80/20 rollout (80% control, 20% treatment), monitor 3-5 days |
| Low-confidence, high-risk | Full A/B test with power analysis, run 7-14 days |
| Safety-critical or revenue-critical | Full A/B test + human evaluation panel + staged rollout |
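The table can also be turned into a default suggestion when a hypothesis enters the backlog. A sketch, where the confidence and risk labels are assumptions about how your backlog is tagged:

def recommended_approach(confidence: str, risk: str) -> str:
    """Map (confidence, risk) labels to the default rollout approach from the table above."""
    if risk == "critical":  # safety-critical or revenue-critical paths
        return "Full A/B test + human evaluation panel + staged rollout"
    if confidence == "high" and risk == "low":
        return "Ship directly, monitor for 48h, roll back if regression"
    if confidence == "medium" and risk == "medium":
        return "80/20 rollout (80% control, 20% treatment), monitor 3-5 days"
    return "Full A/B test with power analysis, run 7-14 days"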

Experiment Tracking in the Feedback Loop

Use a simple experiment tracker that integrates with your team's workflow:

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class ExperimentStatus(Enum):
    BACKLOG = "backlog"
    RUNNING = "running"
    ANALYZING = "analyzing"
    SHIPPED = "shipped"
    REJECTED = "rejected"
    INCONCLUSIVE = "inconclusive"

@dataclass
class TokenExperiment:
    id: str
    name: str
    hypothesis: str
    agent_name: str
    signal_type: str  # regression | efficiency_gap | opportunity

    # Setup
    expected_token_reduction_pct: float
    quality_guard_threshold: float
    required_samples: int

    # Execution
    status: ExperimentStatus = ExperimentStatus.BACKLOG
    start_date: datetime | None = None
    end_date: datetime | None = None
    actual_samples_collected: int = 0

    # Results
    token_reduction_pct: float | None = None
    quality_delta: float | None = None
    p_value: float | None = None
    monthly_savings_usd: float | None = None
    decision: str | None = None

    # Learning
    key_learnings: list[str] = field(default_factory=list)
    follow_up_hypotheses: list[str] = field(default_factory=list)

class ExperimentTracker:
    def __init__(self):
        self.experiments: list[TokenExperiment] = []

    def add_experiment(self, exp: TokenExperiment):
        self.experiments.append(exp)
        self._notify_team(f"New experiment added: {exp.name}")

    def get_experiment(self, exp_id: str) -> TokenExperiment:
        return next(e for e in self.experiments if e.id == exp_id)

    def complete_experiment(self, exp_id: str, results: dict):
        exp = self.get_experiment(exp_id)
        exp.status = ExperimentStatus.ANALYZING
        exp.token_reduction_pct = results["token_reduction_pct"]
        exp.quality_delta = results["quality_delta"]
        exp.p_value = results["p_value"]
        exp.monthly_savings_usd = results["monthly_savings_usd"]

        # Auto-decision: ship only if the result is significant, tokens dropped by at
        # least 10%, and quality stayed within 0.2 points of the baseline
        if (results["p_value"] < 0.05 and
            results["token_reduction_pct"] >= 0.10 and
            results["quality_delta"] >= -0.2):
            exp.decision = "SHIP"
            exp.status = ExperimentStatus.SHIPPED
            self._trigger_deployment(exp)
        elif results["p_value"] >= 0.05:
            exp.decision = "INCONCLUSIVE — collect more samples"
            exp.status = ExperimentStatus.INCONCLUSIVE
        else:
            exp.decision = "REJECT — quality degradation or insufficient savings"
            exp.status = ExperimentStatus.REJECTED

        self._generate_learning_report(exp)
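As a usage illustration, here is how the PR review hypothesis from Phase 2 might flow through the tracker. All values are hypothetical:

tracker = ExperimentTracker()
tracker.add_experiment(TokenExperiment(
    id="exp-042",
    name="Compress PR review system prompt",
    hypothesis="Bulleted checklist format cuts system prompt tokens ~50% without hurting quality",
    agent_name="pr_review_agent",
    signal_type="regression",
    expected_token_reduction_pct=0.50,
    quality_guard_threshold=4.0,
    required_samples=400,
))

# After the 7-day run: 48% fewer tokens, quality within the guard, significant result -> auto-SHIP
tracker.complete_experiment("exp-042", {
    "token_reduction_pct": 0.48,
    "quality_delta": -0.1,
    "p_value": 0.01,
    "monthly_savings_usd": 1200,
})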

Tip: Set a maximum experiment lifespan. If an experiment has not reached statistical significance after 21 days, close it as "inconclusive" and move on. Experiments that run too long accumulate "experiment debt" — the codebase contains multiple variants, context windows have likely shifted, and the original hypothesis may no longer be valid.


Phase 4: Shipping, Documenting, and Institutionalizing

Shipping an optimization is not the end of the loop — it is the beginning of the next cycle and a learning opportunity for the team.

The Ship Checklist

Before shipping any prompt/context optimization, verify the following (a code sketch of these checks appears after the list):

  1. Statistical significance confirmed (p < 0.05)
  2. Quality guard metric within acceptable range
  3. Change is documented in the prompt registry (see Topic 4)
  4. Deployment is staged (10% → 50% → 100% over 48 hours)
  5. Rollback procedure is tested
  6. Post-ship monitoring alert is configured (watch for quality regression over 72 hours)
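The checklist can be enforced as a small pre-ship gate against the TokenExperiment structure from Phase 3. A minimal sketch; the boolean flags are assumptions about what your deployment tooling can report:

def pre_ship_failures(exp: TokenExperiment, registry_updated: bool,
                      rollback_tested: bool, monitoring_alert_configured: bool) -> list[str]:
    """Return the unmet ship-checklist items; an empty list means clear to stage the rollout."""
    failures = []
    if exp.p_value is None or exp.p_value >= 0.05:
        failures.append("statistical significance not confirmed (p >= 0.05)")
    if exp.quality_delta is None or exp.quality_delta < -0.2:
        failures.append("quality guard metric outside acceptable range")
    if not registry_updated:
        failures.append("change not documented in the prompt registry")
    if not rollback_tested:
        failures.append("rollback procedure not tested")
    if not monitoring_alert_configured:
        failures.append("post-ship monitoring alert not configured")
    return failures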

Staged Rollout Implementation

import hashlib

def staged_rollout_assignment(session_id: str, rollout_stage: str) -> str:
    """
    Assigns users to new vs. old variant based on rollout stage.
    rollout_stage: "10pct" | "50pct" | "100pct"
    """
    hash_value = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    normalized = hash_value / (2 ** 128)

    thresholds = {
        "10pct": 0.10,
        "50pct": 0.50,
        "100pct": 1.0
    }

    threshold = thresholds[rollout_stage]
    return "new" if normalized < threshold else "old"

def monitor_staged_rollout(agent_name: str, hours: int = 24):
    new_variant_stats = get_recent_stats(agent_name, variant="new", hours=hours)
    old_variant_stats = get_recent_stats(agent_name, variant="old", hours=hours)

    quality_delta = new_variant_stats["quality_score"] - old_variant_stats["quality_score"]
    token_delta = new_variant_stats["median_tokens"] - old_variant_stats["median_tokens"]

    if quality_delta < -0.3:
        trigger_rollback(agent_name)
        alert_team(f"ROLLBACK TRIGGERED: {agent_name} quality degraded by {quality_delta:.2f}")

    return {
        "quality_delta": quality_delta,
        "token_delta": token_delta,
        "status": "healthy" if quality_delta >= -0.2 else "degraded"
    }
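To drive the 10% → 50% → 100% progression, a daily job can call the monitor and advance only when the last 24 hours look healthy. A sketch; set_rollout_stage and the one-day hold per stage are assumptions about your deployment tooling:

STAGE_ORDER = ["10pct", "50pct", "100pct"]

def advance_rollout_if_healthy(agent_name: str, current_stage: str) -> str:
    """Run once per day per in-flight rollout; returns the stage to use for the next day."""
    health = monitor_staged_rollout(agent_name, hours=24)
    if health["status"] != "healthy":
        # Hold at the current stage; monitor_staged_rollout already triggers a rollback
        # on severe quality degradation.
        return current_stage
    next_index = min(STAGE_ORDER.index(current_stage) + 1, len(STAGE_ORDER) - 1)
    next_stage = STAGE_ORDER[next_index]
    set_rollout_stage(agent_name, next_stage)  # hypothetical config/flag setter
    return next_stage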

Documenting Learnings

Every completed experiment — whether shipped, rejected, or inconclusive — should contribute to the team's knowledge base. The most valuable artifacts are:

The "What Works" Registry: A running list of optimization patterns that have been proven effective in your specific system. Examples:

## Proven Optimizations (Updated: 2026-05-10)

### Output Verbosity Controls
- Adding "Be concise. Limit response to 300 words." reduces output tokens by 28-35%
- Structuring output as JSON with explicit field names reduces verbosity vs. prose by 40%
- Effective across: PR review, ticket summarization, code explanation agents
- Quality impact: -0.1 to 0.0 on 5-point scale (negligible)

### System Prompt Compression
- Converting prose instructions to numbered lists reduces tokens by 30-45%
- Removing "context-setting" introductory paragraphs saves 100-300 tokens with no quality impact
- Role definitions above 50 words show diminishing returns vs. concise alternatives
- Effective across: all agent types

### RAG Context Management
- Reducing top_k from 8 to 5 reduces context tokens by 35% with <5% quality impact on Q&A tasks
- Filtering retrieved chunks below 0.65 cosine similarity saves 15-25% of context tokens
- Chunk size 500 outperforms 1000 for factual Q&A; 1000 outperforms 500 for code generation

The "What Doesn't Work" Registry: Equally important. Prevents teams from re-running failed experiments.

Tip: Celebrate shipped optimizations in your team's communication channel. Post "We saved $X this month from the PR review prompt optimization" alongside engineering metrics like deployment frequency. Making token savings visible creates intrinsic motivation for the team to continue the optimization loop.


Automating the Feedback Loop

As the loop matures, you can automate parts of it to reduce manual overhead.

Automated Regression Detection and Alerting

def daily_optimization_sweep():
    """
    Automated daily sweep of token usage patterns.
    Generates an optimization report and creates hypothesis tickets.
    """
    report = []

    # 1. Detect regressions per monitored workflow
    #    (get_monitored_workflows is a placeholder for however you enumerate them)
    for workflow_name in get_monitored_workflows():
        regressions = detect_token_regressions(workflow_name, lookback_days=7)
        for regression in regressions[:3]:  # Top 3 per workflow
            report.append(f"REGRESSION: {regression['agent']} up {regression['regression_factor']:.1f}x")
            create_jira_ticket(
                title=f"Token regression: {regression['agent']}",
                description=format_regression_ticket(regression),
                labels=["token-optimization", "regression"],
                priority="high" if regression["estimated_weekly_cost_increase"] > 100 else "medium"
            )

    # 2. Detect efficiency gaps
    for task_type in ["code_review", "summarization", "test_generation", "qa"]:
        gaps = detect_efficiency_gaps(task_type)
        for gap in gaps:
            report.append(f"EFFICIENCY GAP: {gap['agent']} is {gap['gap_factor']:.1f}x median")

    # 3. Check for new opportunities (not every opportunity type carries a savings estimate)
    opportunities = detect_opportunities()
    ranked = sorted(opportunities, key=lambda x: x.get("estimated_weekly_savings", 0), reverse=True)
    for opp in ranked[:3]:
        savings = opp.get("estimated_weekly_savings", 0)
        report.append(f"OPPORTUNITY: {opp['type']} for {opp['agent']} (~${savings:.0f}/week)")

    # 4. Send digest
    send_team_digest(report, channel="#token-optimization")

    return report

Integrating with CI/CD: Prompt Change Detection

Prevent prompt regressions from reaching production by adding token budget checks to your CI pipeline:

name: Token Budget Check
on: [pull_request]

jobs:
  token-budget:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # fetch full history so origin/main is available for the diff

      - name: Check for prompt changes
        id: prompt-diff
        run: |
          git diff origin/main -- "**/*prompt*" "**/*system_prompt*" > prompt_diff.txt
          if [ -s prompt_diff.txt ]; then
            echo "PROMPTS_CHANGED=true" >> $GITHUB_ENV
          fi

      - name: Run token budget test
        if: env.PROMPTS_CHANGED == 'true'
        run: |
          python scripts/token_budget_test.py \
            --agent all \
            --sample-size 50 \
            --max-regression-pct 10 \
            --report-path token_report.json

      - name: Comment results on PR
        if: env.PROMPTS_CHANGED == 'true'
        uses: actions/github-script@v6
        with:
          script: |
            const report = require('./token_report.json');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: formatTokenReport(report)
            });

This CI check runs a sample evaluation on prompt changes and reports the estimated token impact directly on the pull request, before any change reaches production.
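The token_budget_test.py script is project-specific, but one possible shape is sketched below: measure current prompt token counts, compare them against a committed baseline file, and fail the build if any agent regresses beyond the threshold. The measure_agent_tokens helper and the token_baselines.json file are assumptions, not part of the workflow above:

#!/usr/bin/env python3
"""Sketch of scripts/token_budget_test.py: fail CI if prompt token usage regresses."""
import argparse
import json
import sys
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent", default="all")
    parser.add_argument("--sample-size", type=int, default=50)
    parser.add_argument("--max-regression-pct", type=float, default=10.0)
    parser.add_argument("--report-path", default="token_report.json")
    args = parser.parse_args()

    # Committed baseline of median prompt tokens per agent (assumed file)
    baselines = json.loads(Path("token_baselines.json").read_text())
    report, failed = {}, False

    for agent, baseline_tokens in baselines.items():
        if args.agent != "all" and agent != args.agent:
            continue
        # Hypothetical helper: run the agent's prompts over a small sample and count tokens
        current_tokens = measure_agent_tokens(agent, samples=args.sample_size)
        delta_pct = 100 * (current_tokens - baseline_tokens) / baseline_tokens
        report[agent] = {
            "baseline": baseline_tokens,
            "current": current_tokens,
            "delta_pct": round(delta_pct, 1),
        }
        if delta_pct > args.max_regression_pct:
            failed = True

    Path(args.report_path).write_text(json.dumps(report, indent=2))
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()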

Tip: The CI/CD integration is the most impactful automation you can build for the feedback loop. Once engineers see token impact estimates on every prompt-related PR, the culture shifts permanently. Teams start self-optimizing because the cost signal is visible at the point of change, not weeks later in a dashboard.


Summary

The token optimization feedback loop is a continuous process, not a one-time project. The key components are:

  1. Three signal types drive the loop: regressions, efficiency gaps, and opportunities
  2. Maintain a structured hypothesis backlog with prioritization scores
  3. Separate "no-brainer" direct improvements from "needs testing" hypotheses
  4. Use staged rollouts (10% → 50% → 100%) with quality monitoring for every shipped optimization
  5. Document both what works and what doesn't — the "doesn't work" registry prevents wasted effort
  6. Automate regression detection with daily sweeps and CI/CD token budget checks
  7. Assign rotating ownership and celebrate shipped optimizations to sustain team engagement