
You cannot optimize what you do not measure. Before applying any of the optimization techniques in subsequent modules, you need an accurate baseline: how many tokens does your current system actually consume, where does it spend them, and how does that translate to cost and latency? This topic builds a complete token measurement infrastructure — from local development tooling to production dashboards — that gives you the visibility needed to make data-driven optimization decisions.


The Token Measurement Stack — What You Need and Why

A complete token measurement stack has four layers, each serving a different purpose and audience:

Layer 1: Request-level instrumentation — Logs every API call with full token metadata. Answers: "What did this specific call cost?"

Layer 2: Session-level aggregation — Aggregates token data across all calls in one agent session. Answers: "What did completing this task cost?"

Layer 3: Workflow-level analytics — Groups sessions by task type, user, feature, and environment. Answers: "What does our code review feature cost per day?"

Layer 4: Trend dashboards — Visualizes token usage over time, surfaces anomalies, and compares before/after optimization. Answers: "Is our agent becoming more or less efficient as we ship changes?"

Most teams start at Layer 1 and stay there, making it impossible to answer Layer 3 and Layer 4 questions that drive real business decisions. This topic builds all four.
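
To make the layering concrete, here is a minimal sketch (assuming request records shaped like the usage_record built later in this topic) of how Layer 1 records roll up into a Layer 2 session summary:

def summarize_session(request_records: list[dict]) -> dict:
    """Layer 2 sketch: roll per-request (Layer 1) records up into one session-level summary."""
    summary = {"calls": 0, "input_tokens": 0, "output_tokens": 0}
    for rec in request_records:
        summary["calls"] += 1
        summary["input_tokens"] += rec["tokens"]["input"]
        summary["output_tokens"] += rec["tokens"]["output"]
    summary["total_tokens"] = summary["input_tokens"] + summary["output_tokens"]
    return summary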

Tip: Before building custom instrumentation, check what your LLM API provider already gives you. Anthropic's Workbench and the Usage API, OpenAI's Usage dashboard, and LangSmith all provide Layer 3–4 data out of the box for many use cases. Avoid reinventing the wheel — custom instrumentation is most valuable for correlation with your own business metrics (e.g., "tokens per Jira ticket resolved").


Provider-Native Token Reporting — Starting with What You Have

Every major LLM provider exposes token usage data in their API responses and management consoles. Extracting and logging this data requires zero additional infrastructure.

Anthropic — usage data in API response:

import anthropic
import json
from datetime import datetime, timezone

client = anthropic.Anthropic()

def call_with_logging(
    messages: list[dict],
    system: str = "",
    model: str = "claude-sonnet-4-5",
    max_tokens: int = 1024,
    task_label: str = "untagged",
    metadata: dict | None = None
) -> tuple[str, dict]:
    """
    Make an API call and return (response_text, usage_record).
    The usage_record is ready to be written to your logging system.
    """
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=messages
    )

    usage = response.usage
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "task_label": task_label,
        "metadata": metadata or {},
        "tokens": {
            "input": usage.input_tokens,
            "output": usage.output_tokens,
            "cache_read": getattr(usage, "cache_read_input_tokens", 0),
            "cache_write": getattr(usage, "cache_creation_input_tokens", 0),
            "total": usage.input_tokens + usage.output_tokens,
        },
        "stop_reason": response.stop_reason,
        "request_id": response.id,
    }

    return response.content[0].text, record


response_text, usage_record = call_with_logging(
    messages=[{"role": "user", "content": "Generate test cases for user registration."}],
    system="You are a QA engineer. Generate structured test cases.",
    task_label="test_generation",
    metadata={
        "feature": "user_registration",
        "sprint": "2026-Q2-S3",
        "team": "platform",
        "environment": "development",
        "jira_ticket": "PLAT-1247"
    }
)

print(json.dumps(usage_record, indent=2))

OpenAI — usage data extraction:

from openai import OpenAI
from datetime import datetime, timezone

openai_client = OpenAI()

def openai_call_with_logging(
    messages: list[dict],
    model: str = "gpt-4o",
    task_label: str = "untagged"
) -> tuple[str, dict]:
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages
    )

    usage = response.usage
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "task_label": task_label,
        "tokens": {
            "input": usage.prompt_tokens,
            "output": usage.completion_tokens,
            "cached": getattr(usage.prompt_tokens_details, "cached_tokens", 0),
            "reasoning": getattr(usage.completion_tokens_details, "reasoning_tokens", 0),
            "total": usage.total_tokens,
        },
        "finish_reason": response.choices[0].finish_reason,
        "request_id": response.id,
    }

    return response.choices[0].message.content, record

Tip: Tag every API call with at minimum these four metadata fields: (1) task_label — what kind of task this is (code_review, test_generation, sprint_planning), (2) feature — which product feature triggered this call, (3) environment — dev/staging/prod, and (4) session_id — to group related calls. These tags are the foundation of meaningful usage analytics and are almost impossible to add retroactively to historical data.


LangSmith — The Purpose-Built Agent Observability Platform

LangSmith (by LangChain) is one of the most mature observability platforms for LLM applications. It provides automatic token tracking, latency measurement, trace visualization, and comparison features out of the box.

Setting up LangSmith tracing with any LLM application:

pip install langsmith langchain-anthropic langchain-openai

import os
from langsmith import Client, traceable
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_api_key"
os.environ["LANGCHAIN_PROJECT"] = "token-optimization-course"

llm = ChatAnthropic(model="claude-sonnet-4-5")

@traceable(name="code_review_agent", tags=["agent", "code_review"])
def run_code_review(pr_diff: str, ticket_id: str) -> str:
    """This function call will appear as a trace in LangSmith with full token breakdown."""
    messages = [
        SystemMessage(content="You are a senior code reviewer. Be concise and actionable."),
        HumanMessage(content=f"Review this PR:\n{pr_diff}")
    ]

    response = llm.invoke(messages, config={
        "metadata": {
            "ticket_id": ticket_id,
            "team": "backend"
        }
    })
    return response.content

result = run_code_review("[PR diff content]", ticket_id="BE-4521")

What LangSmith gives you automatically:
- Token count per call (input, output, cached)
- Latency per call and total session latency
- Cost estimation based on model pricing
- Trace tree showing the full call hierarchy for multi-step agents
- Side-by-side comparison of different prompt versions
- Dataset creation from production traces for regression testing

Querying LangSmith usage data programmatically:

from langsmith import Client
from datetime import datetime, timedelta, timezone

ls_client = Client()

runs = ls_client.list_runs(
    project_name="token-optimization-course",
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    run_type="llm",  # Only LLM calls, not chain runs
)

from collections import defaultdict

usage_by_task = defaultdict(lambda: {"input": 0, "output": 0, "count": 0})

for run in runs:
    task_label = run.tags[0] if run.tags else "untagged"
    # Token counts are integer fields on the LangSmith Run schema (may be None for some runs)
    usage_by_task[task_label]["input"] += run.prompt_tokens or 0
    usage_by_task[task_label]["output"] += run.completion_tokens or 0
    usage_by_task[task_label]["count"] += 1

print("Weekly Token Usage by Task Type")
print("=" * 60)
for task, data in sorted(usage_by_task.items(), key=lambda x: -x[1]["input"]):
    total = data["input"] + data["output"]
    avg = total / data["count"] if data["count"] > 0 else 0
    print(f"{task:<30} {data['count']:>5} calls  {total:>10,} tokens  {avg:>8,.0f} avg/call")

Tip: Use LangSmith's "Playground" feature to replay production traces with modified prompts and immediately see the token count difference. This is the fastest way to test whether a prompt optimization actually reduces tokens in a realistic context — much faster than setting up a custom benchmark.


Building a Custom Token Dashboard — SQLite + Python

For teams that need more control over their analytics or that use multiple LLM providers, a lightweight custom dashboard built on SQLite gives full flexibility.

Schema design:

-- token_usage.db

CREATE TABLE usage_records (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp   TEXT NOT NULL,
    session_id  TEXT NOT NULL,
    call_seq    INTEGER NOT NULL,  -- call number within session
    model       TEXT NOT NULL,
    task_label  TEXT NOT NULL,
    feature     TEXT,
    team        TEXT,
    environment TEXT DEFAULT 'development',
    sprint      TEXT,
    jira_ticket TEXT,

    -- Token counts
    input_tokens        INTEGER NOT NULL DEFAULT 0,
    output_tokens       INTEGER NOT NULL DEFAULT 0,
    cache_read_tokens   INTEGER NOT NULL DEFAULT 0,
    cache_write_tokens  INTEGER NOT NULL DEFAULT 0,
    reasoning_tokens    INTEGER NOT NULL DEFAULT 0,

    -- Derived
    total_tokens  INTEGER GENERATED ALWAYS AS (
        input_tokens + output_tokens
    ) STORED,

    -- Cost (pre-computed at write time with current pricing)
    estimated_cost_usd  REAL,

    -- Quality signals
    stop_reason     TEXT,
    latency_ms      INTEGER,
    retry_count     INTEGER DEFAULT 0,

    -- Optimization signals  
    had_cache_hit   INTEGER GENERATED ALWAYS AS (
        CASE WHEN cache_read_tokens > 0 THEN 1 ELSE 0 END
    ) STORED
);

CREATE INDEX idx_timestamp ON usage_records(timestamp);
CREATE INDEX idx_task_label ON usage_records(task_label);
CREATE INDEX idx_session ON usage_records(session_id);
CREATE INDEX idx_feature ON usage_records(feature);

Python logger class:

import sqlite3
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

PRICING = {
    "claude-opus-4": {"input": 15.00, "output": 75.00, "cache_read": 1.50, "cache_write": 18.75},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00, "cache_read": 0.08, "cache_write": 1.00},
    "gpt-4o": {"input": 2.50, "output": 10.00, "cache_read": 1.25, "cache_write": 0.0},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cache_read": 0.075, "cache_write": 0.0},
}

class TokenUsageDB:
    def __init__(self, db_path: str = "token_usage.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.executescript("""
                CREATE TABLE IF NOT EXISTS usage_records (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    timestamp TEXT NOT NULL,
                    session_id TEXT NOT NULL,
                    call_seq INTEGER NOT NULL,
                    model TEXT NOT NULL,
                    task_label TEXT NOT NULL,
                    feature TEXT,
                    team TEXT,
                    environment TEXT DEFAULT 'development',
                    sprint TEXT,
                    jira_ticket TEXT,
                    input_tokens INTEGER NOT NULL DEFAULT 0,
                    output_tokens INTEGER NOT NULL DEFAULT 0,
                    cache_read_tokens INTEGER NOT NULL DEFAULT 0,
                    cache_write_tokens INTEGER NOT NULL DEFAULT 0,
                    estimated_cost_usd REAL,
                    stop_reason TEXT,
                    latency_ms INTEGER,
                    retry_count INTEGER DEFAULT 0
                );
                CREATE INDEX IF NOT EXISTS idx_timestamp ON usage_records(timestamp);
                CREATE INDEX IF NOT EXISTS idx_task ON usage_records(task_label);
                CREATE INDEX IF NOT EXISTS idx_feature ON usage_records(feature);
                CREATE INDEX IF NOT EXISTS idx_session ON usage_records(session_id);
            """)

    def log(
        self,
        session_id: str,
        call_seq: int,
        model: str,
        input_tokens: int,
        output_tokens: int,
        cache_read_tokens: int = 0,
        cache_write_tokens: int = 0,
        task_label: str = "untagged",
        feature: str | None = None,
        team: str | None = None,
        environment: str = "development",
        sprint: str | None = None,
        jira_ticket: str | None = None,
        stop_reason: str | None = None,
        latency_ms: int | None = None,
        retry_count: int = 0
    ) -> None:
        pricing = PRICING.get(model, {"input": 0, "output": 0, "cache_read": 0, "cache_write": 0})
        cost = (
            input_tokens / 1e6 * pricing["input"]
            + output_tokens / 1e6 * pricing["output"]
            + cache_read_tokens / 1e6 * pricing["cache_read"]
            + cache_write_tokens / 1e6 * pricing["cache_write"]
        )

        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT INTO usage_records (
                    timestamp, session_id, call_seq, model, task_label,
                    feature, team, environment, sprint, jira_ticket,
                    input_tokens, output_tokens, cache_read_tokens, cache_write_tokens,
                    estimated_cost_usd, stop_reason, latency_ms, retry_count
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                datetime.now(timezone.utc).isoformat(), session_id, call_seq, model, task_label,
                feature, team, environment, sprint, jira_ticket,
                input_tokens, output_tokens, cache_read_tokens, cache_write_tokens,
                cost, stop_reason, latency_ms, retry_count
            ))

    def weekly_report(self) -> None:
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            rows = conn.execute("""
                SELECT 
                    task_label,
                    COUNT(*) as calls,
                    SUM(input_tokens) as total_input,
                    SUM(output_tokens) as total_output,
                    SUM(input_tokens + output_tokens) as total_tokens,
                    AVG(input_tokens + output_tokens) as avg_tokens_per_call,
                    SUM(estimated_cost_usd) as total_cost,
                    AVG(latency_ms) as avg_latency_ms,
                    SUM(CASE WHEN cache_read_tokens > 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as cache_hit_rate
                FROM usage_records
                WHERE timestamp > datetime('now', '-7 days')
                GROUP BY task_label
                ORDER BY total_cost DESC
            """).fetchall()

            print(f"\n{'Task':<25} {'Calls':>6} {'Tokens':>12} {'Cost':>10} {'Cache%':>8} {'Latency':>9}")
            print("=" * 75)
            for row in rows:
                print(f"{row['task_label']:<25} {row['calls']:>6} "
                      f"{row['total_tokens']:>12,} "
                      f"${row['total_cost']:>9.4f} "
                      f"{row['cache_hit_rate']:>7.1f}% "
                      f"{row['avg_latency_ms']:>8.0f}ms")

    def session_drill_down(self, session_id: str) -> None:
        """Show per-call breakdown for a specific session."""
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            rows = conn.execute("""
                SELECT call_seq, task_label, input_tokens, output_tokens,
                       cache_read_tokens, estimated_cost_usd, latency_ms, stop_reason
                FROM usage_records
                WHERE session_id = ?
                ORDER BY call_seq
            """, (session_id,)).fetchall()

            print(f"\nSession: {session_id}")
            print(f"{'Call':>5} {'Task':<25} {'Input':>8} {'Output':>8} {'Cache':>8} {'Cost':>9} {'ms':>6}")
            print("-" * 70)
            for row in rows:
                print(f"{row['call_seq']:>5} {row['task_label']:<25} "
                      f"{row['input_tokens']:>8,} {row['output_tokens']:>8,} "
                      f"{row['cache_read_tokens']:>8,} "
                      f"${row['estimated_cost_usd']:>8.5f} {row['latency_ms']:>6}")

@contextmanager
def tracked_session(db: TokenUsageDB, task_label: str, **metadata):
    session_id = str(uuid.uuid4())[:8]
    call_seq = [0]

    def log_call(model, input_t, output_t, cache_read=0, cache_write=0, 
                 latency=None, stop_reason=None):
        call_seq[0] += 1
        db.log(
            session_id=session_id, call_seq=call_seq[0],
            model=model, input_tokens=input_t, output_tokens=output_t,
            cache_read_tokens=cache_read, cache_write_tokens=cache_write,
            task_label=task_label, latency_ms=latency, stop_reason=stop_reason,
            **metadata
        )

    try:
        yield log_call
    finally:
        print(f"Session {session_id} complete: {call_seq[0]} calls")

Tip: For QA engineers: include your TokenUsageDB in your CI pipeline. After each test run against the agent, query the database to assert that token usage is within acceptable bounds. If a code change causes the average tokens-per-call for the test_generation task to increase by more than 15%, fail the build and require review. This turns token efficiency into a regression-testable metric.
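
A minimal sketch of such a CI check, assuming a committed baseline file that maps task labels to their accepted average tokens per call (the file name, its format, and the 15% threshold are illustrative):

import json
import sqlite3
import sys

def check_token_regression(db_path: str = "token_usage.db",
                           baseline_path: str = "token_baselines.json",
                           task_label: str = "test_generation",
                           max_increase: float = 0.15) -> None:
    """Exit non-zero if avg tokens/call for a task grew more than max_increase vs. the baseline."""
    with open(baseline_path) as f:
        baseline_avg = json.load(f)[task_label]
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT AVG(input_tokens + output_tokens) FROM usage_records WHERE task_label = ?",
            (task_label,),
        ).fetchone()
    current_avg = row[0] or 0
    if baseline_avg and current_avg > baseline_avg * (1 + max_increase):
        print(f"FAIL: {task_label} avg tokens/call {current_avg:,.0f} exceeds "
              f"baseline {baseline_avg:,.0f} by more than {max_increase:.0%}")
        sys.exit(1)
    print(f"OK: {task_label} avg tokens/call {current_avg:,.0f} (baseline {baseline_avg:,.0f})")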


Establishing Baseline Benchmarks — The Before Picture

Before any optimization work begins, run your agent against a representative benchmark suite and record the baseline. Without a documented baseline, you cannot measure optimization progress, and you cannot justify the engineering time spent on optimization to stakeholders.

A baseline benchmark script:

import time
import json
from datetime import datetime, timezone
from pathlib import Path

class TokenBaselineBenchmark:
    """
    Run a standardized set of tasks and measure token footprint.
    Run this BEFORE optimization work to establish your baseline.
    Run again AFTER to measure improvements.
    """

    def __init__(self, agent_runner, db: TokenUsageDB, benchmark_name: str):
        self.agent_runner = agent_runner
        self.db = db
        self.benchmark_name = benchmark_name
        self.results = []

    def run_task(self, task_id: str, task_prompt: str, expected_output_quality: str) -> dict:
        session_id = f"{self.benchmark_name}_{task_id}"
        start_time = time.time()

        result = self.agent_runner.run(task_prompt)
        elapsed_ms = int((time.time() - start_time) * 1000)

        # Note: in real use, pull the token counts logged under session_id from your
        # TokenUsageDB and add them to this record alongside the timing data.
        record = {
            "task_id": task_id,
            "session_id": session_id,
            "benchmark": self.benchmark_name,
            "elapsed_ms": elapsed_ms,
            "prompt_preview": task_prompt[:100],
            "output_preview": str(result)[:100],
        }
        self.results.append(record)
        return record

    def save_baseline(self, output_path: str) -> None:
        baseline = {
            "benchmark_name": self.benchmark_name,
            "run_date": datetime.now(timezone.utc).isoformat(),
            "task_count": len(self.results),
            "results": self.results,
        }
        Path(output_path).write_text(json.dumps(baseline, indent=2))
        print(f"Baseline saved to {output_path}")

    @staticmethod
    def compare_baselines(before_path: str, after_path: str) -> None:
        before = json.loads(Path(before_path).read_text())
        after = json.loads(Path(after_path).read_text())

        print(f"\nOptimization Results: {before['benchmark_name']}")
        print(f"Before: {before['run_date']}")
        print(f"After:  {after['run_date']}")
        print("=" * 70)

        before_by_id = {r["task_id"]: r for r in before["results"]}
        after_by_id = {r["task_id"]: r for r in after["results"]}

        for task_id in before_by_id:
            if task_id not in after_by_id:
                continue
            b = before_by_id[task_id]
            a = after_by_id[task_id]
            # Display comparison (extend with real token fields)
            print(f"Task {task_id}: {b['elapsed_ms']}ms → {a['elapsed_ms']}ms")


BENCHMARK_TASKS = [
    {
        "task_id": "code_review_small",
        "prompt": "Review this 50-line Python function for bugs and style issues: [function]",
        "category": "code_review"
    },
    {
        "task_id": "test_gen_feature",
        "prompt": "Generate comprehensive test cases for a user login feature with SSO support.",
        "category": "test_generation"
    },
    {
        "task_id": "sprint_planning",
        "prompt": "Here are 12 user stories for our next sprint. Estimate story points and "
                  "identify dependencies: [stories]",
        "category": "pm_workflow"
    },
    {
        "task_id": "bug_diagnosis",
        "prompt": "Diagnose this production error: [stack trace + logs]. Identify root cause "
                  "and suggest fix.",
        "category": "debugging"
    },
    {
        "task_id": "doc_generation",
        "prompt": "Generate API documentation for this endpoint definition: [OpenAPI spec]",
        "category": "documentation"
    },
]
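
Running the benchmark might look like the following; my_agent is a stand-in for whatever object wraps your agent and exposes a run(prompt) method:

db = TokenUsageDB("token_usage.db")
benchmark = TokenBaselineBenchmark(agent_runner=my_agent, db=db, benchmark_name="pre_optimization_v1")

for task in BENCHMARK_TASKS:
    benchmark.run_task(task["task_id"], task["prompt"], expected_output_quality="acceptable")

benchmark.save_baseline("baseline_pre_optimization.json")

# After the optimization work, run the same tasks under a new benchmark name and compare:
# TokenBaselineBenchmark.compare_baselines("baseline_pre_optimization.json", "baseline_post_optimization.json")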

Tip: For product managers: define your optimization success criteria before starting the baseline benchmark. For example: "We will consider the token optimization project successful if we achieve (1) 40% reduction in average tokens per code review, (2) less than $200/month total LLM cost at our current volume, and (3) no measurable regression in review quality as rated by engineers." These criteria make the project scope clear and the outcome measurable.


Real-Time Token Monitoring — Production Observability

Once your agent is in production, you need real-time visibility into token consumption so you can detect anomalies before they become expensive incidents.

Key metrics to monitor in production:


METRICS_TO_TRACK = {
    # Volume metrics
    "llm.requests.total": "Counter — total API calls",
    "llm.tokens.input": "Counter — total input tokens",
    "llm.tokens.output": "Counter — total output tokens",
    "llm.tokens.cached": "Counter — tokens served from cache",

    # Cost metrics
    "llm.cost.usd": "Gauge — estimated cost per minute",
    "llm.cost.per_task": "Histogram — cost per task type",

    # Efficiency metrics
    "llm.cache_hit_rate": "Gauge — % requests with cache hit",
    "llm.tokens.per_task": "Histogram — tokens per task type",
    "llm.context_utilization": "Histogram — % of context window used",

    # Quality signals
    "llm.retry_rate": "Gauge — % requests that required retry",
    "llm.stop_reason": "Counter — by stop reason (end_turn, max_tokens, tool_use)",
    "llm.latency_ms": "Histogram — end-to-end latency",

    # Agentic loop metrics
    "agent.iterations": "Histogram — iterations to task completion",
    "agent.session_tokens": "Histogram — total tokens per session",
    "agent.tool_calls_per_session": "Histogram — tool call volume",
}

ALERT_THRESHOLDS = {
    "hourly_cost_usd": 50.00,          # Alert if >$50/hour
    "avg_tokens_per_code_review": 15_000,  # Alert if avg review exceeds this
    "cache_hit_rate_minimum": 0.60,    # Alert if cache hit rate drops below 60%
    "retry_rate_maximum": 0.10,        # Alert if >10% of calls require retry
    "p99_latency_ms": 30_000,          # Alert if p99 latency exceeds 30 seconds
}
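
A sketch of how these thresholds might be evaluated against a periodic metrics snapshot pulled from your logging store (the snapshot keys are illustrative):

def check_thresholds(snapshot: dict) -> list[str]:
    """Return human-readable alerts for any threshold the current snapshot violates."""
    alerts = []
    if snapshot.get("hourly_cost_usd", 0) > ALERT_THRESHOLDS["hourly_cost_usd"]:
        alerts.append(f"Hourly cost ${snapshot['hourly_cost_usd']:.2f} over budget")
    if snapshot.get("cache_hit_rate", 1.0) < ALERT_THRESHOLDS["cache_hit_rate_minimum"]:
        alerts.append(f"Cache hit rate {snapshot['cache_hit_rate']:.0%} below minimum")
    if snapshot.get("retry_rate", 0) > ALERT_THRESHOLDS["retry_rate_maximum"]:
        alerts.append(f"Retry rate {snapshot['retry_rate']:.0%} above maximum")
    return alerts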

Implementing Prometheus metrics emission:

from prometheus_client import Counter, Histogram, Gauge, start_http_server

llm_requests_total = Counter(
    "llm_requests_total", "Total LLM API requests",
    labelnames=["model", "task_label", "environment"]
)
llm_input_tokens_total = Counter(
    "llm_input_tokens_total", "Total input tokens",
    labelnames=["model", "task_label"]
)
llm_output_tokens_total = Counter(
    "llm_output_tokens_total", "Total output tokens",
    labelnames=["model", "task_label"]
)
llm_cost_usd_total = Counter(
    "llm_cost_usd_total", "Total estimated cost in USD",
    labelnames=["model", "task_label"]
)
llm_session_tokens = Histogram(
    "llm_session_tokens", "Tokens per agent session",
    labelnames=["task_label"],
    buckets=[1000, 5000, 10000, 25000, 50000, 100000, 200000]
)

def emit_metrics(usage_record: dict) -> None:
    labels = {
        "model": usage_record["model"],
        "task_label": usage_record["task_label"],
    }
    env_labels = {**labels, "environment": usage_record.get("environment", "unknown")}

    llm_requests_total.labels(**env_labels).inc()
    llm_input_tokens_total.labels(**labels).inc(usage_record["tokens"]["input"])
    llm_output_tokens_total.labels(**labels).inc(usage_record["tokens"]["output"])
    llm_cost_usd_total.labels(**labels).inc(usage_record.get("cost_usd", 0))
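
To expose these metrics, start the prometheus_client HTTP server once at process startup and call emit_metrics after every logged request (the port is arbitrary; the record fields follow the call_with_logging helper above):

start_http_server(9109)  # serves /metrics on this port for Prometheus to scrape

text, usage_record = call_with_logging(
    messages=[{"role": "user", "content": "Summarize this sprint's retrospective notes."}],
    task_label="retro_summary",
)
# emit_metrics reads an optional top-level "environment" key; copy it out of metadata if you tag it there
usage_record["environment"] = usage_record["metadata"].get("environment", "production")
emit_metrics(usage_record)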

Tip: Set up a "token anomaly" alert for your production system: if the token usage for any task type increases by more than 30% compared to the 7-day rolling average, send a Slack notification. This catches prompt injections (where adversarial input causes the model to generate massive outputs), context accumulation bugs, and accidental prompt regression from code changes — all before they appear on your monthly bill.


The Token Footprint Report — Communicating Findings to Stakeholders

The final deliverable of your token measurement work is a report that translates raw token data into business-intelligible findings. This is especially important for product managers who need to justify optimization investments and for engineering leads presenting to budget holders.

Token footprint report template:

TOKEN FOOTPRINT REPORT
Sprint: 2026-Q2-S3 | Period: May 1–14, 2026 | Team: Platform Engineering
════════════════════════════════════════════════════════════════════════

EXECUTIVE SUMMARY
─────────────────
Total LLM cost (2 weeks):  $847.32
Projected monthly cost:    $1,831.86
vs. Previous period:       +23% (driven by new test generation feature)
Primary cost driver:       Code review agent (61% of total cost)
Top optimization target:   Tool definitions in code review agent (38% of input)

COST BY FEATURE
───────────────
Feature               │ Calls │  Total Tokens │     Cost │ $/Call │ Trend
──────────────────────┼───────┼───────────────┼──────────┼────────┼──────
Code review           │   842 │   38,421,000  │ $515.22  │ $0.611 │  +5%
Test generation       │   312 │   11,847,000  │ $189.44  │ $0.607 │ +67%
Sprint planning       │    48 │    4,221,000  │  $78.31  │ $1.632 │  +2%
PR description gen    │   621 │    3,189,000  │  $41.89  │ $0.067 │  -8%
Retrospective summary │    24 │    1,391,000  │  $22.46  │ $0.936 │  +1%

TOKEN EFFICIENCY METRICS
────────────────────────
Metric                        │ Current │ Target │ Status
──────────────────────────────┼─────────┼────────┼────────
Avg tokens/code review        │  45,618 │ 25,000 │  ⚠ ABOVE TARGET
Cache hit rate                │   72.3% │  85.0% │  ⚠ BELOW TARGET
Output/input token ratio      │   0.087 │  0.100 │  ✓ OK
Avg agent iterations/task     │    4.2  │   3.0  │  ⚠ ABOVE TARGET
P99 context utilization       │   41.2% │  60.0% │  ✓ OK (low is good)

TOP 3 OPTIMIZATION OPPORTUNITIES
─────────────────────────────────
1. Implement prompt caching for code review tool definitions
   → Estimated savings: $180/month (21% of current code review cost)
   → Engineering effort: 1 day
   → Priority: HIGH

2. Compress verbose tool outputs in code review agent
   → Current avg tool result size: 2,847 tokens
   → Target avg tool result size: 400 tokens
   → Estimated savings: $130/month
   → Engineering effort: 3 days
   → Priority: HIGH

3. Reduce test generation agent iterations from 5.1 avg to 3.0
   → Root cause: agent re-reads test framework docs on every iteration
   → Estimated savings: $60/month
   → Engineering effort: 2 days
   → Priority: MEDIUM

Tip: Present the token footprint report in a business review meeting once per sprint, not just as a technical artifact in a Confluence page. When engineering leads, QA leads, and product managers review the same report together, optimization priorities align naturally with business priorities. The question "should we spend 3 days optimizing the test generation agent?" has a much clearer answer when everyone sees it costs $189/month and is growing at 67% per period.