
Effective token optimization begins not with prompt rewriting but with measurement. Without a clear picture of where tokens are being consumed, by which agents, in which tasks, and at what cost, any optimization effort is guesswork. This topic covers the complete analytics stack: what signals matter, how to instrument your systems, which visualization approaches reveal actionable patterns, and how to establish benchmarks that drive continuous improvement.


Why Token Analytics Is a First-Class Engineering Concern

Token consumption is the primary cost driver in LLM-powered systems, and unlike compute costs in traditional software, token costs are shaped by architectural decisions made at every layer of the stack — system prompts, tool definitions, context accumulation, retrieval strategies, and response formatting.

Senior engineers often treat LLM calls as black boxes. Token analytics breaks open that box. When you can answer questions like "why did this workflow suddenly cost 40% more last Tuesday?" or "which prompt template is most efficient for our code review agent?" you move from reactive firefighting to proactive optimization.

The Four Dimensions of Token Measurement

Token analytics must be tracked across four dimensions to be actionable:

Volume: Raw token counts — prompt tokens, completion tokens, total tokens — per call, per session, per workflow, and per agent type.

Cost: Translate token counts into dollars using the current pricing tier for each model. This must account for input/output asymmetries (most providers charge several times more per output token than per input token), cached-token discounts (Anthropic prompt caching, OpenAI prompt caching), and batch vs. real-time pricing. A cost-translation sketch follows this list.

Efficiency: Tokens consumed relative to task value delivered. A workflow that uses 80,000 tokens to summarize a document is less efficient than one that uses 12,000 tokens for the same quality output.

Trend: How token consumption changes over time, across deployments, and in response to prompt changes. Trends reveal drift, regressions, and the impact of optimizations.
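
To make the cost dimension concrete, here is a minimal cost-translation sketch. The per-million-token rates are illustrative placeholders rather than current published prices, and the PRICING table and calculate_cost helper are local conventions that later examples in this topic assume:

# Illustrative rates in USD per million tokens -- placeholders, not
# published prices; substitute your provider's current rate card.
PRICING = {
    "gpt-4o": {"input": 2.50, "cached_input": 1.25, "output": 10.00},
    "claude-sonnet-4-5": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}

def calculate_cost(usage, model: str) -> float:
    """Translate a usage record into dollars, honoring cached-input discounts."""
    rates = PRICING[model]
    # Where the cached-token count lives varies by SDK; OpenAI nests it
    # under usage.prompt_tokens_details.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    uncached = usage.prompt_tokens - cached
    return (
        uncached * rates["input"]
        + cached * rates["cached_input"]
        + usage.completion_tokens * rates["output"]
    ) / 1_000_000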

Tip: Start tracking the input/output token ratio for each agent type. A ratio heavily skewed toward input tokens (e.g., 10:1 or higher) often signals bloated context or system prompts that can be compressed. A ratio skewed toward output (e.g., 1:5) may indicate prompts that aren't constraining response length effectively.


Instrumenting Your LLM Calls for Analytics

Before you can visualize or benchmark, you need structured telemetry. The instrumentation approach depends on your stack, but the principle is the same: every LLM call must emit structured metadata that can be aggregated later.

Minimal Instrumentation Schema

Every LLM call in a production system should log the following fields:

{
  "trace_id": "uuid",
  "span_id": "uuid",
  "timestamp_utc": "2026-05-10T14:23:00Z",
  "model": "claude-sonnet-4-5",
  "agent_name": "code_review_agent",
  "workflow_name": "pr_review_pipeline",
  "task_type": "code_review",
  "input_tokens": 14823,
  "output_tokens": 1204,
  "cached_input_tokens": 9200,
  "total_tokens": 16027,
  "cost_usd": 0.0423,
  "latency_ms": 3820,
  "tool_calls_count": 3,
  "turn_number": 2,
  "session_id": "session-abc123",
  "user_id": "eng-team-a",
  "environment": "production"
}

This schema captures everything needed for cost attribution, efficiency analysis, and trend detection.
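
As one way to produce these records, here is a minimal emitter sketch. It assumes an OpenAI-style response object and a logger configured to ship JSON lines to your aggregation backend; log_llm_call is a hypothetical helper, not a library function:

import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("llm_telemetry")

def log_llm_call(response, latency_ms, cost_usd, **tags):
    """Emit one JSON line per LLM call, matching the schema above.

    `tags` carries the remaining schema fields (agent_name, workflow_name,
    task_type, session_id, user_id, environment, ...).
    """
    record = {
        "trace_id": str(uuid.uuid4()),
        "span_id": str(uuid.uuid4()),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        **tags,
    }
    logger.info(json.dumps(record))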

Instrumentation with LangSmith

LangSmith is the native observability platform for LangChain-based agents, but it also works as a generic tracing backend for any LLM workflow.

Setup:

import os
from langsmith.wrappers import wrap_openai
import openai

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "token-optimization-module-10"

# wrap_openai instruments the client so every call is traced in LangSmith
client = wrap_openai(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this PR..."}],
    # custom metadata is attached via langsmith_extra, not the OpenAI kwargs
    langsmith_extra={
        "metadata": {
            "agent_name": "pr_summary_agent",
            "workflow": "pr_review_pipeline",
            "task_type": "summarization",
        }
    },
)

LangSmith automatically captures input/output tokens, latency, model name, and all message content. You can add custom metadata to every run for downstream filtering.

Instrumentation with Helicone

Helicone acts as a proxy between your application and the LLM API. Integration requires no code changes beyond pointing the client's base URL at Helicone and adding authentication headers:

import openai

client = openai.OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key",
        "Helicone-Property-Agent": "code_review_agent",
        "Helicone-Property-Workflow": "pr_review_pipeline",
        "Helicone-User-Id": "eng-team-a"
    }
)

Helicone automatically logs all token usage, costs, and latency, and provides a dashboard for filtering by your custom properties. It supports Anthropic, OpenAI, Mistral, and other providers.

Instrumenting with OpenTelemetry for Self-Hosted Analytics

For teams running self-hosted infrastructure or needing GDPR-compliant logging, OpenTelemetry with Prometheus and Grafana provides full control:

import time

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Expose a /metrics endpoint for Prometheus to scrape
start_http_server(8000)

reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
meter = provider.get_meter("llm_agent")

token_counter = meter.create_counter(
    "llm_tokens_total",
    description="Total tokens consumed",
    unit="tokens"
)

cost_counter = meter.create_counter(
    "llm_cost_usd_total",
    description="Total LLM cost in USD"
)

latency_histogram = meter.create_histogram(
    "llm_latency_ms",
    description="LLM call latency in milliseconds"
)

def tracked_llm_call(agent_name, workflow, messages, model="gpt-4o"):
    """Make an LLM call and record tokens, cost, and latency against it."""
    start = time.time()
    # `client` (an OpenAI client) and `calculate_cost` (a pricing helper like
    # the sketch earlier in this topic) are assumed to be defined elsewhere
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed_ms = (time.time() - start) * 1000

    attrs = {"agent": agent_name, "workflow": workflow, "model": model}
    token_counter.add(response.usage.total_tokens, attrs)
    cost_counter.add(calculate_cost(response.usage, model), attrs)
    latency_histogram.record(elapsed_ms, attrs)

    return response

Tip: Tag every LLM call with at least three dimensions: agent_name, workflow_name, and task_type. This three-dimensional tagging lets you slice usage by persona (which team is using which agent), by process (which workflows are expensive), and by function (which task categories are inefficient). Without multi-dimensional tagging, your dashboards will show totals but not sources.


Building Token Analytics Dashboards

Raw logs are not analytics. You need visualizations that surface patterns and anomalies at a glance.

Grafana Dashboard Architecture

A well-structured token analytics dashboard has four layers:

Layer 1: Executive Summary (top of dashboard)
- Total tokens this week vs. last week (percent change)
- Total cost this month vs. budget
- P95 cost per workflow run
- Top 3 most expensive agents

Layer 2: Workflow Drilldown
A table panel showing, for each workflow:
- Total runs today
- Average tokens per run
- P95 tokens per run (reveals outliers)
- Average cost per run
- 7-day trend sparkline

Layer 3: Agent Efficiency Panel
A scatter plot with:
- X-axis: average input tokens
- Y-axis: task success rate (or a proxy metric like user feedback score)
- Bubble size: volume of calls
- Color: agent name

This reveals the efficiency frontier — agents that achieve high success with low tokens are your benchmarks for others.
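
To prototype this panel outside Grafana, a few lines of matplotlib over per-agent aggregates will do. The numbers below are hypothetical and the column names are a local convention; in practice, roll the aggregates up from your run logs:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-agent aggregates
agents = pd.DataFrame({
    "agent": ["pr_summary", "code_review", "test_gen"],
    "avg_input_tokens": [4200, 14800, 6100],
    "success_rate": [0.93, 0.81, 0.88],
    "call_volume": [1200, 450, 800],
})

plt.scatter(
    agents["avg_input_tokens"],
    agents["success_rate"],
    s=agents["call_volume"] / 5,  # bubble size proportional to call volume
)
for _, row in agents.iterrows():
    plt.annotate(row["agent"], (row["avg_input_tokens"], row["success_rate"]))
plt.xlabel("Average input tokens")
plt.ylabel("Task success rate")
plt.title("Agent efficiency frontier")
plt.show()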

Layer 4: Anomaly Detection
Time-series charts showing token consumption over time with anomaly bands (mean ± 2 standard deviations). Spikes above the upper band trigger alerts.

Prometheus Queries for Token Analytics

# Average tokens per call over the last 24 hours
# (assumes a companion llm_calls_total counter incremented once per call)
rate(llm_tokens_total[24h]) / rate(llm_calls_total[24h])

# Cost per workflow over the last 7 days
sum by (workflow) (increase(llm_cost_usd_total[7d]))

# P95 tokens per call, per agent
# (assumes token counts are also recorded in a histogram exposing llm_tokens_bucket)
histogram_quantile(0.95, sum by (agent, le) (llm_tokens_bucket))

# Tokens spent per successful outcome per day
# (assumes a successful_outcomes_total counter)
increase(llm_tokens_total[1d]) / increase(successful_outcomes_total[1d])
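
The Layer 4 anomaly bands also translate into PromQL. A sketch, assuming the llm_tokens_total counter from the instrumentation above and a recording rule named workflow:llm_tokens:rate1h:

# Recording rule (defined in your Prometheus rules file):
#   workflow:llm_tokens:rate1h = sum by (workflow) (rate(llm_tokens_total[1h]))

# Alert expression: current rate exceeds the 7-day mean plus two standard deviations
workflow:llm_tokens:rate1h
  > avg_over_time(workflow:llm_tokens:rate1h[7d])
    + 2 * stddev_over_time(workflow:llm_tokens:rate1h[7d])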

LangSmith Analytics Queries

In LangSmith, use the Runs tab with filters like these to build analytics views.

To rank the heaviest token consumers by agent:

Filter: project = "pr_review_pipeline"
Sort by: total_tokens DESC
Group by: run_type, metadata.agent_name

To surface runs with unusually long outputs:

Filter: output_tokens > 2000 AND project = "pr_review_pipeline"

Tip: Build a "token budget exceeded" dashboard tile that shows any workflow run where tokens exceeded 150% of the rolling 7-day average for that workflow type. This is your most reliable early-warning system for prompt drift, context window leaks, and runaway agentic loops.
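
With the same recording rule as in the anomaly sketch above, this tip's 150% threshold is a one-line expression:

# Fires when a workflow's token rate runs above 150% of its own 7-day average
workflow:llm_tokens:rate1h > 1.5 * avg_over_time(workflow:llm_tokens:rate1h[7d])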


Establishing Meaningful Token Benchmarks

Benchmarks give optimization efforts a target. Without them, you cannot know whether a change is an improvement. But benchmarks must be set carefully — a benchmark that is too loose is useless, while one set too tight creates false alarms.

The Benchmark Setting Process

Step 1: Establish a baseline. Run your workflows with current prompts and log at least 200 representative runs per workflow type. Compute the mean, median, P95, and P99 token counts for each workflow.
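
A baseline computation sketch, assuming the run logs described earlier in this topic have been exported as JSON lines (the file name is hypothetical):

import pandas as pd

# Export of the per-call log records described earlier in this topic
runs = pd.read_json("llm_runs.jsonl", lines=True)

baseline = runs.groupby("workflow_name")["total_tokens"].agg(
    mean="mean",
    median="median",
    p75=lambda s: s.quantile(0.75),
    p90=lambda s: s.quantile(0.90),
    p95=lambda s: s.quantile(0.95),
    p99=lambda s: s.quantile(0.99),
)
print(baseline)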

Step 2: Define the benchmark tier structure.

Tier       Definition               Action
Green      ≤ P75 of baseline        No action needed
Yellow     P75 – P90 of baseline    Flag for review
Red        P90 – P99 of baseline    Alert on-call
Critical   > P99 of baseline        Page immediately
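
A sketch of the tier check, taking the percentile baseline from Step 1 (the dict layout is a local convention):

def classify_run(total_tokens: int, baseline: dict) -> str:
    """Map a run's token count onto the tier structure above.

    `baseline` holds this workflow's percentiles from Step 1,
    e.g. {"p75": 9_000, "p90": 14_000, "p99": 26_000}.
    """
    if total_tokens <= baseline["p75"]:
        return "green"
    if total_tokens <= baseline["p90"]:
        return "yellow"
    if total_tokens <= baseline["p99"]:
        return "red"
    return "critical"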

Step 3: Set cost-per-outcome benchmarks. For each workflow, define a "unit of value" (e.g., one PR reviewed, one test case generated, one user story written). Then set a target cost per unit:

PR Review Agent:
  - Target: < $0.08 per review
  - Acceptable: $0.08 – $0.15 per review
  - Expensive: > $0.15 per review (investigate)

Test Generation Agent:
  - Target: < $0.05 per test file
  - Acceptable: $0.05 – $0.12 per test file
  - Expensive: > $0.12 per test file (investigate)

Step 4: Set efficiency benchmarks. Measure the ratio of output tokens to input tokens for each agent. A code generation agent should produce substantial output relative to its input. A classification agent should produce minimal output. Set expected ranges:

Code Generation Agent:  output/input ratio target = 0.3 – 0.6
Summarization Agent:    output/input ratio target = 0.05 – 0.15
Q&A Agent:              output/input ratio target = 0.1 – 0.3
Classification Agent:   output/input ratio target = 0.01 – 0.05
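
These ranges drop straight into a per-call check. A sketch, with the range table mirroring the targets above:

# Expected output/input token ratio ranges per agent type (from the table above)
RATIO_RANGES = {
    "code_generation": (0.3, 0.6),
    "summarization": (0.05, 0.15),
    "qa": (0.1, 0.3),
    "classification": (0.01, 0.05),
}

def check_efficiency(agent_type: str, input_tokens: int, output_tokens: int) -> str:
    """Flag calls whose output/input ratio falls outside the expected band."""
    low, high = RATIO_RANGES[agent_type]
    ratio = output_tokens / max(input_tokens, 1)
    if ratio < low:
        return f"input-heavy ({ratio:.3f} < {low}): check for bloated context or system prompts"
    if ratio > high:
        return f"output-heavy ({ratio:.3f} > {high}): check that prompts constrain response length"
    return "within expected range"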

Industry Reference Benchmarks

Typical ranges observed in production deployments across engineering teams:

  • Simple RAG Q&A: 800–2,500 tokens per query
  • Code review (single file): 5,000–15,000 tokens per review
  • PR summary generation: 3,000–8,000 tokens per PR
  • Test case generation (per function): 2,000–6,000 tokens
  • Sprint planning assistant (per story): 1,500–4,000 tokens
  • Bug triage agent: 2,000–7,000 tokens per ticket

These are starting points. Your actual benchmarks should be calibrated to your specific prompts, codebase complexity, and quality targets.

Tip: Run a "benchmark calibration sprint" at the start of any new project. Before writing a single optimization, spend one sprint just running production-equivalent workloads and logging everything. This gives you clean baseline data that will make all subsequent A/B tests and optimization comparisons statistically meaningful.


Advanced Analytics: Cost Attribution and Chargeback

As LLM-powered tools mature within organizations, teams need to attribute token costs to business units, teams, or product features. This is "token chargeback" — similar to cloud cost allocation.

Implementing Cost Attribution

Tag every LLM call with a cost center hierarchy:

headers = {
    "Helicone-Property-CostCenter": "engineering",
    "Helicone-Property-Team": "platform",
    "Helicone-Property-Feature": "pr-review",
    "Helicone-Property-UserTier": "enterprise"
}

Then build a weekly cost allocation report:

-- Example warehouse query over exported run logs (Postgres syntax; adapt to your schema)
SELECT 
    metadata->>'cost_center' as cost_center,
    metadata->>'team' as team,
    metadata->>'feature' as feature,
    COUNT(*) as total_runs,
    SUM(total_tokens) as total_tokens,
    SUM(cost_usd) as total_cost_usd,
    AVG(total_tokens) as avg_tokens_per_run,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_tokens) as p95_tokens
FROM llm_runs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY cost_center, team, feature
ORDER BY total_cost_usd DESC;

Weights & Biases for ML-Adjacent Teams

For teams already using W&B for model training, the wandb library integrates cleanly with LLM cost tracking:

import wandb

run = wandb.init(
    project="token-analytics",
    job_type="inference",
    # string metadata belongs in config so runs can be grouped and filtered
    config={"agent": agent_name, "workflow": workflow_name},
)

wandb.log({
    "tokens/input": response.usage.prompt_tokens,
    "tokens/output": response.usage.completion_tokens,
    # OpenAI reports cached tokens under usage.prompt_tokens_details
    "tokens/cached": response.usage.prompt_tokens_details.cached_tokens,
    "cost/usd": calculate_cost(response.usage, model),
    "latency/ms": elapsed_ms,
})

run.finish()

W&B's sweep and comparison features then let you compare token efficiency across different prompt versions as a first-class experiment.

Tip: Build a monthly "token budget review" meeting into your team cadence. Present the cost attribution report, highlight the top three cost drivers, and assign ownership for optimization investigations. Treat token cost the same way you treat cloud infrastructure spend — with regular reviews, owners, and targets. This cultural habit is often worth more than any single technical optimization.


Summary

Token usage analytics is the foundation of all optimization work in this course. The key practices are:

  1. Instrument every LLM call with structured metadata covering agent, workflow, task type, and token counts
  2. Use purpose-built tools (LangSmith, Helicone) or self-hosted stacks (OTel + Prometheus + Grafana) to aggregate telemetry
  3. Build dashboards with four layers: executive summary, workflow drilldown, efficiency scatter, and anomaly detection
  4. Set benchmarks using baseline percentiles (P75/P90/P99) and cost-per-outcome targets
  5. Implement cost attribution with hierarchical tagging for team-level accountability
  6. Run a benchmark calibration sprint before starting any optimization effort