Individual token optimizations compound when they are shared. A team of ten engineers, each independently writing system prompts, will produce ten different approaches — some efficient, some wasteful, none benefiting from the insights of the others. Team-level optimization is the infrastructure, culture, and process work that turns individual discoveries into team standards, and prevents the same inefficiencies from being re-invented repeatedly.
This topic covers the architecture of a shared prompt library, context template systems, collaborative optimization practices, and how to build a learning organization around token efficiency.
Why Individual Optimization Is Not Enough
Consider a typical engineering team building an AI-assisted development platform. The team has:
- A code review agent (built by the backend team)
- A test generation agent (built by the QA team)
- A PR summarization agent (built by the DevOps team)
- A sprint planning assistant (built by the product team)
Each of these was built in a silo. Each team discovered, independently, that their agent was too verbose. Each team spent time compressing their system prompt. Each team re-discovered the same insight: "Converting prose instructions to numbered lists reduces tokens by 30-40%."
This is not just inefficiency — it is a missed learning opportunity. Team-level optimization creates shared infrastructure so that when one team learns something, everyone benefits immediately.
The three levels of team-level optimization:
- Shared prompt libraries: Versioned, tested system prompts and prompt components that all agents can reference
- Context templates: Standardized patterns for injecting context (code, user stories, documentation) in token-efficient formats
- Team practices: Review processes, coding standards, and cultural norms that prevent token waste before it happens
Tip: Before building any shared infrastructure, audit your existing prompts. Collect all system prompts currently in production, measure their token counts, and identify the top five patterns of waste. These patterns — not theoretical best practices — should drive the design of your shared library. Build to solve real problems you can measure.
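A minimal audit sketch along these lines, assuming prompts are checked in as Markdown files under a `prompts/` directory and using tiktoken's gpt-4o encoding as a stand-in tokenizer (the paths and glob pattern are illustrative):

```python
# Audit sketch: measure every production system prompt and rank by size.
from pathlib import Path
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

results = sorted(
    ((p, len(enc.encode(p.read_text()))) for p in Path("prompts").rglob("*.md")),
    key=lambda pair: pair[1],
    reverse=True,  # largest prompts first: usually where the waste lives
)
for path, tokens in results:
    print(f"{tokens:>6}  {path}")
print(f"\nTotal: {sum(t for _, t in results)} tokens across {len(results)} prompts")
```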
Building a Shared Prompt Library
A shared prompt library is a version-controlled repository of prompt components — system prompt blocks, instruction modules, and output format templates — that teams can compose into complete agent prompts.
Library Architecture
prompts/
├── components/
│   ├── roles/
│   │   ├── senior_engineer.md
│   │   ├── qa_engineer.md
│   │   ├── product_manager.md
│   │   └── scrum_master.md
│   ├── instructions/
│   │   ├── code_review_guidelines.md
│   │   ├── security_checklist.md
│   │   ├── output_format_compact.md
│   │   └── output_format_structured_json.md
│   └── context_frames/
│       ├── pr_context_template.md
│       ├── ticket_context_template.md
│       └── codebase_summary_template.md
├── agents/
│   ├── pr_review_agent/
│   │   ├── system_prompt.md
│   │   ├── metadata.yaml
│   │   └── tests/
│   │       ├── eval_set.json
│   │       └── benchmarks.yaml
│   └── test_generation_agent/
│       ├── system_prompt.md
│       ├── metadata.yaml
│       └── tests/
├── templates/
│   ├── new_agent_template/
│   └── optimization_checklist.md
└── benchmarks/
    ├── token_budgets.yaml
    └── quality_baselines.yaml
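The benchmarks/ directory centralizes the numbers CI checks against; the exact schema is the team's to define. One plausible shape for `token_budgets.yaml`, with illustrative values:

```yaml
# benchmarks/token_budgets.yaml — illustrative values, not prescriptive
pr_review_agent:
  system_prompt_max: 700      # hard CI limit
  typical_total: 8500         # informational baseline
  total_max: 25000
test_generation_agent:
  system_prompt_max: 900
  total_max: 40000
```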
Prompt Metadata Schema
Every prompt in the library must have metadata:
name: pr_review_agent
version: "2.4.1"
last_updated: "2026-04-22"
owner: "platform-team"
task_type: "code_review"
model_compatibility:
  - claude-sonnet-4-5
  - gpt-4o
  - gpt-4o-mini  # validated for simple reviews only
token_benchmarks:
  system_prompt_tokens: 580
  typical_total_tokens: 8500
  p95_total_tokens: 18000
  token_budget_max: 25000
quality_baselines:
  quality_score_target: 4.2
  quality_score_minimum: 3.8
  eval_set: "tests/eval_set.json"
components_used:
  - roles/senior_engineer
  - instructions/code_review_guidelines
  - instructions/security_checklist
  - instructions/output_format_compact
optimization_notes:
  - "v2.0: Compressed security checklist from prose to bullet list, -340 tokens"
  - "v2.2: Removed redundant context-setting introduction, -180 tokens"
  - "v2.4: Added output length constraint, -15% average output tokens"
changelog:
  - version: "2.4.1"
    date: "2026-04-22"
    change: "Added 'be concise' constraint to output instructions"
    token_delta: -120
    quality_delta: -0.05
This metadata serves as both documentation and a machine-readable optimization history.
Prompt Composition System
Build a lightweight composition layer that assembles complete prompts from components:
import yaml
from pathlib import Path
from typing import Optional

PROMPTS_ROOT = Path("/repos/ai-platform/prompts")

def count_tokens(text: str) -> int:
    """Count tokens using tiktoken or Anthropic tokenizer"""
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    return len(enc.encode(text))

class PromptComposer:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.agent_path = PROMPTS_ROOT / "agents" / agent_name
        self.metadata = self._load_metadata()

    def _load_metadata(self) -> dict:
        with open(self.agent_path / "metadata.yaml") as f:
            return yaml.safe_load(f)

    def _load_component(self, component_path: str) -> str:
        full_path = PROMPTS_ROOT / "components" / component_path
        # Try .md first, then .txt
        for ext in [".md", ".txt", ""]:
            p = Path(str(full_path) + ext)
            if p.exists():
                return p.read_text().strip()
        raise FileNotFoundError(f"Component not found: {component_path}")

    def build_system_prompt(
        self,
        override_components: Optional[dict] = None,
        include_optimization_hints: bool = False
    ) -> str:
        """Build complete system prompt from components."""
        components = []
        for component_path in self.metadata["components_used"]:
            if override_components and component_path in override_components:
                # Allow A/B testing with component overrides
                components.append(override_components[component_path])
            else:
                components.append(self._load_component(component_path))
        system_prompt = "\n\n".join(components)

        # Budget check
        token_count = count_tokens(system_prompt)
        budget = self.metadata["token_benchmarks"]["system_prompt_tokens"]
        if token_count > budget * 1.2:  # >20% over budget
            print(f"WARNING: System prompt for {self.agent_name} is {token_count} tokens "
                  f"(budget: {budget}, {token_count/budget:.1%} of budget)")
        return system_prompt

    def validate_against_benchmarks(self, system_prompt: str) -> dict:
        """Validate a prompt against the stored benchmarks."""
        token_count = count_tokens(system_prompt)
        budget = self.metadata["token_benchmarks"]["system_prompt_tokens"]
        return {
            "token_count": token_count,
            "budget": budget,
            "budget_utilization": token_count / budget,
            "status": "ok" if token_count <= budget * 1.1 else "over_budget",
            "delta_from_baseline": token_count - budget
        }

composer = PromptComposer("pr_review_agent")
system_prompt = composer.build_system_prompt()
validation = composer.validate_against_benchmarks(system_prompt)
print(f"System prompt: {validation['token_count']} tokens ({validation['budget_utilization']:.1%} of budget)")
Tip: Store your prompt library in a separate Git repository from your application code, or in a dedicated directory with its own review process. Prompt changes are configuration changes with direct cost and quality implications. Treating them with the same rigor as infrastructure changes — PR reviews, testing requirements, staged rollouts — prevents the ad-hoc "just a quick fix" changes that cause most prompt regressions.
Context Templates for Token-Efficient Information Passing
When agents need external context (code files, tickets, documentation), the format of that context dramatically affects token consumption. Context templates standardize how information is passed to agents.
Context Template Design Principles
Principle 1: Include only what the agent needs for the task
Instead of passing the entire raw ticket, pass a structured summary:
# Before: dump the raw ticket — 2,000+ tokens including HTML, metadata, comments
ticket_context = get_full_jira_ticket(ticket_id)

# After: a structured summary with a bounded token cost
def format_ticket_context(ticket_id: str) -> str:
    ticket = get_full_jira_ticket(ticket_id)
    return f"""## Ticket: {ticket['key']}
**Title**: {ticket['summary']}
**Type**: {ticket['issue_type']} | **Priority**: {ticket['priority']} | **Status**: {ticket['status']}
**Reporter**: {ticket['reporter']} | **Assignee**: {ticket['assignee']}
**Description**:
{truncate_to_tokens(ticket['description'], max_tokens=400)}
**Acceptance Criteria**:
{format_as_bullets(ticket['acceptance_criteria'])}
**Labels**: {', '.join(ticket['labels'])}
"""
Principle 2: Use compact formats over natural language for structured data
def format_code_context(file_path: str, include_full_content: bool = False) -> str:
    code = read_file(file_path)
    if include_full_content:
        return f"""### File: {file_path}
```{get_extension(file_path)}
{code}
```"""
    # Compact summary for large files
    symbols = extract_symbols(code)  # functions, classes, constants
    return f"""### File: {file_path} ({count_lines(code)} lines)
**Symbols**: {', '.join(symbols[:20])}
**Dependencies**: {', '.join(get_imports(code)[:10])}
**Last modified**: {get_last_modified(file_path)}
**Key sections** (abbreviated):
{extract_key_sections(code, max_tokens=600)}
"""

def format_pr_context(pr_id: str) -> str:
    pr = get_pr_data(pr_id)
    return f"""### PR #{pr['number']}: {pr['title']}
**Author**: {pr['author']} | **Branch**: {pr['head']} → {pr['base']}
**Files changed**: {pr['changed_files']} | **+{pr['additions']}/-{pr['deletions']}**
**Reviewers requested**: {', '.join(pr['reviewers'])}
**Labels**: {', '.join(pr['labels'])}
**Description**:
{truncate_to_tokens(pr['body'], max_tokens=200)}
**Diff** (key changes):
{pr['diff']}
"""
Principle 3: Use tiered context depth based on agent task complexity
CONTEXT_DEPTH_LEVELS = {
    "minimal": {
        "ticket_fields": ["key", "summary", "type", "priority"],
        "max_description_tokens": 100,
        "include_comments": False,
        "include_attachments": False
    },
    "standard": {
        "ticket_fields": ["key", "summary", "type", "priority", "status", "acceptance_criteria"],
        "max_description_tokens": 400,
        "include_comments": True,
        "max_comments": 3,
        "include_attachments": False
    },
    "comprehensive": {
        "ticket_fields": "all",
        "max_description_tokens": 1200,
        "include_comments": True,
        "max_comments": 10,
        "include_attachments": True
    }
}

def get_context_depth(task_type: str) -> str:
    """Map task types to appropriate context depth."""
    depth_map = {
        "classification": "minimal",
        "triage": "minimal",
        "summarization": "standard",
        "code_review": "standard",
        "implementation_planning": "comprehensive",
        "architecture_review": "comprehensive"
    }
    return depth_map.get(task_type, "standard")
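Putting the two together, a sketch of a formatter that emits only what the configured depth allows (`get_full_jira_ticket` and `truncate_to_tokens` as used earlier; the comment handling is illustrative):

```python
def format_ticket_with_depth(ticket_id: str, task_type: str) -> str:
    """Format a ticket using only the fields its task's depth level allows."""
    depth = CONTEXT_DEPTH_LEVELS[get_context_depth(task_type)]
    ticket = get_full_jira_ticket(ticket_id)

    fields = ticket.keys() if depth["ticket_fields"] == "all" else depth["ticket_fields"]
    lines = [f"**{field}**: {ticket[field]}"
             for field in fields if field in ticket and field != "description"]

    lines.append("**Description**: "
                 + truncate_to_tokens(ticket["description"],
                                      max_tokens=depth["max_description_tokens"]))
    if depth["include_comments"]:
        for comment in ticket.get("comments", [])[:depth.get("max_comments", 0)]:
            lines.append(f"> {comment}")
    return "\n".join(lines)
```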
Shared Context Template Library
context_templates/
├── code/
│   ├── file_summary.md          # For referencing a file without full content
│   ├── diff_review_context.md   # For PR diff analysis
│   └── codebase_map.md          # For multi-file analysis
├── project/
│   ├── ticket_standard.md       # Standard Jira/Linear ticket format
│   ├── ticket_minimal.md        # For classification/triage
│   └── sprint_context.md        # Sprint overview for planning agents
├── documentation/
│   ├── api_reference.md         # For referencing API docs
│   └── runbook_excerpt.md       # For incident response agents
└── test/
    ├── test_failure_context.md  # For test failure analysis
    └── coverage_report.md       # For test generation agents
Tip: Measure the token cost of your context templates by running them against 50 representative inputs and computing average, P50, and P95 token counts. Publish these benchmarks alongside the templates. Engineers who see "Standard ticket template: ~320 tokens average" make more informed decisions about when to use comprehensive vs. minimal context depth.
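A sketch of that benchmark, assuming a list of representative inputs and a template function in the style of `format_ticket_context` (`count_tokens` as defined earlier):

```python
import statistics

def benchmark_template(format_fn, sample_inputs: list) -> dict:
    """Token-cost profile of a context template over representative inputs."""
    counts = sorted(count_tokens(format_fn(x)) for x in sample_inputs)
    return {
        "samples": len(counts),
        "average": round(statistics.mean(counts)),
        "p50": counts[len(counts) // 2],
        "p95": counts[int(0.95 * (len(counts) - 1))],  # nearest-rank approximation
    }

# e.g. benchmark_template(format_ticket_context, recent_ticket_ids[:50])
```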
Building a Team Optimization Practice
Tooling without culture does not work. These are the practices that make team-level optimization sustainable.
The Prompt Review Process
Any change to a production prompt should go through a lightweight review process:
## Prompt Change Review Checklist
### Before Submitting a Prompt PR
- [ ] Token count before and after documented in PR description
- [ ] Percentage change calculated (using `make count-tokens`; one implementation is sketched after this checklist)
- [ ] Quality validation run on eval set (>50 examples)
- [ ] metadata.yaml updated with new token count and optimization note
- [ ] If adding tokens: justification provided (what quality improvement does this enable?)
- [ ] If removing tokens: quality guard metrics confirmed stable
### Review Criteria for Approvers
- [ ] Change is additive (adds a component) OR has a quality validation result
- [ ] No unexplained token increases > 10%
- [ ] Formatting is consistent with library standards
- [ ] No duplication of existing components
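The `make count-tokens` target is team-specific; one way to compute the before/after numbers is to compare the working tree against the last commit. A sketch using `git show` (`count_tokens` as defined earlier):

```python
import subprocess
from pathlib import Path

def token_delta(prompt_path: str) -> dict:
    """Compare a prompt's token count in the working tree vs. the last commit."""
    committed = subprocess.run(
        ["git", "show", f"HEAD:{prompt_path}"],  # path relative to the repo root
        capture_output=True, text=True, check=True,
    ).stdout
    before = count_tokens(committed)
    after = count_tokens(Path(prompt_path).read_text())
    return {"before": before, "after": after,
            "delta": after - before, "pct_change": (after - before) / before}

# e.g. token_delta("agents/pr_review_agent/system_prompt.md")
```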
Prompt Linting in CI
Add a prompt linter to your CI pipeline that enforces basic token hygiene:
import sys
from pathlib import Path

import tiktoken

PROMPTS_ROOT = Path("prompts")

RULES = {
    "max_system_prompt_tokens": 1500,
    "warn_system_prompt_tokens": 800,
    "max_line_length": 200,
    "max_consecutive_blank_lines": 2
}

def count_tokens(text: str) -> int:
    enc = tiktoken.encoding_for_model("gpt-4o")
    return len(enc.encode(text))

def lint_prompt(prompt_path: Path) -> list[str]:
    issues = []
    content = prompt_path.read_text()

    token_count = count_tokens(content)
    if token_count > RULES["max_system_prompt_tokens"]:
        issues.append(f"ERROR: {prompt_path} exceeds max token limit "
                      f"({token_count} > {RULES['max_system_prompt_tokens']})")
    elif token_count > RULES["warn_system_prompt_tokens"]:
        issues.append(f"WARNING: {prompt_path} is large "
                      f"({token_count} tokens — consider compression)")

    # Check for common verbosity patterns
    if "As an AI assistant" in content or "As a helpful AI" in content:
        issues.append(f"STYLE: {prompt_path} contains redundant AI self-description")
    if content.count("Please ") > 3:
        issues.append(f"STYLE: {prompt_path} uses 'Please' excessively (reduces density)")

    lines = content.split('\n')
    blank_run = 0
    for i, line in enumerate(lines, 1):
        if len(line) > RULES["max_line_length"]:
            issues.append(f"FORMAT: {prompt_path}:{i} line exceeds {RULES['max_line_length']} chars")
        blank_run = blank_run + 1 if not line.strip() else 0
        if blank_run == RULES["max_consecutive_blank_lines"] + 1:
            issues.append(f"FORMAT: {prompt_path}:{i} exceeds {RULES['max_consecutive_blank_lines']} consecutive blank lines")

    # Check metadata exists
    metadata_path = prompt_path.parent / "metadata.yaml"
    if not metadata_path.exists():
        issues.append(f"ERROR: {prompt_path} missing metadata.yaml")

    return issues

if __name__ == "__main__":
    # rglob returns a generator; materialize it so it can be counted below
    prompt_files = list(PROMPTS_ROOT.rglob("system_prompt.md"))
    all_issues = []
    for prompt_file in prompt_files:
        all_issues.extend(lint_prompt(prompt_file))

    errors = [i for i in all_issues if i.startswith("ERROR")]
    warnings = [i for i in all_issues if i.startswith("WARNING")]

    for issue in all_issues:
        print(issue)
    if errors:
        print(f"\n{len(errors)} errors found. Fix before merging.")
        sys.exit(1)
    print(f"\n{len(warnings)} warnings. {len(prompt_files)} prompts checked.")
Weekly Token Optimization Stand-Up Format
A 15-minute weekly check-in format that keeps token optimization visible without consuming significant meeting time:
## Weekly Token Optimization Check-In (15 min)
**Format**: 3-2-1 Report + Action Items
**3 Metrics** (5 min):
1. Total weekly cost vs. last week (% change)
2. Top regression this week (agent + estimated cost)
3. Optimization wins shipped this week (token reduction + savings)
**2 Hypotheses** (5 min):
1. Most promising hypothesis from the backlog — quick status update
2. Any new hypothesis from this week's observations
**1 Decision** (5 min):
- One experiment to start, one to close, or one change to ship
**Async prep**: Token optimization owner posts the 3-2-1 report to Slack
24 hours before the meeting so the team can review.
Tip: Keep the weekly check-in strictly to 15 minutes by doing all the analysis async before the meeting. The meeting is for decisions and alignment, not for reading dashboards together. If it regularly runs over 15 minutes, you are solving problems in the meeting that should have been solved beforehand.
Persona-Specific Optimization Standards
Different personas on the team interact with AI agents differently. Product managers have different patterns of inefficiency than engineers. QA engineers have different context needs than architects. Team-level optimization should address each persona.
For Software Engineers
Common inefficiency patterns:
- Tool definition bloat: Defining many tools with verbose descriptions when the agent only uses 2-3 regularly
- Context window leaks: Accumulating conversation history across sessions unnecessarily
- Code context overloading: Passing entire files when only relevant functions are needed
Standards to adopt:
## Engineering Prompt Standards
### Tool Definitions
- Limit to ≤10 tool definitions per agent
- Each tool description: ≤50 words
- Group related operations into single multi-parameter tools where possible
### Code Context
- Pass function signatures + docstrings first; full implementation only if requested
- Use file path references with line numbers instead of full file content for files >200 lines
- Maximum 3 files in a single context injection; use summarization for additional files
### Conversation Management
- Clear session context after task completion (do not carry conversation history across tasks)
- Use system prompt caching for all agents with static instructions >200 tokens
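That caching standard maps directly onto provider features. A sketch with the Anthropic SDK: marking the static system block with `cache_control` lets later calls reuse the cached prefix (the model name comes from the metadata example above; the context variables are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = composer.build_system_prompt()  # static instructions (PromptComposer, above)
pr_context = format_pr_context("1234")          # per-request context (illustrative PR id)

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache the static prefix across calls
    }],
    messages=[{"role": "user", "content": pr_context}],
)
```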
For QA Engineers
Common inefficiency patterns:
- Test case context: passing full test files when only the failing test matters
- Redundant framework context: re-explaining the testing framework in every prompt
- Coverage report verbosity: including full coverage reports when only uncovered areas are relevant
Standards to adopt:
## QA Prompt Standards
### Test Failure Analysis
- Pass only the failing test and the function under test (not the entire test file)
- Include the error message and stack trace (truncated to 20 lines if longer; see the sketch after this list)
- Reference the test framework once in the system prompt; do not repeat in each user turn
### Test Generation
- Specify coverage targets in the system prompt, not per-request
- Use the `test_coverage_compact.md` context template (shows uncovered lines only)
- For large functions, pass the function signature + inline comments; agent can request full body
### Bug Triage
- Use the `minimal` ticket context depth for initial triage
- Escalate to `standard` only if the agent requests more context
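A minimal sketch of that truncation rule, suitable for the harness that assembles failure context:

```python
def truncate_stack_trace(trace: str, max_lines: int = 20) -> str:
    """Keep the first max_lines of a stack trace, noting how much was cut."""
    lines = trace.splitlines()
    if len(lines) <= max_lines:
        return trace
    omitted = len(lines) - max_lines
    return "\n".join(lines[:max_lines]) + f"\n[... {omitted} more lines omitted]"
```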
For Product Managers
Common inefficiency patterns:
- Long requirement documents pasted wholesale into prompts
- Re-explaining business context that could live in the system prompt
- Using story generation agents for simple template-fill tasks that need no LLM
Standards to adopt:
## Product Management Prompt Standards
### User Story Generation
- Pre-load team context (tech stack, conventions, sprint velocity) in system prompt
- Pass only the feature description and acceptance criteria in user turns
- Use the `ticket_minimal.md` template for story creation; upgrade to `ticket_standard.md` for refinement
### Sprint Planning
- Use the `sprint_context.md` template for velocity and capacity data
- Limit backlog items passed as context to 15-20 items maximum
- For retrospectives: pass structured action items, not full meeting transcripts
### Document Analysis
- For PRD review: extract key decisions and open questions into structured format before passing to LLM
- Avoid passing raw meeting notes; use the `meeting_summary_compact.md` template
Tip: Run persona-specific optimization workshops once per quarter. Bring together all engineers, QA engineers, and PMs who work with AI agents for a two-hour session. Have each persona share their most common agent tasks and measure the token consumption of their typical prompts live. Peer-to-peer learning — engineers seeing how QA uses agents, PMs seeing what's inside a system prompt — creates cross-functional improvements that no top-down policy could produce.
Measuring Team-Level Optimization Success
Track these metrics to assess whether team-level practices are taking hold:
Adoption metrics:
- % of agents using prompt components from the shared library (target: >80%)
- % of prompt PRs that include token count delta (target: 100%)
- Number of experiments run per month per team
Outcome metrics:
- Month-over-month change in cost per unit of work (adjusted for volume and task complexity)
- Number of prompt regressions caught in CI vs. caught in production
- Time from "optimization hypothesis identified" to "shipped and measured"
Knowledge sharing metrics:
- Number of new entries added to the "What Works" registry per month
- Number of teams that adopted an optimization technique discovered by another team
- Average quality score across all agents (tracked over time)
Tip: Post a monthly "Token Optimization Leaderboard" showing cost-per-outcome for each agent, ordered from most efficient to least. Teams consistently improve metrics that are visible and attributed to them. The leaderboard should not be punitive — it should celebrate the most efficient agents and create positive peer pressure toward improvement.
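A sketch of the ranking behind such a leaderboard, assuming each agent reports monthly cost and completed-task counts (the stat names are illustrative):

```python
def rank_agents(agent_stats: dict[str, dict]) -> list[tuple[str, float]]:
    """Rank agents by cost per completed task, most efficient first."""
    rows = [
        (name, stats["monthly_cost_usd"] / stats["tasks_completed"])
        for name, stats in agent_stats.items()
        if stats["tasks_completed"] > 0
    ]
    return sorted(rows, key=lambda row: row[1])

# e.g. rank_agents({"pr_review_agent": {"monthly_cost_usd": 840.0, "tasks_completed": 1200}})
```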
Summary
Team-level optimization multiplies the impact of individual work by creating shared infrastructure and shared knowledge. The key components are:
- A versioned, metadata-rich shared prompt library with composition tooling
- Context templates that standardize token-efficient information passing
- A prompt review process with CI-enforced linting
- Weekly 15-minute optimization check-ins using the 3-2-1 format
- Persona-specific standards for engineers, QA, and product managers
- Visibility mechanisms (leaderboards, dashboards, Slack digests) that create cultural accountability