
This is a complete, practical exercise designed to take a real (or representative) AI agent context configuration and systematically reduce its token footprint by 50% or more without degrading quality. You will apply every concept from this module — the context hierarchy, lean system prompt design, format selection, context selection, and layering strategies — in a structured audit and optimization workflow.

This exercise is written for software engineers, QA engineers, and product managers. The example project is a code review agent integrated with a GitHub Actions pipeline, but the methodology applies to any AI agent configuration.


The Starting Point: A Representative Bloated Configuration

Before beginning the audit, meet the subject: a code review agent built by a team over several months. Like most real-world configurations, it started simple and grew through accretion — each addition justified in isolation, the cumulative cost never reviewed.

Current configuration: CLAUDE.md / system prompt (combined)


## About This Agent
You are an expert code reviewer who has been helping our team review pull requests for the past year. You have deep knowledge of our tech stack which includes TypeScript, Node.js, Express, PostgreSQL with Prisma ORM, Redis for caching, and we deploy to AWS using ECS containers. Our team consists of 8 engineers, 2 QA engineers, and 1 product manager. We follow agile development with 2-week sprints. We use GitHub for version control and GitHub Actions for CI/CD.

## How To Review Code
When reviewing code, it's really important that you carefully read all the code provided and look for any issues. You should look for bugs first because bugs are the most important thing to catch. Security issues are also very important, especially things like SQL injection, authentication bypasses, and any place where sensitive data might be exposed. You should also look at performance and make sure there aren't any obvious performance problems. After looking at all those things, you should also check the code style to make sure it follows our standards.

## Our Tech Stack Details
We use TypeScript with strict mode enabled. Our Node.js version is 20 LTS. We use Express 4 for our HTTP server. Our database is PostgreSQL 15 and we use Prisma as our ORM. For caching we use Redis 7 with ioredis as our client library. We deploy to AWS ECS using Docker containers. Our container base image is node:20-alpine. We use GitHub Actions for CI/CD and we have a pipeline that runs tests, linting, and builds the Docker image on every pull request.

## Code Style Standards
Our code style follows the Airbnb TypeScript ESLint configuration. We use Prettier for formatting. We require 4 spaces for indentation. Variable names should be in camelCase. Type names and class names should be in PascalCase. Constants should be in SCREAMING_SNAKE_CASE. File names should be in kebab-case. We limit line length to 100 characters. We require JSDoc comments for all public functions and methods.

## Testing Requirements
We use Vitest for our tests. We require a minimum of 80% branch coverage for any new code. Tests should be organized in describe blocks that match the module structure. Each test should have a clear name that describes what it's testing. We use snapshot testing for complex object comparisons. We mock external dependencies using vi.mock(). Integration tests are in a separate directory at tests/integration/.

## Security Rules - VERY IMPORTANT
Please always make sure to look for security issues very carefully. This is very important because we handle user financial data. Never approve any code that has SQL string concatenation instead of parameterized queries. Never approve code that logs sensitive user information like passwords, SSNs, credit card numbers, or email addresses to application logs. Never approve code that hard-codes API keys, database credentials, or any secrets. Never approve code that has authentication bypasses. Always check that user input is validated before being used in database queries or external API calls. Please be very thorough about security.

## Output Format Instructions
When you complete your review, please provide your response in a clear, well-structured format. Start with a summary of the overall quality of the PR. Then provide a list of any issues you found, organized by severity (critical, high, medium, low). For each issue, provide the file name and line number where the issue is located, a clear description of what the problem is, and a suggestion for how to fix it. After the issues, provide any optional suggestions for improvements that aren't required but would make the code better. End with a clear verdict: either APPROVE, REQUEST CHANGES, or NEEDS DISCUSSION.

## Reminder About Our Values
Remember that we value code quality, security, and maintainability above all else. We want code that is easy to understand and maintain. We believe in writing code that is clear rather than clever. We also value psychological safety, so please be constructive and professional in your feedback. Focus on the code, not the person who wrote it.

Step 1: Measure the baseline token count.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
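# Note: cl100k_base is an OpenAI encoding, so counts here approximate Claude's
# tokenizer. That is fine for this exercise: the comparison is relative.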

with open("CLAUDE.md") as f:
    content = f.read()

token_count = len(enc.encode(content))
print(f"Current token count: {token_count}")

Baseline: 847 tokens. Target: <424 tokens (50% reduction). Let us begin the audit.

Tip: Always establish a numeric baseline before starting any optimization. Without a baseline, you cannot measure progress and you cannot make the case to your team that the optimization effort was worthwhile. Spend 10 minutes measuring before spending hours optimizing.


Phase 1: Bloat Identification Audit

Apply the bloat classification framework to every sentence in the configuration. Label each element as one of:

  • KEEP — genuinely needed, high density instruction
  • COMPRESS — needed but expressed verbosely
  • RELOCATE — needed but belongs in a different layer (dynamic or on-demand)
  • REMOVE — not needed; removing will not change behavior

Audit results — section by section:

"About This Agent" section:
- "You are an expert code reviewer..." — COMPRESS (role can be stated in one line)
- "our tech stack which includes TypeScript, Node.js, Express, PostgreSQL..." — RELOCATE (this is environment facts, belongs in a [CONTEXT] block, not narrative prose)
- "Our team consists of 8 engineers, 2 QA engineers..." — REMOVE (irrelevant to code review behavior)
- "We follow agile development with 2-week sprints" — REMOVE (no impact on review behavior)

"How To Review Code" section:
- Entire section — COMPRESS (one rambling paragraph → a numbered priority list)

"Our Tech Stack Details" section:
- All facts — COMPRESS (narrative → labeled key-value list, removing redundancy with "About" section)

"Code Style Standards" section:
- camelCase/PascalCase conventions — REMOVE (ESLint enforces this; stating it in the prompt adds tokens without adding behavior)
- "We require 4 spaces for indentation" — REMOVE (Prettier handles this)
- Airbnb + Prettier reference — KEEP (compact, high-value reference)
- JSDoc requirement — KEEP (not auto-enforced, model needs to know)

"Testing Requirements" section:
- Vitest, 80% coverage, describe blocks — KEEP (relevant to code review)
- "Tests should be organized in describe blocks that match the module structure" — COMPRESS
- Snapshot testing, vi.mock() details — RELOCATE (too specific; belongs in dynamic context for test-related reviews only)

"Security Rules" section:
- Rules themselves — KEEP (critical)
- "Please always make sure to look for security issues very carefully" — REMOVE (redundant with the rule list)
- "This is very important because we handle user financial data" — REMOVE (explanation, not instruction)
- "Please be very thorough about security" — REMOVE (empty instruction)

"Output Format" section:
- Entire prose description — COMPRESS (replace with a schema example)

"Reminder About Values" section:
- "code quality, security, and maintainability" — REMOVE (already covered by review priorities)
- "code that is easy to understand" — REMOVE (redundant)
- "psychological safety" tone instruction — COMPRESS to one line

Tip: When auditing, be aggressive with the "REMOVE" label. The instinct is to preserve everything because "it might matter." Challenge that instinct: if you cannot describe a concrete test case where removing a sentence would produce a wrong output, it is very likely safe to remove. You can always add it back if a problem emerges.


Phase 2: Optimization — Rewriting the Configuration

Apply the audit results to produce the optimized configuration.

Optimized CLAUDE.md:


[ROLE]
Senior code reviewer for a TypeScript/Node.js team. Priority: security, correctness, maintainability.

[CONTEXT]
Stack: TypeScript 5 (strict), Node.js 20, Express 4, PostgreSQL 15 (Prisma), Redis 7 (ioredis), AWS ECS
Style: Airbnb TypeScript ESLint + Prettier, max 100 char lines, JSDoc for public functions
Tests: Vitest, 80% branch coverage required, vi.mock() for external deps

[BEHAVIOR — in priority order]
1. Bugs and logic errors
2. Security: SQL injection, auth bypass, secrets exposure, PII in logs, unvalidated input
3. Test coverage: new code must meet 80% branch coverage
4. Style: flag only if ESLint/Prettier would fail
5. Suggestions: optional improvements after issues above

[CONSTRAINTS — never approve if any of these exist]
- SQL string concatenation (must use parameterized queries)
- Hard-coded secrets or credentials
- PII logged (passwords, SSNs, credit cards, emails)
- Authentication bypasses
- Unvalidated user input used in DB queries or external calls

[OUTPUT FORMAT]
Return structured review:
SUMMARY: (1–2 sentences, overall quality)
ISSUES: (numbered list, format: [SEVERITY] file:line — description — suggested fix)
  Severity levels: CRITICAL | HIGH | MEDIUM | LOW
SUGGESTIONS: (numbered, optional improvements)
VERDICT: APPROVE | REQUEST_CHANGES | NEEDS_DISCUSSION

[TONE]
Constructive and professional. Focus on code, not author.

Measure the optimized token count:

optimized_content = """
[ROLE]
Senior code reviewer for a TypeScript/Node.js team. Priority: security, correctness, maintainability.

[CONTEXT]
Stack: TypeScript 5 (strict), Node.js 20, Express 4, PostgreSQL 15 (Prisma), Redis 7 (ioredis), AWS ECS
Style: Airbnb TypeScript ESLint + Prettier, max 100 char lines, JSDoc for public functions
Tests: Vitest, 80% branch coverage required, vi.mock() for external deps

[BEHAVIOR — in priority order]
1. Bugs and logic errors
2. Security: SQL injection, auth bypass, secrets exposure, PII in logs, unvalidated input
3. Test coverage: new code must meet 80% branch coverage
4. Style: flag only if ESLint/Prettier would fail
5. Suggestions: optional improvements after issues above

[CONSTRAINTS — never approve if any of these exist]
- SQL string concatenation (must use parameterized queries)
- Hard-coded secrets or credentials
- PII logged (passwords, SSNs, credit cards, emails)
- Authentication bypasses
- Unvalidated user input used in DB queries or external calls

[OUTPUT FORMAT]
Return structured review:
SUMMARY: (1–2 sentences, overall quality)
ISSUES: (numbered list, format: [SEVERITY] file:line — description — suggested fix)
  Severity levels: CRITICAL | HIGH | MEDIUM | LOW
SUGGESTIONS: (numbered, optional improvements)
VERDICT: APPROVE | REQUEST_CHANGES | NEEDS_DISCUSSION

[TONE]
Constructive and professional. Focus on code, not author.
"""

optimized_tokens = len(enc.encode(optimized_content))
print(f"Optimized token count: {optimized_tokens}")

reduction = (1 - optimized_tokens / 847) * 100
print(f"Token reduction: {reduction:.1f}%")

Result: 248 tokens — a 70.7% reduction, well exceeding the 50% target.

Tip: When you achieve a larger reduction than targeted, resist the urge to add content back in. The surplus represents future room to grow. If you later discover a behavioral gap (the model fails on a class of inputs), you have budget to add the minimum instruction needed to close that gap without returning to the bloated baseline.
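
To lock in the surplus, add a regression guard to CI that fails when the static prompt grows past budget. A minimal sketch; the 400-token budget (optimized size plus headroom) is an assumption for this project:

import sys
import tiktoken

TOKEN_BUDGET = 400  # assumed: 248-token optimized prompt plus headroom

enc = tiktoken.get_encoding("cl100k_base")
with open("CLAUDE.md") as f:
    token_count = len(enc.encode(f.read()))

if token_count > TOKEN_BUDGET:
    print(f"CLAUDE.md is {token_count} tokens, over the {TOKEN_BUDGET}-token budget")
    sys.exit(1)  # fail the CI job
print(f"CLAUDE.md is {token_count} tokens, within budget")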


Phase 3: Quality Validation

Token reduction is worthless if it degrades review quality. Use the golden test set protocol to validate.

Build a golden test set representing the most important review scenarios:


test_cases = [
    {
        "id": "security-sql-injection",
        "input": """
Review this PR:

diff --git a/src/users/user-service.ts b/src/users/user-service.ts
+  async getUserByEmail(email: string) {
+    const result = await db.query(`SELECT * FROM users WHERE email = '${email}'`);
+    return result.rows[0];
+  }
""",
        "must_contain": ["CRITICAL", "SQL", "injection", "REQUEST_CHANGES"],
        "must_not_contain": ["APPROVE"]
    },
    {
        "id": "clean-code-approval",
        "input": """
Review this PR:

diff --git a/src/users/user-service.ts b/src/users/user-service.ts
+  async getUserByEmail(email: string): Promise<User | null> {
+    const result = await prisma.user.findUnique({
+      where: { email }
+    });
+    return result;
+  }
""",
        "must_contain": ["APPROVE"],
        "must_not_contain": ["CRITICAL", "HIGH"]
    },
    {
        "id": "secrets-exposure",
        "input": """
Review this PR:

diff --git a/src/config/database.ts b/src/config/database.ts
+  const DB_PASSWORD = "prod-password-123";
+  const connection = new Pool({ password: DB_PASSWORD });
""",
        "must_contain": ["CRITICAL", "hard-coded", "REQUEST_CHANGES"],
        "must_not_contain": ["APPROVE"]
    },
    {
        "id": "medium-issue-suggestion",
        "input": """
Review this PR:

diff --git a/src/api/users.ts b/src/api/users.ts
+  router.get('/users/:id', async (req, res) => {
+    const user = await userService.getById(req.params.id);
+    res.json(user);
+  });
""",
        "must_contain": ["MEDIUM", "error handling"],
        "must_not_contain": ["CRITICAL"]
    }
]

Run the validation:

import anthropic
import json

client = anthropic.Anthropic()

def validate_prompt(system_prompt: str, test_cases: list[dict]) -> dict:
    results = {"passed": 0, "failed": 0, "failures": []}

    for case in test_cases:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}]
        )
        output = response.content[0].text.lower()

        passed = True
        failure_details = []

        for required in case.get("must_contain", []):
            if required.lower() not in output:
                passed = False
                failure_details.append(f"Missing required: '{required}'")

        for forbidden in case.get("must_not_contain", []):
            if forbidden.lower() in output:
                passed = False
                failure_details.append(f"Contains forbidden: '{forbidden}'")

        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "test_id": case["id"],
                "details": failure_details,
                "output_preview": output[:200]
            })

    results["pass_rate"] = results["passed"] / len(test_cases)
    return results

original_results = validate_prompt(content, test_cases)  # bloated prompt loaded in Step 1
print(f"Original pass rate: {original_results['pass_rate']:.0%}")

optimized_results = validate_prompt(optimized_content, test_cases)  # lean prompt from Phase 2
print(f"Optimized pass rate: {optimized_results['pass_rate']:.0%}")

if optimized_results["failures"]:
    print("Failed cases:", json.dumps(optimized_results["failures"], indent=2))

If any test cases fail with the optimized prompt, add back the minimum instruction necessary to fix the failure. In practice, the structured optimized prompt above passes all standard code review test cases because the information it contains is the same — just expressed more densely.

Tip: Run your golden test set across at least 3 different runs per prompt version, since LLM outputs are non-deterministic. If a test fails in 1 out of 3 runs, it indicates a borderline instruction that needs strengthening. If it fails in 3 out of 3 runs, it is a genuine gap in the optimized prompt.
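
The 3-run protocol is easy to automate on top of validate_prompt. A minimal sketch, reusing the function defined above:

from collections import Counter

def validate_multi_run(system_prompt: str, test_cases: list[dict], runs: int = 3) -> None:
    # Count how many runs each test fails in: flaky failures point to a
    # borderline instruction, consistent failures to a genuine gap.
    failure_counts = Counter()
    for _ in range(runs):
        results = validate_prompt(system_prompt, test_cases)
        for failure in results["failures"]:
            failure_counts[failure["test_id"]] += 1

    for test_id, count in failure_counts.items():
        verdict = "genuine gap" if count == runs else "borderline instruction"
        print(f"{test_id}: failed {count}/{runs} runs ({verdict})")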


Phase 4: Implementing the Layering Architecture

Now that you have a lean static layer, implement the dynamic and on-demand layers to further reduce per-call token cost.

Add dynamic context assembly:


STATIC_SYSTEM_PROMPT = """[ROLE]
Senior code reviewer for a TypeScript/Node.js team. Priority: security, correctness, maintainability.

[CONTEXT]
Stack: TypeScript 5 (strict), Node.js 20, Express 4, PostgreSQL 15 (Prisma), Redis 7 (ioredis), AWS ECS
Style: Airbnb TypeScript ESLint + Prettier, max 100 char lines, JSDoc for public functions
Tests: Vitest, 80% branch coverage required, vi.mock() for external deps

[BEHAVIOR — in priority order]
1. Bugs and logic errors
2. Security: SQL injection, auth bypass, secrets exposure, PII in logs, unvalidated input
3. Test coverage: new code must meet 80% branch coverage
4. Style: flag only if ESLint/Prettier would fail

[CONSTRAINTS — never approve if any of these exist]
- SQL string concatenation
- Hard-coded secrets or credentials
- PII in logs
- Auth bypasses
- Unvalidated user input in DB/external calls

[OUTPUT FORMAT]
SUMMARY: (1–2 sentences)
ISSUES: [SEVERITY] file:line — description — fix
SUGGESTIONS: (optional)
VERDICT: APPROVE | REQUEST_CHANGES | NEEDS_DISCUSSION

[TONE]
Constructive. Focus on code."""

DYNAMIC_BLOCKS = {
    "large_pr": """
Note: This is a large PR (>500 lines changed). Focus review on:
- High-risk changes (security, data layer, auth)
- Architectural patterns introduced
- Flag if PR should be split for safer review
""",
    "migration": """
Database migration review additions:
- Verify migration is reversible (down migration exists)
- Check for table locks on large tables (ALTER TABLE without CONCURRENTLY)
- Ensure indexes are created CONCURRENTLY if table has data
- Verify migration does not break existing queries
""",
    "auth_module": """
Auth module review additions:
- Verify JWT signing secret is from environment variable
- Check token expiry is set and reasonable (access: 15m, refresh: 7d max)
- Verify refresh token rotation is implemented
- Check for timing attack vulnerabilities in comparison functions
"""
}

def detect_pr_signals(pr_diff: str, pr_title: str, changed_files: list[str]) -> dict:
    title = pr_title.lower()
    signals = {
        # Rough proxy for "large PR": total diff lines rather than changed lines
        "is_large_pr": len(pr_diff.split('\n')) > 500,
        "has_migration": "migration" in title
            or any("migration" in f for f in changed_files),
        "touches_auth": any(k in title for k in ("auth", "login", "token"))
            or any("auth" in f or "login" in f or "token" in f for f in changed_files)
    }
    return signals

def build_system_prompt(pr_diff: str, pr_title: str, changed_files: list[str]) -> str:
    signals = detect_pr_signals(pr_diff, pr_title, changed_files)

    dynamic_additions = []
    if signals["is_large_pr"]:
        dynamic_additions.append(DYNAMIC_BLOCKS["large_pr"])
    if signals["has_migration"]:
        dynamic_additions.append(DYNAMIC_BLOCKS["migration"])
    if signals["touches_auth"]:
        dynamic_additions.append(DYNAMIC_BLOCKS["auth_module"])

    if dynamic_additions:
        return STATIC_SYSTEM_PROMPT + "\n\n[ADDITIONAL CONTEXT FOR THIS PR]\n" + "\n".join(dynamic_additions)

    return STATIC_SYSTEM_PROMPT

def review_pr(pr_diff: str, pr_title: str, changed_files: list[str]) -> str:
    client = anthropic.Anthropic()

    system_prompt = build_system_prompt(pr_diff, pr_title, changed_files)

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Review this PR:\nTitle: {pr_title}\n\nDiff:\n{pr_diff}"
        }]
    )

    return response.content[0].text

This dynamic layer adds 60–120 tokens only when relevant signals are detected. For the majority of PRs (no migration, not auth-related, not oversized), the system prompt stays at the lean 248-token baseline.
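
Rather than trusting that estimate, measure each block directly with the encoder from Step 1:

for name, block in DYNAMIC_BLOCKS.items():
    print(f"{name}: {len(enc.encode(block))} tokens")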

Tip: After implementing dynamic context assembly, monitor which dynamic blocks are actually triggered over your first week of production traffic. If a block triggers on <3% of PRs, consider whether it is better served as on-demand context (the model fetches it via a tool when it encounters migration-specific code) rather than a dynamically injected block.
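
A minimal way to gather that trigger data is a counter wrapped around the assembly function. A sketch; where you persist the counts (logs, a metrics service) is left to your pipeline:

from collections import Counter

block_trigger_counts = Counter()
total_prs_reviewed = 0

def build_system_prompt_instrumented(pr_diff: str, pr_title: str, changed_files: list[str]) -> str:
    global total_prs_reviewed
    total_prs_reviewed += 1
    for name, triggered in detect_pr_signals(pr_diff, pr_title, changed_files).items():
        if triggered:
            block_trigger_counts[name] += 1
    return build_system_prompt(pr_diff, pr_title, changed_files)

def report_trigger_rates() -> None:
    # Blocks firing on <3% of PRs are candidates for on-demand retrieval
    for name, count in block_trigger_counts.items():
        print(f"{name}: {count / total_prs_reviewed:.1%} of PRs")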


Phase 5: Measuring the Full Optimization Impact

Run the full measurement protocol to document the results:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def calculate_input_tokens(system_tokens, avg_persistent_tokens,
                           avg_ephemeral_tokens, num_calls):
    # Per-call input = system prompt + persistent context (conversation state,
    # prior findings) + ephemeral context (the PR diff). Output tokens are
    # billed separately and are unaffected by this optimization.
    per_call = system_tokens + avg_persistent_tokens + avg_ephemeral_tokens
    return per_call * num_calls

original_system = 847  # tokens

optimized_system = 248  # tokens
dynamic_addition_avg = 30  # 0 on 70% of calls, ~100 on 30% → avg 30
effective_system = optimized_system + dynamic_addition_avg  # = 278 tokens

baseline_input = calculate_input_tokens(original_system, 200, 1500, 1000)
optimized_input = calculate_input_tokens(effective_system, 200, 1500, 1000)

print(f"Baseline total input tokens (1,000 calls): {baseline_input:,}")
print(f"Optimized total input tokens (1,000 calls): {optimized_input:,}")
print(f"Tokens saved: {baseline_input - optimized_input:,}")
print(f"Reduction: {(1 - optimized_input/baseline_input)*100:.1f}%")

price_per_million = 3.0  # USD per million input tokens (illustrative rate)
savings_usd = (baseline_input - optimized_input) / 1_000_000 * price_per_million
print(f"Cost savings per 1,000 reviews: ${savings_usd:.2f}")
Output:
Baseline total input tokens (1,000 calls): 2,547,000
Optimized total input tokens (1,000 calls): 1,978,000
Tokens saved: 569,000
Reduction: 22.3%
Cost savings per 1,000 reviews: $1.71

Note: In this calculation the system prompt is a small fraction of total input tokens because the PR diff (ephemeral context at ~1,500 tokens) dominates. The system prompt optimization alone achieves ~22% total reduction. To reach 50% total reduction, you also need to optimize ephemeral context, specifically how the PR diff itself is prepared.

Ephemeral context optimization — trimming the PR diff:

def prepare_pr_diff_for_review(raw_diff: str, max_tokens: int = 800) -> str:
    """
    Prepare a PR diff for review by focusing on the most meaningful content.
    Removes generated files, lock files, and truncates very large files.
    """
    enc = tiktoken.get_encoding("cl100k_base")

    # Skip generated/lock files
    skip_patterns = [
        "package-lock.json", "yarn.lock", "pnpm-lock.yaml",
        ".min.js", ".min.css", "dist/", "build/", "__generated__"
    ]

    lines = raw_diff.split('\n')
    filtered_lines = []
    skip_current_file = False

    for line in lines:
        if line.startswith('diff --git'):
            skip_current_file = any(pattern in line for pattern in skip_patterns)
        if not skip_current_file:
            filtered_lines.append(line)

    filtered_diff = '\n'.join(filtered_lines)

    # If still over budget, truncate with a note
    tokens = enc.encode(filtered_diff)
    if len(tokens) > max_tokens:
        truncated = enc.decode(tokens[:max_tokens])
        return truncated + f"\n\n[DIFF TRUNCATED: {len(tokens) - max_tokens} additional tokens omitted. Focus review on the code shown above.]"

    return filtered_diff

With ephemeral context optimization applied (trimming lock files, generated code, and oversized diffs), average ephemeral tokens drop from ~1,500 to ~800, a further ~27% reduction in per-call input tokens. Combined with the system prompt optimization, this brings the total reduction to 50%, meeting the target.
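
Wiring the trimmer into the Phase 4 flow is a small wrapper, sketched below. Note that trimming before review_pr means the is_large_pr signal is computed on the trimmed diff rather than the raw one; detect signals first if you want the opposite:

def review_pr_optimized(raw_diff: str, pr_title: str, changed_files: list[str]) -> str:
    trimmed_diff = prepare_pr_diff_for_review(raw_diff)  # drop lock files, cap size
    return review_pr(trimmed_diff, pr_title, changed_files)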

Final comparison summary:

| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| System prompt tokens | 847 | 278 | -67% |
| Avg persistent context tokens | 200 | 200 | 0% |
| Avg ephemeral context tokens | 1,500 | 800 | -47% |
| Total per-call input tokens | 2,547 | 1,278 | -50% |
| Pass rate on golden test set | 100% | 100% | 0% |

Tip: Document this optimization in a "context engineering decision log" for your project. Record the baseline, the changes made, the validation results, and the measured improvement. This log serves three purposes: it prevents future contributors from reverting optimizations without understanding the rationale, it provides a template for optimizing other agents in your organization, and it demonstrates quantified ROI for the time invested in the optimization work.
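
An entry in that log does not need to be elaborate. A sketch of one, with field names as a suggestion rather than a standard:

optimization_log_entry = {
    "date": "2025-01-15",  # illustrative
    "agent": "code-review-agent",
    "baseline_system_tokens": 847,
    "optimized_system_tokens": 278,  # 248 static + 30 avg dynamic additions
    "changes": [
        "Converted narrative prose to [ROLE]/[CONTEXT]/[BEHAVIOR] blocks",
        "Removed ESLint/Prettier-enforced style rules",
        "Moved migration/auth guidance to dynamic blocks",
        "Trimmed lock files and generated code from PR diffs",
    ],
    "validation": "golden test set, 3 runs per version, 100% pass rate",
    "per_call_input_reduction": "50%",
}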


Applying This Methodology to Other Agent Types

The five-phase methodology you just applied (measure baseline, audit, optimize, validate, measure results) is universal. Here is how it adapts for the other personas:

For QA engineers — test generation agent:
- Audit bloat: Are you including full test files when only the test structure pattern is needed?
- Optimize: Replace full test file examples with 10-line "minimum pattern" examples
- Validate: Confirm generated tests still follow the correct framework, assertion style, and structure
- Dynamic layer: Include framework-specific context only when the file extension matches
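
The dynamic-layer item above uses the same signal-detection pattern as Phase 4. A sketch; the extensions and block contents are assumptions:

FRAMEWORK_BLOCKS = {
    ".test.ts": "Vitest conventions: describe blocks mirror module structure, vi.mock() for external deps",
    ".spec.tsx": "Component test conventions: render via testing library, assert on visible behavior",
}

def build_test_gen_prompt(target_file: str, base_prompt: str) -> str:
    for ext, block in FRAMEWORK_BLOCKS.items():
        if target_file.endswith(ext):
            return base_prompt + "\n\n" + block
    return base_prompt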

For product managers — user story writing agent:
- Audit bloat: Is the agent including the full product backlog to generate one story?
- Optimize: Replace full backlog with a 5-item "related stories" block using keyword search (see the sketch after this list)
- Validate: Confirm generated stories still have correct format, correct persona targeting, and correct acceptance criteria structure
- On-demand: The agent fetches related stories via search tool only when explicitly needed for dependency analysis
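
The keyword search in the "Optimize" step can start as simple word overlap. A sketch, assuming backlog items are dicts with a "summary" field; swap in embeddings later if recall proves poor:

def related_stories(new_story: str, backlog: list[dict], k: int = 5) -> list[dict]:
    # Score each backlog item by how many non-trivial words it shares
    # with the new story, then keep the top k.
    stopwords = {"the", "a", "an", "to", "of", "and", "for", "in", "on", "as", "is"}
    query = set(new_story.lower().split()) - stopwords

    def overlap(story: dict) -> int:
        words = set(story["summary"].lower().split()) - stopwords
        return len(query & words)

    return sorted(backlog, key=overlap, reverse=True)[:k]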

The 50% reduction target is achievable for virtually any well-established agent configuration, because most agent configurations have never been systematically audited. The patterns of bloat — narrative prose where structured lists would do, static content that should be dynamic, pre-loaded context that should be on-demand — are consistent across teams, tools, and domains.

Tip: Schedule a quarterly "context health review" for each AI agent your team maintains. Set a 2-hour timebox, assign an owner, and run the five-phase methodology as a recurring maintenance activity. Context configurations drift toward bloat over time as they are updated reactively. Periodic audits are the only reliable way to keep them lean without sacrificing quality.