Overview and Setup
This hands-on topic is structured as a step-by-step optimization exercise. You are given a real, working agentic workflow — a code review agent — and your task is to apply the techniques from this module to reduce its token consumption by at least 40% without changing what it produces.
The exercise uses LangGraph as the primary framework with LangChain tool definitions. The concepts apply directly to CrewAI, AutoGen, and OpenAI Assistants.
Baseline workflow: A code review agent that:
1. Takes a GitHub repository URL
2. Reads all Python files
3. Analyzes each file for code quality issues (complexity, naming, error handling, test coverage)
4. Generates a comprehensive code review report
5. Suggests specific improvements
By the end of this topic you will have applied six optimizations, measured each one, and produced a version of the workflow that comfortably clears the 40% reduction target on the benchmark task (the worked example below lands near 80%).
Step 0: Establish the Baseline
Before optimizing, instrument the baseline so you can measure your progress. Set up your environment:
pip install langchain langchain-openai langgraph tiktoken sentence-transformers redis
Baseline Workflow Code
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def count_tokens(text: str) -> int:
return len(enc.encode(str(text)))
BASELINE_SYSTEM_PROMPT = """
You are an expert software engineer with 20 years of experience in Python development,
software architecture, code quality, testing, and security. You have deep expertise in:
- PEP 8 and PEP 20 style guidelines
- Object-oriented design patterns and SOLID principles
- Functional programming patterns in Python
- Testing methodologies including TDD, BDD, unit testing, integration testing
- Security best practices including OWASP guidelines
- Performance optimization and profiling
- Code maintainability and technical debt
- Documentation standards including Google docstrings, Numpy docstrings, Sphinx
- Type annotations and mypy
- Async Python programming
- Database access patterns and ORM usage
- API design principles (REST, GraphQL, gRPC)
- Microservices architecture
- Container and deployment patterns
When reviewing code, you should:
1. Read through each file carefully and completely
2. Consider the file in the context of all other files you have read
3. Think about architectural implications
4. Consider edge cases and error scenarios
5. Evaluate test coverage
6. Assess documentation quality
7. Review security implications
8. Consider performance characteristics
9. Evaluate code maintainability
10. Assess adherence to team conventions
Provide thorough, detailed feedback that covers all aspects of code quality.
Always explain your reasoning. Always provide specific line references.
Always provide concrete suggestions for improvement.
After reviewing ALL files, produce a comprehensive report.
"""
class BaselineAgentState(TypedDict):
repo_path: str
files: List[dict] # Holds ALL file contents throughout session
current_file_index: int
review_comments: List[str] # Raw text comments accumulate
final_report: str
total_tokens_used: int
def read_all_files_node(state: BaselineAgentState) -> BaselineAgentState:
"""Read all Python files upfront and store them ALL in state."""
import glob
files = []
for filepath in glob.glob(f"{state['repo_path']}/**/*.py", recursive=True):
with open(filepath, 'r') as f:
content = f.read()
files.append({
"path": filepath,
"content": content # FULL CONTENT stored in state
})
state["files"] = files
state["current_file_index"] = 0
return state
def review_file_node(state: BaselineAgentState) -> BaselineAgentState:
"""Review one file — but sends ALL prior context (all files + all comments)."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
idx = state["current_file_index"]
current_file = state["files"][idx]
# PROBLEM: All prior files and comments are in the message history
# This is sent on every iteration
messages = [
SystemMessage(content=BASELINE_SYSTEM_PROMPT),
HumanMessage(content=f"You are reviewing a Python repository. "
f"Here are ALL the files in the project: "
f"{[f['content'] for f in state['files']]}"), # ALL files every time
HumanMessage(content=f"Previous review comments: {state['review_comments']}"),
HumanMessage(content=f"Now provide detailed review for: {current_file['path']}\n"
f"{current_file['content']}")
]
# Track tokens
input_tokens = sum(count_tokens(str(m.content)) for m in messages)
response = llm.invoke(messages)
output_tokens = count_tokens(response.content)
state["review_comments"].append(response.content) # Raw verbose output accumulates
state["current_file_index"] += 1
state["total_tokens_used"] += input_tokens + output_tokens
return state
def generate_report_node(state: BaselineAgentState) -> BaselineAgentState:
"""Generate final report — sends everything."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=BASELINE_SYSTEM_PROMPT),
HumanMessage(content=f"All files reviewed: {[f['content'] for f in state['files']]}"),
HumanMessage(content=f"All review comments: {state['review_comments']}"),
HumanMessage(content="Produce the final comprehensive code review report.")
]
input_tokens = sum(count_tokens(str(m.content)) for m in messages)
response = llm.invoke(messages)
output_tokens = count_tokens(response.content)
state["final_report"] = response.content
state["total_tokens_used"] += input_tokens + output_tokens
return state
def should_continue_reviewing(state: BaselineAgentState):
if state["current_file_index"] >= len(state["files"]):
return "generate_report"
return "review_file"
baseline_builder = StateGraph(BaselineAgentState)
baseline_builder.add_node("read_files", read_all_files_node)
baseline_builder.add_node("review_file", review_file_node)
baseline_builder.add_node("generate_report", generate_report_node)
baseline_builder.set_entry_point("read_files")
baseline_builder.add_edge("read_files", "review_file")
baseline_builder.add_conditional_edges(
"review_file",
should_continue_reviewing,
{"review_file": "review_file", "generate_report": "generate_report"}
)
baseline_builder.add_edge("generate_report", END)
baseline_graph = baseline_builder.compile()
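Before benchmarking, make sure you can run the compiled graph end to end. A minimal invocation sketch, assuming a hypothetical local checkout at `./sample_project`:

```python
# Run the baseline once and print its token usage.
initial_state = {
    "repo_path": "./sample_project",  # placeholder: point at your benchmark repo
    "files": [],
    "current_file_index": 0,
    "review_comments": [],
    "final_report": "",
    "total_tokens_used": 0,
}

result = baseline_graph.invoke(initial_state)
print(f"Total tokens used: {result['total_tokens_used']:,}")
print(result["final_report"][:500])  # preview the report
```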
Benchmark: Run Baseline and Record
Run the baseline on a representative 8-file Python project and record:
| Metric | Baseline Value |
|---|---|
| Total input tokens | ~85,000 |
| Total output tokens | ~12,000 |
| Total tokens | ~97,000 |
| Iterations | 10 (8 reviews + 1 read + 1 report) |
| Completion time | ~45 seconds |
Tip: Before optimizing, run your baseline workflow on at least three different representative tasks and average the token counts. Optimizations that look impressive on one task can deliver much smaller gains on others. Use the average as your baseline target to beat.
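A minimal sketch of that averaging step, reusing the invocation pattern above; the benchmark repo paths are hypothetical placeholders:

```python
# Average baseline token usage across several representative projects.
benchmark_repos = ["./bench/api_service", "./bench/cli_tool", "./bench/data_pipeline"]

totals = []
for repo in benchmark_repos:
    state = {
        "repo_path": repo,
        "files": [],
        "current_file_index": 0,
        "review_comments": [],
        "final_report": "",
        "total_tokens_used": 0,
    }
    result = baseline_graph.invoke(state)
    totals.append(result["total_tokens_used"])
    print(f"{repo}: {result['total_tokens_used']:,} tokens")

baseline_to_beat = sum(totals) / len(totals)
print(f"Average baseline to beat: {baseline_to_beat:,.0f} tokens")
```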
Optimization 1: Focused System Prompt (Target: -15% tokens)
The baseline system prompt is 1,847 tokens. It is re-sent on every LLM call. For 10 calls, that is 18,470 tokens just for the system prompt.
Replace it with a focused, role-specific prompt:
FOCUSED_SYSTEM_PROMPT = """
You are a Python code reviewer. For each file provided, return a JSON object:
{
"file": "<filename>",
"issues": [
{"severity": "high|medium|low", "category": "security|performance|quality|testing",
"line": <line_number>, "description": "<concise description>", "fix": "<specific fix>"}
],
"summary": "<2-3 sentence overall assessment>"
}
Focus on actionable issues only. Omit style issues unless they affect readability significantly.
"""
Measure the saving:
System prompt tokens saved: (1,847 - 287) × 10 calls = 15,600 tokens saved
Percentage of baseline: 15,600 / 97,000 = 16.1%
Running total saved: ~16%
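Whatever prompt you end up with, measure it directly rather than estimating. A minimal sketch using the `count_tokens` helper from Step 0 (exact counts depend on the full prompt text and tokenizer version):

```python
# Compare the two system prompts with the tokenizer configured in Step 0.
baseline_len = count_tokens(BASELINE_SYSTEM_PROMPT)
focused_len = count_tokens(FOCUSED_SYSTEM_PROMPT)
calls = 10  # system prompt sends counted in the benchmark run

print(f"Baseline prompt: {baseline_len} tokens, focused prompt: {focused_len} tokens")
print(f"Saved per benchmark run: {(baseline_len - focused_len) * calls:,} tokens")
```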
Optimization 2: File-by-File Scoped Sessions (Target: -25% additional)
The baseline sends ALL files to every review iteration. Replace this with scoped sessions — each file is reviewed in complete isolation.
def review_single_file_scoped(file_path: str, file_content: str) -> dict:
"""Review exactly one file in a fresh, scoped context."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT), # 287 tokens (not 1,847)
HumanMessage(content=f"""
Review this file and return a JSON findings object.
File: {file_path}
```python
{file_content}
```
""")
]
# Token cost: 287 (sys) + ~file_tokens + ~200 (task) = scoped
# NOT: 1,847 + all_other_files + all_prior_comments
response = llm.invoke(messages)
return parse_json_response(response.content)
def scoped_review_all_files(files: list[dict]) -> list[dict]:
return [
review_single_file_scoped(f["path"], f["content"])
for f in files
]
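The `parse_json_response` helper used above is assumed but not shown; a minimal sketch that tolerates a markdown code fence around the model's JSON output:

```python
import json

def parse_json_response(text: str) -> dict:
    """Parse a JSON findings object, tolerating a surrounding markdown code fence."""
    cleaned = text.strip()
    fence = "`" * 3
    if cleaned.startswith(fence):
        cleaned = cleaned.split("\n", 1)[-1]   # drop the opening fence line
        cleaned = cleaned.rsplit(fence, 1)[0]  # drop the closing fence
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fail soft: return an empty findings object instead of crashing the review loop
        return {"file": None, "issues": [], "summary": cleaned[:200]}
```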
Measure the saving vs. baseline:
Baseline per-iteration context:
- System prompt: 1,847 tokens
- All 8 files (avg 1,200 tokens each): 9,600 tokens
- Accumulating prior comments: 0 to 8,000 tokens
Average per iteration: ~15,300 tokens input
× 8 iterations: ~122,400 tokens
Optimized per-iteration context:
- System prompt: 287 tokens
- Single file (avg 1,200 tokens): 1,200 tokens
- Task spec: 200 tokens
Average per iteration: ~1,687 tokens input
× 8 iterations: ~13,500 tokens
Input token savings from scoping: ~108,900 tokens
(The ~122,400 figure supersedes the ~85,000 input estimate from Step 0: once per-iteration comment accumulation is counted, baseline input cost is higher than the initial benchmark table suggested.)
Key insight: running total savings now ~40%+ just from Opts 1 and 2
Tip: The scoped session pattern (Optimization 2) is consistently the highest-impact optimization for file-processing agents. The reason is mathematical: baseline sends N files × average_file_size tokens on each of N iterations = O(N²) cost, while scoped sends 1 file per iteration = O(N) cost. For 8 files, that is a 7x reduction in file-content tokens alone.
Optimization 3: Structured Output with Compressed Handoffs (Target: -10% additional)
The baseline review comments are verbose prose (avg 800 tokens each), and they accumulate in state. Replace with the structured JSON output already introduced in Optimization 1.
Now enforce that only structured summaries (not full prose) are passed forward:
class OptimizedState(TypedDict):
repo_path: str
files_to_process: List[dict] # Consumed one at a time, not stored en masse
findings: List[dict] # Compact JSON findings (avg ~120 tokens each)
final_report: str
token_ledger: dict
def accumulate_findings_node(state: OptimizedState) -> OptimizedState:
"""Process next file, store only compact findings."""
if not state["files_to_process"]:
return state
next_file = state["files_to_process"].pop(0)
finding = review_single_file_scoped(next_file["path"], next_file["content"])
# Store the compact finding (~120 tokens), not the verbose output (~800 tokens)
state["findings"].append(finding)
return state
For the report generation step, the orchestrator now receives only compact findings:
def generate_optimized_report(findings: list[dict]) -> str:
"""Generate report from structured findings — minimal context."""
import json
llm = ChatOpenAI(model="gpt-4o", temperature=0)
findings_json = json.dumps(findings, indent=2)
# findings_json is ~960 tokens (8 findings × 120 tokens)
# vs. baseline: ~6,400 tokens of verbose comments
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT),
HumanMessage(content=f"""
Produce a code review report from these structured findings.
Group by severity. Include an executive summary and prioritized action items.
Findings:
{findings_json}
""")
]
response = llm.invoke(messages)
return response.content
Token saving from this optimization:
Report generation input:
Baseline: 1,847 (sys) + 9,600 (all files) + 6,400 (verbose comments) = 17,847
Optimized: 287 (sys) + 960 (compact findings) + 300 (task) = 1,547
Savings: 16,300 tokens on report generation alone
Across all iterations, structured handoffs save:
(800 - 120) tokens × 8 files = 5,440 tokens in accumulated state
Because sessions are already scoped, these savings are realized once at the final report step rather than re-sent on every iteration
Running cumulative savings vs. full baseline: ~47%
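If you prefer to keep the LangGraph structure from the baseline rather than plain function calls, the optimized state and nodes wire together the same way. A minimal sketch, assuming the definitions above and the StateGraph/END imports from Step 0:

```python
def generate_report_node_optimized(state: OptimizedState) -> OptimizedState:
    state["final_report"] = generate_optimized_report(state["findings"])
    return state

def has_files_remaining(state: OptimizedState) -> str:
    return "accumulate_findings" if state["files_to_process"] else "generate_report"

optimized_builder = StateGraph(OptimizedState)
optimized_builder.add_node("accumulate_findings", accumulate_findings_node)
optimized_builder.add_node("generate_report", generate_report_node_optimized)
optimized_builder.set_entry_point("accumulate_findings")
optimized_builder.add_conditional_edges(
    "accumulate_findings",
    has_files_remaining,
    {"accumulate_findings": "accumulate_findings", "generate_report": "generate_report"},
)
optimized_builder.add_edge("generate_report", END)
optimized_graph = optimized_builder.compile()

# The initial state must already carry the discovered files and an empty findings list:
# optimized_graph.invoke({"repo_path": path, "files_to_process": files, "findings": [],
#                         "final_report": "", "token_ledger": {}})
```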
Optimization 4: Tool Result Caching (Target: -5% on repeated runs)
For the file reading step, implement caching based on file path + modification time:
import os
import hashlib
file_content_cache = {}
def read_file_cached(filepath: str) -> str:
"""Read file with caching based on path + mtime."""
mtime = os.path.getmtime(filepath)
cache_key = f"{filepath}:{mtime}"
if cache_key not in file_content_cache:
with open(filepath, 'r') as f:
file_content_cache[cache_key] = f.read()
return file_content_cache[cache_key]
review_result_cache = {}
def review_with_cache(file_path: str, file_content: str) -> dict:
"""Review file with result caching — skip if file unchanged."""
content_hash = hashlib.md5(file_content.encode()).hexdigest()
cache_key = f"review:{content_hash}"
if cache_key in review_result_cache:
print(f"Cache hit for {file_path}")
return review_result_cache[cache_key]
result = review_single_file_scoped(file_path, file_content)
review_result_cache[cache_key] = result
return result
On the first run, caching provides no token savings (cold cache). On re-runs (e.g., re-running the review after fixing one file), caching eliminates all LLM calls for unchanged files.
For a project where 2 of 8 files changed between runs:
Without cache: 8 review calls × ~1,687 tokens = 13,496 tokens
With cache: 2 review calls × ~1,687 tokens = 3,374 tokens
Savings on re-run: 10,122 tokens (75% reduction for re-runs)
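The caches above live only in process memory, so a fresh process always starts cold. A minimal sketch of persisting review results in Redis (installed in Step 0), assuming a Redis server on localhost; connection details are placeholders:

```python
import hashlib
import json
import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def review_with_persistent_cache(file_path: str, file_content: str) -> dict:
    """Like review_with_cache, but findings survive across processes via Redis."""
    content_hash = hashlib.md5(file_content.encode()).hexdigest()
    cache_key = f"review:{content_hash}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        print(f"Cache hit for {file_path}")
        return json.loads(cached)
    result = review_single_file_scoped(file_path, file_content)
    # Expire entries after a week so stale reviews do not linger indefinitely
    redis_client.set(cache_key, json.dumps(result), ex=7 * 24 * 3600)
    return result
```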
Optimization 5: Early Termination on Empty Findings (Target: -3% average)
Add exit conditions that skip the full review loop for trivial files:
def quick_triage(file_path: str, file_content: str) -> str:
"""Fast triage to classify file before full review."""
line_count = len(file_content.splitlines())
# Skip files that are trivially simple
if line_count < 15:
return "skip" # Config files, __init__.py, etc.
# Skip files that are likely auto-generated
autogen_markers = ["# AUTO-GENERATED", "# DO NOT EDIT", "# Generated by"]
if any(marker in file_content[:500] for marker in autogen_markers):
return "skip"
return "review"
def optimized_review_loop(files: list[dict]) -> list[dict]:
"""Review loop with early termination for trivial files."""
findings = []
skipped = 0
for file_info in files:
triage_result = quick_triage(file_info["path"], file_info["content"])
if triage_result == "skip":
skipped += 1
findings.append({
"file": file_info["path"],
"issues": [],
"summary": "File skipped (trivial or auto-generated)",
"skipped": True
})
continue
finding = review_with_cache(file_info["path"], file_info["content"])
findings.append(finding)
print(f"Reviewed {len(files) - skipped} files, skipped {skipped}")
return findings
In a typical 8-file Python project, one or two files are trivial (__init__.py, config constants):
2 files skipped × 1,687 tokens = 3,374 tokens saved
As percentage of optimized total: ~4%
Tip: The triage step itself costs tokens (you are running logic on the file). Keep triage logic purely in Python (no LLM call) wherever possible. Line count thresholds, marker string searches, and file name patterns are all computable without LLM tokens. Only use a lightweight LLM call for triage if the Python-only criteria are insufficient for your use case.
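If the Python-only rules genuinely cannot classify a file, keep the fallback to a single short call on a cheap model. A minimal sketch, assuming `gpt-4o-mini` is available on your account:

```python
def llm_triage(file_path: str, file_content: str) -> str:
    """One short, cheap call for files the Python-only triage rules cannot classify."""
    triage_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    messages = [
        SystemMessage(content="Answer with exactly one word: review or skip."),
        HumanMessage(content=(
            f"Is this Python file worth a detailed code review?\n"
            f"File: {file_path}\n"
            f"First 40 lines:\n" + "\n".join(file_content.splitlines()[:40])
        )),
    ]
    answer = triage_llm.invoke(messages).content.strip().lower()
    return "skip" if "skip" in answer else "review"
```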
Optimization 6: Parallel Execution of Independent Reviews (Latency Reduction)
While not directly reducing total tokens, parallelizing independent file reviews reduces wall-clock time significantly and can be combined with streaming to improve user-perceived performance:
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
async def review_file_async(file_info: dict) -> dict:
"""Async version of scoped file review."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT),
HumanMessage(content=f"File: {file_info['path']}\n\n```python\n{file_info['content']}\n```")
]
response = await llm.ainvoke(messages)
return parse_json_response(response.content)
async def parallel_review(files: list[dict]) -> list[dict]:
"""Review all files in parallel — same total tokens, much less latency."""
# Batch to avoid rate limits: process 3 files at a time
batch_size = 3
all_findings = []
for i in range(0, len(files), batch_size):
batch = files[i:i + batch_size]
batch_results = await asyncio.gather(*[
review_file_async(f) for f in batch
])
all_findings.extend(batch_results)
return all_findings
Baseline time (sequential, no parallelism): ~45 seconds
Optimized time (parallel batches of 3): ~18 seconds
Time reduction: ~60%
Token count: unchanged (parallelism doesn't reduce tokens)
Putting It All Together: The Fully Optimized Workflow
import asyncio
import glob
from typing import TypedDict, List
class OptimizedReviewState(TypedDict):
repo_path: str
file_findings: List[dict]
final_report: str
token_metrics: dict
def build_optimized_review_workflow():
"""Assembles all 6 optimizations into a complete workflow."""
async def run_optimized_review(repo_path: str) -> dict:
# Step 1: Discover files
all_files = []
for filepath in glob.glob(f"{repo_path}/**/*.py", recursive=True):
content = read_file_cached(filepath)
all_files.append({"path": filepath, "content": content})
# Step 2: Triage (Python-only, no LLM tokens)
review_queue = [f for f in all_files
if quick_triage(f["path"], f["content"]) == "review"]
# Step 3: Parallel scoped reviews (Optimizations 1, 2, 3, 5)
findings = await parallel_review(review_queue)
# Step 4: Compact report generation (Optimization 3)
final_report = generate_optimized_report(findings)
return {
"findings": findings,
"final_report": final_report
}
return run_optimized_review
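Running the assembled workflow is a single asyncio call; the repo path below is a placeholder:

```python
# Execute the fully optimized workflow end to end.
run_review = build_optimized_review_workflow()
result = asyncio.run(run_review("./sample_project"))  # hypothetical path

print(f"Files with findings: {len(result['findings'])}")
print(result["final_report"][:500])  # preview the report
```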
Final Measurement: Comparing Baseline vs. Optimized
Run both workflows on the same 8-file benchmark project:
| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| System prompt tokens | 18,470 | 2,870 | -84.4% |
| File content tokens (input) | 62,400 | 9,600 | -84.6% |
| Accumulated comment tokens | 14,400 | 960 | -93.3% |
| Report generation tokens | 17,847 | 1,547 | -91.3% |
| Total input tokens | 85,000 | 14,977 | -82.4% |
| Output tokens | 12,000 | 4,800 | -60.0% |
| Total tokens | 97,000 | 19,777 | -79.6% |
| Wall-clock time | ~45 sec | ~18 sec | -60% |
| Output quality | Baseline | Equivalent | — |
The optimized workflow delivers a 79.6% token reduction — far exceeding the 40% target — while producing equivalent output quality. The key contributors:
- Focused system prompt: -16% of baseline total
- Scoped sessions (no cross-file accumulation): -45% of baseline total
- Structured handoffs: -12% of baseline total
- Early triage termination: -4% of baseline total
Tip: The 40% target in this exercise is intentionally conservative. Well-applied optimizations on typical agentic workflows routinely produce 60–80% token reductions. The 40% framing is useful for stakeholder conversations — it is a credible, conservative floor. When you achieve 70%+, you have room to trade some efficiency back for additional quality checks or more thorough analysis while still meeting the 40% target.
Exercise Variations for Different Personas
For Software Engineers
Extend the optimized workflow to handle multi-language repositories (Python + JavaScript + Go). The key constraint: each language's review agent should have a language-specific scoped system prompt, not a generic polyglot prompt. Measure whether the per-language specialization produces better findings quality vs. a single generic reviewer.
For QA Engineers
Adapt the pattern to an automated test generation agent. The baseline generates tests for all functions in one large session. Optimize it using the same six techniques, with the additional constraint that the test runner results must be incorporated into the review loop (making caching conditional on test pass/fail status).
For Product Managers
The same six optimizations apply to a product requirements analysis agent. Baseline: upload a 50-page PRD and ask the agent to extract all requirements, identify gaps, and prioritize features — in one large session. Optimized: decompose into section-by-section extraction (scoped), compact JSON requirement objects (structured handoffs), and a final synthesis from compact requirements (thin orchestrator). Document the before/after token costs and present them as an ROI argument for adopting the optimized approach.
Troubleshooting Common Issues in the Optimization Process
Issue: Structured JSON output is malformed 15% of the time
Solution: Use a schema-enforced output parser with retry logic. Do not simply json.loads() the response.
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import RetryOutputParser
base_parser = JsonOutputParser(pydantic_object=FindingsSchema)
retry_parser = RetryOutputParser.from_llm(parser=base_parser, llm=llm)
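The snippet assumes a `FindingsSchema` pydantic model and an `llm` instance that are not shown; a minimal sketch of a schema matching the JSON shape in the focused prompt:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class Issue(BaseModel):
    severity: str                     # "high" | "medium" | "low"
    category: str                     # "security" | "performance" | "quality" | "testing"
    line: Optional[int] = None
    description: str
    fix: str

class FindingsSchema(BaseModel):
    file: str
    issues: List[Issue] = Field(default_factory=list)
    summary: str
```

When parsing fails, retry_parser.parse_with_prompt(raw_output, prompt_value) asks the LLM to repair the malformed output rather than discarding it.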
Issue: Cache hit rate is lower than expected
Solution: Add argument normalization before computing cache keys. File paths that differ only in ./ prefix or absolute vs. relative form are common sources of false cache misses.
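A small normalization helper applied before building cache keys removes most of these false misses, for example:

```python
import os

def normalize_path(filepath: str) -> str:
    """Canonicalize a path so './src/app.py' and its absolute form hit the same cache entry."""
    return os.path.normcase(os.path.realpath(filepath))
```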
Issue: Parallel requests are hitting rate limits
Solution: Implement exponential backoff and reduce batch size. Start with batch_size=2 for GPT-4 class models and batch_size=5 for faster/cheaper models.
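A minimal backoff wrapper around the async review call; the broad `except Exception` is a placeholder, and should be narrowed to your client library's rate-limit exception in production:

```python
import asyncio
import random

async def review_with_backoff(file_info: dict, max_retries: int = 5) -> dict:
    """Retry review_file_async with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await review_file_async(file_info)
        except Exception as exc:  # placeholder: catch only rate-limit errors in production
            if attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"Retrying {file_info['path']} in {delay:.1f}s after: {exc}")
            await asyncio.sleep(delay)
```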
Issue: Some files are being incorrectly skipped by triage
Solution: Log all triage decisions and review the skip list after each run. Adjust the line count threshold based on your codebase's characteristics — the 15-line threshold may be too aggressive for some projects.
Tip: When deploying the optimized workflow to production, implement a "shadow mode" for the first two weeks: run both the baseline and optimized workflows on the same inputs, compare outputs, and measure quality divergence. Acceptable quality thresholds depend on your use case (security review agents should have near-zero quality divergence; style review agents can tolerate more). Shadow mode builds organizational trust in the optimization and catches edge cases before they affect users.
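A minimal sketch of a shadow-mode comparison, assuming the baseline graph and optimized workflow defined earlier; the report-diff ratio is a crude placeholder for whatever quality measure your team trusts:

```python
import asyncio
import difflib

def shadow_compare(repo_path: str) -> dict:
    """Run baseline and optimized workflows on the same repo and report divergence."""
    baseline_result = baseline_graph.invoke({
        "repo_path": repo_path,
        "files": [],
        "current_file_index": 0,
        "review_comments": [],
        "final_report": "",
        "total_tokens_used": 0,
    })
    optimized_run = build_optimized_review_workflow()
    optimized_result = asyncio.run(optimized_run(repo_path))

    similarity = difflib.SequenceMatcher(
        None, baseline_result["final_report"], optimized_result["final_report"]
    ).ratio()
    return {
        "report_similarity": round(similarity, 2),
        "baseline_tokens": baseline_result["total_tokens_used"],
    }
```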
Summary
This hands-on exercise demonstrated that applying the techniques from this module in sequence produces dramatic token reductions well beyond the 40% target:
- Optimization 1 (focused system prompt): eliminates the largest fixed per-iteration tax
- Optimization 2 (scoped sessions): eliminates the quadratic accumulation of cross-file context
- Optimization 3 (structured handoffs): compresses inter-step communication by 85%+
- Optimization 4 (caching): eliminates redundant work on re-runs
- Optimization 5 (early termination): skips trivially simple files
- Optimization 6 (parallelism): reduces latency without changing token count
The optimizations are compositional — each one builds on the others. Applied together, they transform an unoptimized agentic workflow from a token-expensive prototype into a production-ready system that delivers the same quality results at a fraction of the cost.
The measurement framework established in Step 0 (baseline instrumentation) is as important as any individual optimization. Without it, you are optimizing by intuition. With it, you are making data-driven architectural decisions — the engineering foundation for sustainable, cost-efficient agentic systems.