Overview and Setup
This hands-on topic is structured as a step-by-step optimization exercise. You are given a real, working agentic workflow — a code review agent — and your task is to apply the techniques from this module to reduce its token consumption by at least 40% without changing what it produces.
The exercise uses LangGraph as the primary framework with LangChain tool definitions. The concepts apply directly to CrewAI, AutoGen, and OpenAI Assistants.
Baseline workflow: A code review agent that:
1. Takes a GitHub repository URL
2. Reads all Python files
3. Analyzes each file for code quality issues (complexity, naming, error handling, test coverage)
4. Generates a comprehensive code review report
5. Suggests specific improvements
By the end of this topic you will have applied six optimizations, measured each one, and produced a version of the workflow that comfortably clears the 40% reduction target on the benchmark task (the worked example below lands near 80%).
Step 0: Establish the Baseline
Before optimizing, instrument the baseline so you can measure your progress. Set up your environment:
pip install langchain langchain-openai langgraph tiktoken sentence-transformers redis
Baseline Workflow Code
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def count_tokens(text: str) -> int:
return len(enc.encode(str(text)))
BASELINE_SYSTEM_PROMPT = """
You are an expert software engineer with 20 years of experience in Python development,
software architecture, code quality, testing, and security. You have deep expertise in:
- PEP 8 and PEP 20 style guidelines
- Object-oriented design patterns and SOLID principles
- Functional programming patterns in Python
- Testing methodologies including TDD, BDD, unit testing, integration testing
- Security best practices including OWASP guidelines
- Performance optimization and profiling
- Code maintainability and technical debt
- Documentation standards including Google docstrings, Numpy docstrings, Sphinx
- Type annotations and mypy
- Async Python programming
- Database access patterns and ORM usage
- API design principles (REST, GraphQL, gRPC)
- Microservices architecture
- Container and deployment patterns
When reviewing code, you should:
1. Read through each file carefully and completely
2. Consider the file in the context of all other files you have read
3. Think about architectural implications
4. Consider edge cases and error scenarios
5. Evaluate test coverage
6. Assess documentation quality
7. Review security implications
8. Consider performance characteristics
9. Evaluate code maintainability
10. Assess adherence to team conventions
Provide thorough, detailed feedback that covers all aspects of code quality.
Always explain your reasoning. Always provide specific line references.
Always provide concrete suggestions for improvement.
After reviewing ALL files, produce a comprehensive report.
"""
class BaselineAgentState(TypedDict):
repo_path: str
files: List[dict] # Holds ALL file contents throughout session
current_file_index: int
review_comments: List[str] # Raw text comments accumulate
final_report: str
total_tokens_used: int
def read_all_files_node(state: BaselineAgentState) -> BaselineAgentState:
"""Read all Python files upfront and store them ALL in state."""
import glob
files = []
for filepath in glob.glob(f"{state['repo_path']}/**/*.py", recursive=True):
with open(filepath, 'r') as f:
content = f.read()
files.append({
"path": filepath,
"content": content # FULL CONTENT stored in state
})
state["files"] = files
state["current_file_index"] = 0
return state
def review_file_node(state: BaselineAgentState) -> BaselineAgentState:
"""Review one file — but sends ALL prior context (all files + all comments)."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
idx = state["current_file_index"]
current_file = state["files"][idx]
# PROBLEM: All prior files and comments are in the message history
# This is sent on every iteration
messages = [
SystemMessage(content=BASELINE_SYSTEM_PROMPT),
HumanMessage(content=f"You are reviewing a Python repository. "
f"Here are ALL the files in the project: "
f"{[f['content'] for f in state['files']]}"), # ALL files every time
HumanMessage(content=f"Previous review comments: {state['review_comments']}"),
HumanMessage(content=f"Now provide detailed review for: {current_file['path']}\n"
f"{current_file['content']}")
]
# Track tokens
input_tokens = sum(count_tokens(str(m.content)) for m in messages)
response = llm.invoke(messages)
output_tokens = count_tokens(response.content)
state["review_comments"].append(response.content) # Raw verbose output accumulates
state["current_file_index"] += 1
state["total_tokens_used"] += input_tokens + output_tokens
return state
def generate_report_node(state: BaselineAgentState) -> BaselineAgentState:
"""Generate final report — sends everything."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=BASELINE_SYSTEM_PROMPT),
HumanMessage(content=f"All files reviewed: {[f['content'] for f in state['files']]}"),
HumanMessage(content=f"All review comments: {state['review_comments']}"),
HumanMessage(content="Produce the final comprehensive code review report.")
]
input_tokens = sum(count_tokens(str(m.content)) for m in messages)
response = llm.invoke(messages)
output_tokens = count_tokens(response.content)
state["final_report"] = response.content
state["total_tokens_used"] += input_tokens + output_tokens
return state
def should_continue_reviewing(state: BaselineAgentState):
if state["current_file_index"] >= len(state["files"]):
return "generate_report"
return "review_file"
baseline_builder = StateGraph(BaselineAgentState)
baseline_builder.add_node("read_files", read_all_files_node)
baseline_builder.add_node("review_file", review_file_node)
baseline_builder.add_node("generate_report", generate_report_node)
baseline_builder.set_entry_point("read_files")
baseline_builder.add_edge("read_files", "review_file")
baseline_builder.add_conditional_edges(
"review_file",
should_continue_reviewing,
{"review_file": "review_file", "generate_report": "generate_report"}
)
baseline_builder.add_edge("generate_report", END)
baseline_graph = baseline_builder.compile()
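Before benchmarking, make sure you can run the compiled graph end to end. A minimal invocation sketch, assuming a hypothetical local checkout at `./sample_project`:

```python
# Run the baseline once and print its token usage.
initial_state = {
    "repo_path": "./sample_project",  # placeholder: point at your benchmark repo
    "files": [],
    "current_file_index": 0,
    "review_comments": [],
    "final_report": "",
    "total_tokens_used": 0,
}

result = baseline_graph.invoke(initial_state)
print(f"Total tokens used: {result['total_tokens_used']:,}")
print(result["final_report"][:500])  # preview the report
```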
Benchmark: Run Baseline and Record
Run the baseline on a representative 8-file Python project and record:
| Metric | Baseline Value |
|---|---|
| Total input tokens | ~85,000 |
| Total output tokens | ~12,000 |
| Total tokens | ~97,000 |
| Iterations | 10 (8 reviews + 1 read + 1 report) |
| Completion time | ~45 seconds |
Tip: Before optimizing, run your baseline workflow on at least three different representative tasks and average the token counts. Optimizations that look impressive on one task can deliver much smaller gains on others. Use the average as your baseline target to beat.
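A minimal sketch of that averaging step, reusing the invocation pattern above; the benchmark repo paths are hypothetical placeholders:

```python
# Average baseline token usage across several representative projects.
benchmark_repos = ["./bench/api_service", "./bench/cli_tool", "./bench/data_pipeline"]

totals = []
for repo in benchmark_repos:
    state = {
        "repo_path": repo,
        "files": [],
        "current_file_index": 0,
        "review_comments": [],
        "final_report": "",
        "total_tokens_used": 0,
    }
    result = baseline_graph.invoke(state)
    totals.append(result["total_tokens_used"])
    print(f"{repo}: {result['total_tokens_used']:,} tokens")

baseline_to_beat = sum(totals) / len(totals)
print(f"Average baseline to beat: {baseline_to_beat:,.0f} tokens")
```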
Optimization 1: Focused System Prompt (Target: -15% tokens)
The baseline system prompt is 1,847 tokens. It is re-sent on every LLM call. For 10 calls, that is 18,470 tokens just for the system prompt.
Replace it with a focused, role-specific prompt:
FOCUSED_SYSTEM_PROMPT = """
You are a Python code reviewer. For each file provided, return a JSON object:
{
"file": "<filename>",
"issues": [
{"severity": "high|medium|low", "category": "security|performance|quality|testing",
"line": <line_number>, "description": "<concise description>", "fix": "<specific fix>"}
],
"summary": "<2-3 sentence overall assessment>"
}
Focus on actionable issues only. Omit style issues unless they affect readability significantly.
"""
Measure the saving:
System prompt tokens saved: (1,847 - 287) × 10 calls = 15,600 tokens saved
Percentage of baseline: 15,600 / 97,000 = 16.1%
Running total saved: ~16%
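Whatever prompt you end up with, measure it directly rather than estimating. A minimal sketch using the `count_tokens` helper from Step 0 (exact counts depend on the full prompt text and tokenizer version):

```python
# Compare the two system prompts with the tokenizer configured in Step 0.
baseline_len = count_tokens(BASELINE_SYSTEM_PROMPT)
focused_len = count_tokens(FOCUSED_SYSTEM_PROMPT)
calls = 10  # system prompt sends counted in the benchmark run

print(f"Baseline prompt: {baseline_len} tokens, focused prompt: {focused_len} tokens")
print(f"Saved per benchmark run: {(baseline_len - focused_len) * calls:,} tokens")
```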
Optimization 2: File-by-File Scoped Sessions (Target: -25% additional)
The baseline sends ALL files to every review iteration. Replace this with scoped sessions — each file is reviewed in complete isolation.
def review_single_file_scoped(file_path: str, file_content: str) -> dict:
"""Review exactly one file in a fresh, scoped context."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT), # 287 tokens (not 1,847)
HumanMessage(content=f"""
Review this file and return a JSON findings object.
File: {file_path}
```python
{file_content}
```
""")
]
# Token cost: 287 (sys) + ~file_tokens + ~200 (task) = scoped
# NOT: 1,847 + all_other_files + all_prior_comments
response = llm.invoke(messages)
return parse_json_response(response.content)
def scoped_review_all_files(files: list[dict]) -> list[dict]:
return [
review_single_file_scoped(f["path"], f["content"])
for f in files
]
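The `parse_json_response` helper used above is assumed but not shown; a minimal sketch that tolerates a markdown code fence around the model's JSON output:

```python
import json

def parse_json_response(text: str) -> dict:
    """Parse a JSON findings object, tolerating a surrounding markdown code fence."""
    cleaned = text.strip()
    fence = "`" * 3
    if cleaned.startswith(fence):
        cleaned = cleaned.split("\n", 1)[-1]   # drop the opening fence line
        cleaned = cleaned.rsplit(fence, 1)[0]  # drop the closing fence
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fail soft: return an empty findings object instead of crashing the review loop
        return {"file": None, "issues": [], "summary": cleaned[:200]}
```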
Measure the saving vs. baseline:
Baseline per-iteration context:
- System prompt: 1,847 tokens
- All 8 files (avg 1,200 tokens each): 9,600 tokens
- Accumulating prior comments: 0 to 8,000 tokens
Average per iteration: ~15,300 tokens input
× 8 iterations: ~122,400 tokens
Optimized per-iteration context:
- System prompt: 287 tokens
- Single file (avg 1,200 tokens): 1,200 tokens
- Task spec: 200 tokens
Average per iteration: ~1,687 tokens input
× 8 iterations: ~13,500 tokens
Input token savings from scoping: ~108,900 tokens
(The ~122,400 figure supersedes the ~85,000 input estimate from Step 0: once per-iteration comment accumulation is counted, baseline input cost is higher than the initial benchmark table suggested.)
Key insight: running total savings now ~40%+ just from Opts 1 and 2
Tip: The scoped session pattern (Optimization 2) is consistently the highest-impact optimization for file-processing agents. The reason is mathematical: baseline sends N files × average_file_size tokens on each of N iterations = O(N²) cost, while scoped sends 1 file per iteration = O(N) cost. For 8 files, that is a 7x reduction in file-content tokens alone.
Optimization 3: Structured Output with Compressed Handoffs (Target: -10% additional)
The baseline review comments are verbose prose (avg 800 tokens each), and they accumulate in state. Replace with the structured JSON output already introduced in Optimization 1.
Now enforce that only structured summaries (not full prose) are passed forward:
class OptimizedState(TypedDict):
repo_path: str
files_to_process: List[dict] # Consumed one at a time, not stored en masse
findings: List[dict] # Compact JSON findings (avg ~120 tokens each)
final_report: str
token_ledger: dict
def accumulate_findings_node(state: OptimizedState) -> OptimizedState:
"""Process next file, store only compact findings."""
if not state["files_to_process"]:
return state
next_file = state["files_to_process"].pop(0)
finding = review_single_file_scoped(next_file["path"], next_file["content"])
# Store the compact finding (~120 tokens), not the verbose output (~800 tokens)
state["findings"].append(finding)
return state
For the report generation step, the orchestrator now receives only compact findings:
def generate_optimized_report(findings: list[dict]) -> str:
"""Generate report from structured findings — minimal context."""
import json
llm = ChatOpenAI(model="gpt-4o", temperature=0)
findings_json = json.dumps(findings, indent=2)
# findings_json is ~960 tokens (8 findings × 120 tokens)
# vs. baseline: ~6,400 tokens of verbose comments
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT),
HumanMessage(content=f"""
Produce a code review report from these structured findings.
Group by severity. Include an executive summary and prioritized action items.
Findings:
{findings_json}
""")
]
response = llm.invoke(messages)
return response.content
Token saving from this optimization:
Report generation input:
Baseline: 1,847 (sys) + 9,600 (all files) + 6,400 (verbose comments) = 17,847
Optimized: 287 (sys) + 960 (compact findings) + 300 (task) = 1,547
Savings: 16,300 tokens on report generation alone
Across all iterations, structured handoffs save:
(800 - 120) tokens × 8 files = 5,440 tokens in accumulated state
Because sessions are already scoped, these savings are realized once at the final report step rather than re-sent on every iteration
Running cumulative savings vs. full baseline: ~47%
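If you prefer to keep the LangGraph structure from the baseline rather than plain function calls, the optimized state and nodes wire together the same way. A minimal sketch, assuming the definitions above and the StateGraph/END imports from Step 0:

```python
def generate_report_node_optimized(state: OptimizedState) -> OptimizedState:
    state["final_report"] = generate_optimized_report(state["findings"])
    return state

def has_files_remaining(state: OptimizedState) -> str:
    return "accumulate_findings" if state["files_to_process"] else "generate_report"

optimized_builder = StateGraph(OptimizedState)
optimized_builder.add_node("accumulate_findings", accumulate_findings_node)
optimized_builder.add_node("generate_report", generate_report_node_optimized)
optimized_builder.set_entry_point("accumulate_findings")
optimized_builder.add_conditional_edges(
    "accumulate_findings",
    has_files_remaining,
    {"accumulate_findings": "accumulate_findings", "generate_report": "generate_report"},
)
optimized_builder.add_edge("generate_report", END)
optimized_graph = optimized_builder.compile()

# The initial state must already carry the discovered files and an empty findings list:
# optimized_graph.invoke({"repo_path": path, "files_to_process": files, "findings": [],
#                         "final_report": "", "token_ledger": {}})
```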
Optimization 4: Tool Result Caching (Target: -5% on repeated runs)
For the file reading step, implement caching based on file path + modification time:
import os
import hashlib
file_content_cache = {}
def read_file_cached(filepath: str) -> str:
"""Read file with caching based on path + mtime."""
mtime = os.path.getmtime(filepath)
cache_key = f"{filepath}:{mtime}"
if cache_key not in file_content_cache:
with open(filepath, 'r') as f:
file_content_cache[cache_key] = f.read()
return file_content_cache[cache_key]
review_result_cache = {}
def review_with_cache(file_path: str, file_content: str) -> dict:
"""Review file with result caching — skip if file unchanged."""
content_hash = hashlib.md5(file_content.encode()).hexdigest()
cache_key = f"review:{content_hash}"
if cache_key in review_result_cache:
print(f"Cache hit for {file_path}")
return review_result_cache[cache_key]
result = review_single_file_scoped(file_path, file_content)
review_result_cache[cache_key] = result
return result
On the first run, caching provides no token savings (cold cache). On re-runs (e.g., re-running the review after fixing one file), caching eliminates all LLM calls for unchanged files.
For a project where 2 of 8 files changed between runs:
Without cache: 8 review calls × ~1,687 tokens = 13,496 tokens
With cache: 2 review calls × ~1,687 tokens = 3,374 tokens
Savings on re-run: 10,122 tokens (75% reduction for re-runs)
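The caches above live only in process memory, so a fresh process always starts cold. A minimal sketch of persisting review results in Redis (installed in Step 0), assuming a Redis server on localhost; connection details are placeholders:

```python
import hashlib
import json
import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def review_with_persistent_cache(file_path: str, file_content: str) -> dict:
    """Like review_with_cache, but findings survive across processes via Redis."""
    content_hash = hashlib.md5(file_content.encode()).hexdigest()
    cache_key = f"review:{content_hash}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        print(f"Cache hit for {file_path}")
        return json.loads(cached)
    result = review_single_file_scoped(file_path, file_content)
    # Expire entries after a week so stale reviews do not linger indefinitely
    redis_client.set(cache_key, json.dumps(result), ex=7 * 24 * 3600)
    return result
```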
Optimization 5: Early Termination on Empty Findings (Target: -3% average)
Add exit conditions that skip the full review loop for trivial files:
def quick_triage(file_path: str, file_content: str) -> str:
"""Fast triage to classify file before full review."""
line_count = len(file_content.splitlines())
# Skip files that are trivially simple
if line_count < 15:
return "skip" # Config files, __init__.py, etc.
# Skip files that are likely auto-generated
autogen_markers = ["# AUTO-GENERATED", "# DO NOT EDIT", "# Generated by"]
if any(marker in file_content[:500] for marker in autogen_markers):
return "skip"
return "review"
def optimized_review_loop(files: list[dict]) -> list[dict]:
"""Review loop with early termination for trivial files."""
findings = []
skipped = 0
for file_info in files:
triage_result = quick_triage(file_info["path"], file_info["content"])
if triage_result == "skip":
skipped += 1
findings.append({
"file": file_info["path"],
"issues": [],
"summary": "File skipped (trivial or auto-generated)",
"skipped": True
})
continue
finding = review_with_cache(file_info["path"], file_info["content"])
findings.append(finding)
print(f"Reviewed {len(files) - skipped} files, skipped {skipped}")
return findings
In a typical 8-file Python project, one or two files are trivial (__init__.py, config constants):
2 files skipped × 1,687 tokens = 3,374 tokens saved
As percentage of optimized total: ~4%
Tip: The triage step itself costs tokens (you are running logic on the file). Keep triage logic purely in Python (no LLM call) wherever possible. Line count thresholds, marker string searches, and file name patterns are all computable without LLM tokens. Only use a lightweight LLM call for triage if the Python-only criteria are insufficient for your use case.
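If the Python-only rules genuinely cannot classify a file, keep the fallback to a single short call on a cheap model. A minimal sketch, assuming `gpt-4o-mini` is available on your account:

```python
def llm_triage(file_path: str, file_content: str) -> str:
    """One short, cheap call for files the Python-only triage rules cannot classify."""
    triage_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    messages = [
        SystemMessage(content="Answer with exactly one word: review or skip."),
        HumanMessage(content=(
            f"Is this Python file worth a detailed code review?\n"
            f"File: {file_path}\n"
            f"First 40 lines:\n" + "\n".join(file_content.splitlines()[:40])
        )),
    ]
    answer = triage_llm.invoke(messages).content.strip().lower()
    return "skip" if "skip" in answer else "review"
```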
Optimization 6: Parallel Execution of Independent Reviews (Latency Reduction)
While not directly reducing total tokens, parallelizing independent file reviews reduces wall-clock time significantly and can be combined with streaming to improve user-perceived performance:
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
async def review_file_async(file_info: dict) -> dict:
"""Async version of scoped file review."""
llm = ChatOpenAI(model="gpt-4o", temperature=0)
messages = [
SystemMessage(content=FOCUSED_SYSTEM_PROMPT),
HumanMessage(content=f"File: {file_info['path']}\n\n```python\n{file_info['content']}\n```")
]
response = await llm.ainvoke(messages)
return parse_json_response(response.content)
async def parallel_review(files: list[dict]) -> list[dict]:
"""Review all files in parallel — same total tokens, much less latency."""
# Batch to avoid rate limits: process 3 files at a time
batch_size = 3
all_findings = []
for i in range(0, len(files), batch_size):
batch = files[i:i + batch_size]
batch_results = await asyncio.gather(*[
review_file_async(f) for f in batch
])
all_findings.extend(batch_results)
return all_findings
Baseline time (sequential, no parallelism): ~45 seconds
Optimized time (parallel batches of 3): ~18 seconds
Time reduction: ~60%
Token count: unchanged (parallelism doesn't reduce tokens)
Putting It All Together: The Fully Optimized Workflow
import asyncio
import glob
from typing import TypedDict, List
class OptimizedReviewState(TypedDict):
repo_path: str
file_findings: List[dict]
final_report: str
token_metrics: dict
def build_optimized_review_workflow():
"""Assembles all 6 optimizations into a complete workflow."""
async def run_optimized_review(repo_path: str) -> dict:
# Step 1: Discover files
all_files = []
for filepath in glob.glob(f"{repo_path}/**/*.py", recursive=True):
content = read_file_cached(filepath)
all_files.append({"path": filepath, "content": content})
# Step 2: Triage (Python-only, no LLM tokens)
review_queue = [f for f in all_files
if quick_triage(f["path"], f["content"]) == "review"]
# Step 3: Parallel scoped reviews (Optimizations 1, 2, 3, 5)
findings = await parallel_review(review_queue)
# Step 4: Compact report generation (Optimization 3)
final_report = generate_optimized_report(findings)
return {
"findings": findings,
"final_report": final_report
}
return run_optimized_review
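Running the assembled workflow is a single asyncio call; the repo path below is a placeholder:

```python
# Execute the fully optimized workflow end to end.
run_review = build_optimized_review_workflow()
result = asyncio.run(run_review("./sample_project"))  # hypothetical path

print(f"Files with findings: {len(result['findings'])}")
print(result["final_report"][:500])  # preview the report
```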
Final Measurement: Comparing Baseline vs. Optimized
Run both workflows on the same 8-file benchmark project:
| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| System prompt tokens | 18,470 | 2,870 | -84.4% |
| File content tokens (input) | 62,400 | 9,600 | -84.6% |
| Accumulated comment tokens | 14,400 | 960 | -93.3% |
| Report generation tokens | 17,847 | 1,547 | -91.3% |
| Total input tokens | 85,000 | 14,977 | -82.4% |
| Output tokens | 12,000 | 4,800 | -60.0% |
| Total tokens | 97,000 | 19,777 | -79.6% |
| Wall-clock time | ~45 sec | ~18 sec | -60% |
| Output quality | Baseline | Equivalent | — |
The optimized workflow delivers a 79.6% token reduction — far exceeding the 40% target — while producing equivalent output quality. The key contributors:
- Focused system prompt: -16% of baseline total
- Scoped sessions (no cross-file accumulation): -45% of baseline total
- Structured handoffs: -12% of baseline total
- Early triage termination: -4% of baseline total
Tip: The 40% target in this exercise is intentionally conservative. Well-applied optimizations on typical agentic workflows routinely produce 60–80% token reductions. The 40% framing is useful for stakeholder conversations — it is a credible, conservative floor. When you achieve 70%+, you have room to trade some efficiency back for additional quality checks or more thorough analysis while still meeting the 40% target.
Exercise Variations for Different Personas
For Software Engineers
Extend the optimized workflow to handle multi-language repositories (Python + JavaScript + Go). The key constraint: each language's review agent should have a language-specific scoped system prompt, not a generic polyglot prompt. Measure whether the per-language specialization produces better findings quality vs. a single generic reviewer.
For QA Engineers
Adapt the pattern to an automated test generation agent. The baseline generates tests for all functions in one large session. Optimize it using the same six techniques, with the additional constraint that the test runner results must be incorporated into the review loop (making caching conditional on test pass/fail status).
For Product Managers
The same six optimizations apply to a product requirements analysis agent. Baseline: upload a 50-page PRD and ask the agent to extract all requirements, identify gaps, and prioritize features — in one large session. Optimized: decompose into section-by-section extraction (scoped), compact JSON requirement objects (structured handoffs), and a final synthesis from compact requirements (thin orchestrator). Document the before/after token costs and present them as an ROI argument for adopting the optimized approach.
Troubleshooting Common Issues in the Optimization Process
Issue: Structured JSON output is malformed 15% of the time
Solution: Use a schema-enforced output parser with retry logic. Do not simply json.loads() the response.
from langchain_core.output_parsers import JsonOutputParser
from langchain.output_parsers import RetryOutputParser
base_parser = JsonOutputParser(pydantic_object=FindingsSchema)
retry_parser = RetryOutputParser.from_llm(parser=base_parser, llm=llm)
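The snippet assumes a `FindingsSchema` pydantic model and an `llm` instance that are not shown; a minimal sketch of a schema matching the JSON shape in the focused prompt:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class Issue(BaseModel):
    severity: str                     # "high" | "medium" | "low"
    category: str                     # "security" | "performance" | "quality" | "testing"
    line: Optional[int] = None
    description: str
    fix: str

class FindingsSchema(BaseModel):
    file: str
    issues: List[Issue] = Field(default_factory=list)
    summary: str
```

When parsing fails, retry_parser.parse_with_prompt(raw_output, prompt_value) asks the LLM to repair the malformed output rather than discarding it.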
Issue: Cache hit rate is lower than expected
Solution: Add argument normalization before computing cache keys. File paths that differ only in ./ prefix or absolute vs. relative form are common sources of false cache misses.
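A small normalization helper applied before building cache keys removes most of these false misses, for example:

```python
import os

def normalize_path(filepath: str) -> str:
    """Canonicalize a path so './src/app.py' and its absolute form hit the same cache entry."""
    return os.path.normcase(os.path.realpath(filepath))
```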
Issue: Parallel requests are hitting rate limits
Solution: Implement exponential backoff and reduce batch size. Start with batch_size=2 for GPT-4 class models and batch_size=5 for faster/cheaper models.
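A minimal backoff wrapper around the async review call; the broad `except Exception` is a placeholder, and should be narrowed to your client library's rate-limit exception in production:

```python
import asyncio
import random

async def review_with_backoff(file_info: dict, max_retries: int = 5) -> dict:
    """Retry review_file_async with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await review_file_async(file_info)
        except Exception as exc:  # placeholder: catch only rate-limit errors in production
            if attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"Retrying {file_info['path']} in {delay:.1f}s after: {exc}")
            await asyncio.sleep(delay)
```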
Issue: Some files are being incorrectly skipped by triage
Solution: Log all triage decisions and review the skip list after each run. Adjust the line count threshold based on your codebase's characteristics — the 15-line threshold may be too aggressive for some projects.
Tip: When deploying the optimized workflow to production, implement a "shadow mode" for the first two weeks: run both the baseline and optimized workflows on the same inputs, compare outputs, and measure quality divergence. Acceptable quality thresholds depend on your use case (security review agents should have near-zero quality divergence; style review agents can tolerate more). Shadow mode builds organizational trust in the optimization and catches edge cases before they affect users.
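A minimal sketch of a shadow-mode comparison, assuming the baseline graph and optimized workflow defined earlier; the report-diff ratio is a crude placeholder for whatever quality measure your team trusts:

```python
import asyncio
import difflib

def shadow_compare(repo_path: str) -> dict:
    """Run baseline and optimized workflows on the same repo and report divergence."""
    baseline_result = baseline_graph.invoke({
        "repo_path": repo_path,
        "files": [],
        "current_file_index": 0,
        "review_comments": [],
        "final_report": "",
        "total_tokens_used": 0,
    })
    optimized_run = build_optimized_review_workflow()
    optimized_result = asyncio.run(optimized_run(repo_path))

    similarity = difflib.SequenceMatcher(
        None, baseline_result["final_report"], optimized_result["final_report"]
    ).ratio()
    return {
        "report_similarity": round(similarity, 2),
        "baseline_tokens": baseline_result["total_tokens_used"],
    }
```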
Summary
This hands-on exercise demonstrated that applying the techniques from this module in sequence produces dramatic token reductions well beyond the 40% target:
- Optimization 1 (focused system prompt): eliminates the largest fixed per-iteration tax
- Optimization 2 (scoped sessions): eliminates the quadratic accumulation of cross-file context
- Optimization 3 (structured handoffs): compresses inter-step communication by 85%+
- Optimization 4 (caching): eliminates redundant work on re-runs
- Optimization 5 (early termination): skips trivially simple files
- Optimization 6 (parallelism): reduces latency without changing token count
The optimizations are compositional — each one builds on the others. Applied together, they transform an unoptimized agentic workflow from a token-expensive prototype into a production-ready system that delivers the same quality results at a fraction of the cost.
The measurement framework established in Step 0 (baseline instrumentation) is as important as any individual optimization. Without it, you are optimizing by intuition. With it, you are making data-driven architectural decisions — the engineering foundation for sustainable, cost-efficient agentic systems.