Tool schemas consume tokens when they go into the model. Tool results consume tokens when they come back. Both sides of the tool use cycle have optimization potential, but tool results are often the larger problem: while a tool schema has a fixed cost you can measure at deployment time, tool results are dynamic and can vary from a few bytes to hundreds of kilobytes depending on what the tool fetches.
An agent that calls a database query tool might get back 50 rows when it needed 3. An agent that calls a file reading tool might get back 2,000 lines of code when it needed the function at line 150. An agent that calls a web search tool might get back full HTML pages when it needed a few key facts. Every token of tool output that reaches the model's context window is a token you pay for, and unlike a static system prompt, tool results are generated fresh mid-conversation, so they cannot be pre-warmed in a prompt cache: each new result is billed at full price at least once and then re-sent on every subsequent turn.
This topic gives you a complete framework for managing tool output token costs: how to measure them, where the waste hides, and the specific techniques — trimming, filtering, summarization, and structured extraction — that eliminate it.
Understanding How Tool Results Enter the Context
When a model makes a tool call, your agent code is responsible for executing the tool and returning the result. The result is added to the conversation history as a tool role message (OpenAI) or a tool_result content block (Anthropic). From that point forward, the result exists in the conversation history and is re-sent to the model on every subsequent turn for the remainder of the session.
This is the "result persistence problem": a large tool result does not cost tokens once, it costs tokens every single turn until the conversation ends or the context window is trimmed.
Consider a typical agentic coding session:
- Turn 1: User asks to review a PR. Agent calls get_pr_diff(), which returns a 4,000-token diff.
- Turn 2: Agent calls search_codebase() to find related files. Returns 1,500 tokens of results.
- Turn 3: Agent calls read_file() on a specific file. Returns 800 tokens.
- Turn 4: Agent calls run_tests(). Returns 600 tokens of test output.
By turn 5, the accumulated tool results alone are 6,900 tokens — all of which are re-sent to the model. By turn 10, if more tools are called, this compounds further.
The optimization principle is clear: minimize the token cost of each tool result at the moment it is returned, because that cost will be repeated.
Tip: Add a "tool result budget" to your agent implementation — a maximum number of tokens a single tool result is allowed to contribute to the conversation history. Anything above the budget triggers automatic summarization before the result is appended. Start with a budget of 500–1,000 tokens per result and tune based on task accuracy.
Technique 1: Result Trimming — Hard Limits and Pagination
The simplest optimization is also the most commonly overlooked: enforce hard size limits on tool results and paginate when more data is genuinely needed.
Implement result size limits at the tool layer:
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def trim_result(result: str, max_tokens: int = 800) -> str:
    """Trim a tool result to a maximum token count."""
    if count_tokens(result) <= max_tokens:
        return result
    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode(result)
    trimmed_tokens = tokens[:max_tokens]
    trimmed_text = enc.decode(trimmed_tokens)
    # Add a clear truncation notice
    return trimmed_text + f"\n\n[TRUNCATED: result exceeded {max_tokens} token limit. Use pagination parameters to retrieve more.]"
class FileReaderTool:
    def execute(self, path: str, start_line: int = 1, end_line: int | None = None) -> str:
        with open(path, 'r') as f:
            lines = f.readlines()
        total_lines = len(lines)
        # Apply line range if specified
        if end_line:
            lines = lines[start_line - 1:end_line]
        else:
            # Default: first 100 lines only
            lines = lines[start_line - 1:start_line + 99]
        result = f"File: {path} (lines {start_line}-{start_line + len(lines) - 1} of {total_lines})\n"
        result += "".join(lines)
        return trim_result(result, max_tokens=600)
import json

class DatabaseQueryTool:
    def execute(self, query: str, max_rows: int = 10) -> str:
        """Execute query with automatic result limiting."""
        results = self.db.execute(query + f" LIMIT {max_rows}")
        if not results:
            return "Query returned no results."
        # Format as compact JSON
        output = {
            "row_count": len(results),
            "data": results[:max_rows],
            "truncated": len(results) == max_rows  # Signal if there may be more
        }
        return trim_result(json.dumps(output, indent=None), max_tokens=500)
Design tools with pagination from the start:
{
  "name": "list_issues",
  "description": "List GitHub issues with optional filtering. Returns paginated results.",
  "input_schema": {
    "type": "object",
    "properties": {
      "state": {"type": "string", "enum": ["open", "closed", "all"]},
      "limit": {"type": "integer", "description": "Max results (default 10, max 50)"},
      "offset": {"type": "integer", "description": "Pagination offset (default 0)"},
      "fields": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Fields to include: id, title, state, labels, assignee, created_at"
      }
    },
    "required": []
  }
}
With a fields parameter, the model can request only the data it needs rather than receiving the full GitHub issue object (which can be 2,000+ tokens for a complex issue with embedded comments).
Tip: Design every tool that returns collections to accept a limit and fields parameter by default. The model is good at specifying which fields it needs when the option is available. Prompting the model to use field selection — "when calling tools that return structured data, request only the fields you will actually use" — reduces average tool result size by 40–60% for data-heavy tools.
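As a sketch, the tool implementation can honor fields by projecting each record down to only the requested keys before serialization (the fetch_issues_from_github helper and default field set are illustrative):

    import json

    DEFAULT_FIELDS = ["id", "title", "state"]

    def list_issues(state="open", limit=10, offset=0, fields=None):
        fields = fields or DEFAULT_FIELDS
        issues = fetch_issues_from_github(state=state)  # hypothetical helper
        page = issues[offset:offset + min(limit, 50)]
        # Project each issue down to only the requested fields before it
        # ever reaches the model's context window.
        projected = [{k: issue[k] for k in fields if k in issue} for issue in page]
        return json.dumps({"total": len(issues), "issues": projected},
                          separators=(',', ':'))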
Technique 2: Result Filtering — Returning Only Relevant Content
Trimming reduces size uniformly. Filtering reduces size intelligently by removing content that is structurally irrelevant to the agent's current task.
Server-side filtering in tool implementation:
import json
import re

class CodeSearchTool:
    def execute(
        self,
        query: str,
        include_context_lines: int = 2,
        max_results: int = 5
    ) -> str:
        """Search codebase and return only matching excerpts, not full files."""
        all_matches = self._search(query)
        if not all_matches:
            return f"No results found for query: {query}"
        # Return only the matches with minimal context, not full file contents
        results = []
        for match in all_matches[:max_results]:
            file_path = match["file"]
            line_num = match["line"]
            lines = match["file_lines"]
            start = max(0, line_num - include_context_lines - 1)
            end = min(len(lines), line_num + include_context_lines)
            excerpt = {
                "file": file_path,
                "line": line_num,
                "code": "".join(lines[start:end]).strip()
            }
            results.append(excerpt)
        # Compact JSON representation
        return json.dumps({
            "total_matches": len(all_matches),
            "shown": len(results),
            "results": results
        }, indent=None, separators=(',', ':'))
class LogAnalysisTool:
    """Filter log output to remove noise before returning to the model."""

    NOISE_PATTERNS = [
        r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ DEBUG',  # Debug lines
        r'Health check (OK|passed)',
        r'Cache (hit|miss) for',
        r'^\s*$',  # Empty lines
    ]

    def execute(self, log_file: str, level: str = "ERROR", last_n_lines: int = 100) -> str:
        with open(log_file) as f:
            lines = f.readlines()[-last_n_lines:]
        # Filter to requested log level and above
        level_priority = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3, "CRITICAL": 4}
        min_priority = level_priority.get(level, 2)
        filtered_lines = []
        for line in lines:
            # Remove noise patterns
            if any(re.search(p, line) for p in self.NOISE_PATTERNS):
                continue
            # Keep only lines at or above the requested level
            for lvl, priority in level_priority.items():
                if lvl in line and priority >= min_priority:
                    filtered_lines.append(line)
                    break
        if not filtered_lines:
            return f"No {level}+ log entries found in last {last_n_lines} lines."
        return f"[{len(filtered_lines)} entries at {level}+ level]\n" + "".join(filtered_lines)
Tip: The most impactful filtering is often the removal of noise that the tool source always includes but the model never needs: HTTP headers in API responses, CSS and script tags in web content, whitespace and formatting in JSON/XML, metadata fields like IDs and timestamps when only values matter. Build a "relevance filter" as a middleware layer in your tool execution pipeline that strips known noise categories from specific tool types.
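One possible shape for that middleware is a registry of per-tool cleanup functions applied before the result is appended; the tool names and patterns below are illustrative:

    import re

    def strip_html_noise(text: str) -> str:
        # Drop script/style blocks, then collapse remaining tags to plain text.
        text = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', text, flags=re.S)
        return re.sub(r'<[^>]+>', ' ', text)

    def strip_http_headers(text: str) -> str:
        # Keep only the body of a raw HTTP response.
        return text.split("\r\n\r\n", 1)[-1]

    RELEVANCE_FILTERS = {
        "fetch_web_page": [strip_html_noise],
        "http_request": [strip_http_headers],
    }

    def apply_relevance_filters(tool_name: str, result: str) -> str:
        for f in RELEVANCE_FILTERS.get(tool_name, []):
            result = f(result)
        return result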
Technique 3: Result Summarization — Compressing Large Outputs
When trimming and filtering are not sufficient — or when the full data is genuinely needed but a summary would serve the model's reasoning just as well — use an intermediate summarization step.
The key insight is that summarizing tool results is lossy by design: the goal is to preserve exactly the information the agent needs for its next reasoning step, in a much more compact form, and discard everything else.
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

SUMMARIZER_SYSTEM = """You are a result summarizer for an AI agent.
Your job is to compress tool outputs while preserving all information needed for the agent to complete its task.
Output only the compressed summary: no preamble, no commentary.
Use bullet points for lists. Use code blocks only for actual code.
Be ruthlessly concise."""

async def summarize_tool_result(
    tool_name: str,
    tool_result: str,
    task_context: str,
    max_summary_tokens: int = 300
) -> str:
    """Summarize a large tool result using a fast, cheap model."""
    # Only summarize if the result is actually large
    original_tokens = count_tokens(tool_result)
    if original_tokens <= max_summary_tokens:
        return tool_result
    prompt = f"""Task context: {task_context}
Tool called: {tool_name}
Tool output ({original_tokens} tokens):
{tool_result}
Summarize this output, keeping only information relevant to the task context.
Target: under {max_summary_tokens} tokens."""
    response = await client.messages.create(
        model="claude-haiku-4-5",  # Use fastest, cheapest model
        max_tokens=max_summary_tokens + 50,
        system=SUMMARIZER_SYSTEM,
        messages=[{"role": "user", "content": prompt}]
    )
    summary = response.content[0].text
    # Add a note indicating summarization occurred
    return f"[Summarized from {original_tokens}-token output]\n{summary}"

async def execute_tool_with_compression(
    tool_name: str,
    tool_input: dict,
    task_context: str,
    compression_threshold: int = 500
) -> str:
    """Execute a tool and compress large results before adding to context."""
    raw_result = await execute_tool(tool_name, tool_input)
    result_tokens = count_tokens(raw_result)
    if result_tokens > compression_threshold:
        print(f"Tool '{tool_name}' returned {result_tokens} tokens. Compressing...")
        compressed = await summarize_tool_result(
            tool_name, raw_result, task_context, max_summary_tokens=300
        )
        print(f"Compressed to {count_tokens(compressed)} tokens.")
        return compressed
    return raw_result
Task-aware summarization is more effective than generic summarization because it knows what to preserve:
TASK_SPECIFIC_PROMPTS = {
    "code_review": "Preserve: bugs found, security issues, performance problems, code smells. Omit: style suggestions, minor formatting issues.",
    "test_generation": "Preserve: function signatures, parameter types, return types, edge cases mentioned. Omit: implementation details, comments, docs.",
    "bug_investigation": "Preserve: error messages, stack traces, timestamps of failures, affected components. Omit: successful operations, health checks.",
    "documentation": "Preserve: public API signatures, parameter descriptions, return values. Omit: internal implementation, private methods."
}

async def task_aware_summarize(
    tool_result: str,
    task_type: str,
    tool_name: str
) -> str:
    preservation_rule = TASK_SPECIFIC_PROMPTS.get(task_type, "Preserve all key information.")
    prompt = f"""Summarize this {tool_name} output concisely.
Preservation rule: {preservation_rule}
Output:
{tool_result}"""
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Tip: When implementing result summarization, always preserve verbatim any error messages, stack traces, and exact code snippets — never paraphrase these. The model needs exact error text to diagnose issues and exact code to avoid introducing transcription errors. Summarize the surrounding context, but lock critical exact-match content using a "preserve verbatim" instruction to the summarizer.
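One way to enforce that rule, sketched below: split out lines that look like errors or stack trace frames before summarization and reattach them verbatim afterward (the patterns are illustrative, and summarize_tool_result is the helper defined above):

    import re

    VERBATIM_PATTERNS = [
        r'Traceback \(most recent call last\):',
        r'^\s*File ".*", line \d+',
        r'\b(Error|Exception|AssertionError)\b',
    ]

    async def summarize_preserving_errors(tool_name: str, result: str, task_context: str) -> str:
        lines = result.splitlines()
        keep_verbatim = [any(re.search(p, l) for p in VERBATIM_PATTERNS) for l in lines]
        verbatim = [l for l, keep in zip(lines, keep_verbatim) if keep]
        rest = "\n".join(l for l, keep in zip(lines, keep_verbatim) if not keep)
        # Summarize only the non-critical text; error text is reattached unchanged.
        summary = await summarize_tool_result(tool_name, rest, task_context)
        if verbatim:
            summary += "\n\n[Preserved verbatim]\n" + "\n".join(verbatim)
        return summary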
Technique 4: Structured Extraction — Returning Schema-Matched Outputs
Instead of returning free-form text that the model must parse, configure tools to return structured JSON that maps precisely to what the model needs for its next reasoning step.
Structured outputs are consistently smaller than their unstructured equivalents and eliminate the model's need to parse and extract information from prose.
def get_test_results_verbose() -> str:
    return """
Test Suite: UserService
========================
Running 47 tests...
PASSED: test_create_user (0.023s)
PASSED: test_update_user (0.018s)
PASSED: test_delete_user (0.021s)
FAILED: test_create_user_duplicate_email
  AssertionError: Expected 409 status code, got 201
  at test_create_user_duplicate_email (tests/user.test.ts:42)
PASSED: test_get_user_by_id (0.015s)
... [43 more test results]
Results: 46 passed, 1 failed
Duration: 2.341s
"""
# This free-form output: ~180 tokens

def get_test_results_structured() -> dict:
    return {
        "suite": "UserService",
        "passed": 46,
        "failed": 1,
        "duration_ms": 2341,
        "failures": [
            {
                "test": "test_create_user_duplicate_email",
                "error": "Expected 409 status code, got 201",
                "location": "tests/user.test.ts:42"
            }
        ]
    }
# Structured JSON: ~80 tokens (55% smaller, more usable)
For QA engineers building test automation agents, this pattern is critical. Test runner output is notoriously verbose, with each test run potentially generating thousands of lines. A structured result that returns only failure details reduces tool output by 90%+ for typical test runs with high pass rates.
import json
import subprocess
import tempfile

class StructuredTestRunner:
    def execute(self, test_pattern: str | None = None) -> str:
        # Write the machine-readable report (pytest-json-report plugin) to a temp
        # file, then load it and return only what the agent needs.
        report_file = tempfile.NamedTemporaryFile(suffix=".json", delete=False)
        report_file.close()
        cmd = ["pytest", "--json-report", f"--json-report-file={report_file.name}"]
        if test_pattern:
            cmd.append(test_pattern)
        proc = subprocess.run(cmd, capture_output=True, text=True)
        try:
            with open(report_file.name) as f:
                full_report = json.load(f)
            # Extract only what the agent needs
            failures = [
                {
                    "test": t["nodeid"],
                    "error": t["call"]["longrepr"][:500] if "call" in t else "Setup failed",
                    "duration": round(t["call"]["duration"], 3) if "call" in t else 0
                }
                for t in full_report["tests"]
                if t["outcome"] == "failed"
            ]
            summary = {
                "passed": full_report["summary"].get("passed", 0),
                "failed": full_report["summary"].get("failed", 0),
                "skipped": full_report["summary"].get("skipped", 0),
                "duration_ms": int(full_report["duration"] * 1000),
                "failures": failures
            }
            return json.dumps(summary, separators=(',', ':'))
        except (OSError, json.JSONDecodeError, KeyError):
            # Fallback: return trimmed stderr for debugging
            return trim_result(proc.stderr, max_tokens=300)
Tip: For every tool that currently returns unstructured text, create a data model (Pydantic, TypeScript interface, or JSON schema) that represents only the information the agent actually uses from that tool's output. Then update the tool to return only that structure. This exercise often reveals that agents are carrying 5–10x more data in tool results than they actually reference in their reasoning.
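As a sketch of that exercise applied to the test runner above, a Pydantic model capturing only the fields the agent actually reasons about (the field selection is illustrative):

    from pydantic import BaseModel

    class TestFailure(BaseModel):
        test: str
        error: str
        location: str

    class TestRunResult(BaseModel):
        passed: int
        failed: int
        skipped: int = 0
        duration_ms: int
        failures: list[TestFailure] = []

    # The tool returns only this shape; anything the agent never references
    # (per-test timings, stdout of passing tests, environment info) is dropped.
    result = TestRunResult(passed=46, failed=1, duration_ms=2341, failures=[
        TestFailure(test="test_create_user_duplicate_email",
                    error="Expected 409 status code, got 201",
                    location="tests/user.test.ts:42"),
    ])
    compact = result.model_dump_json()  # Pydantic v2; use .json() on v1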
Technique 5: Result Caching and Deduplication
Many tool calls in a session retrieve the same data repeatedly. An agent investigating a bug might call read_file on the same configuration file three times across different conversation turns. Each retrieval adds the same content to the growing context.
Implement result caching at the agent level:
import hashlib
import json
import time
from typing import Optional

class ToolResultCache:
    def __init__(self, ttl_seconds: int = 300):
        self.cache = {}
        self.ttl = ttl_seconds

    def _cache_key(self, tool_name: str, tool_input: dict) -> str:
        input_str = json.dumps(tool_input, sort_keys=True)
        return hashlib.md5(f"{tool_name}:{input_str}".encode()).hexdigest()

    def get(self, tool_name: str, tool_input: dict) -> Optional[str]:
        key = self._cache_key(tool_name, tool_input)
        if key in self.cache:
            result, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return result
            del self.cache[key]
        return None

    def set(self, tool_name: str, tool_input: dict, result: str):
        key = self._cache_key(tool_name, tool_input)
        self.cache[key] = (result, time.time())

CACHEABLE_TOOLS = {"read_file", "list_directory", "get_schema", "get_documentation"}
cache = ToolResultCache(ttl_seconds=600)

def execute_tool_with_caching(tool_name: str, tool_input: dict) -> tuple[str, bool]:
    """Returns (result, was_cached)."""
    if tool_name in CACHEABLE_TOOLS:
        cached = cache.get(tool_name, tool_input)
        if cached:
            return cached, True
    result = execute_tool(tool_name, tool_input)
    if tool_name in CACHEABLE_TOOLS:
        cache.set(tool_name, tool_input, result)
    return result, False
With caching, when the model calls read_file on the same path multiple times, subsequent calls do not re-execute the tool — and more importantly, the result can be referenced by a pointer in the conversation rather than re-inserted as a full content block.
Tip: Implement a "result reference" system for highly-repeated tool calls: the first call returns the full result and assigns it an ID; subsequent calls to the same tool with the same inputs return a compact reference like [Previously retrieved: read_file("config.yaml") — result #3]. This requires the model to understand the reference convention, which you establish in the system prompt. It is most effective in long agent sessions where the same files or queries appear many times.
Summary
Tool result handling is the second major axis of tool-related token optimization, and it is often more impactful than schema optimization because tool results grow dynamically and persist across conversation turns. The five core techniques — result trimming, filtering, summarization, structured extraction, and caching — each address a different source of result bloat. In practice, a mature agent implementation applies all five in combination: hard size limits catch runaway results, filtering removes structural noise, summarization compresses when needed, structured outputs enforce discipline on what data is returned, and caching prevents repeated retrieval of identical content. Together, these techniques commonly reduce tool result token costs by 60–80% compared to naive implementations that pass raw tool outputs directly to the model.