The Problem: Agents That Re-Read What They Already Know
In a well-instrumented agentic system, one of the most striking inefficiencies is this: the same file gets read three times across three iterations because the agent "forgets" it already read it. The same search query is issued twice because the second planning step did not recognize it as a repeat of the first. The same code snippet is analyzed in multiple contexts because no one told the agent it had already been analyzed.
This is not a model intelligence failure. It is an architectural failure. The model does not have persistent state outside of its context window. If the fact "file X was already read and the result was Y" is not in the current context, the model will read file X again.
Result caching and deduplication add a persistent memory layer outside the context window, enabling agents to avoid re-executing work they have already done. This directly reduces tool call frequency, tool result injection, and the token bloat that comes with it.
Two Levels of Caching in Agentic Systems
Caching in agentic systems operates at two distinct levels, and confusing them leads to incomplete solutions:
Level 1: LLM Response Caching
Cache the LLM's own output for identical inputs. If two agent calls have the same system prompt, history, and user message, return the cached response instead of making a new API call.
This is most relevant for:
- Repeated planning steps with identical state
- Verification steps that re-run on unchanged outputs
- Calls that retransmit identical tool schemas (the same tools hash to the same cache key)
import hashlib
import json

class LLMResponseCache:
    def __init__(self, backend="redis"):
        # get_cache_backend is assumed to resolve to a Redis-like client
        # exposing get() and setex()
        self.backend = get_cache_backend(backend)

    def _compute_key(self, messages: list, tools: list) -> str:
        # Deterministic hash of the full prompt payload
        payload = json.dumps({
            "messages": [m.dict() for m in messages],
            "tools": tools
        }, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_invoke(self, messages: list, tools: list, llm) -> tuple[dict, bool]:
        """Return (response_dict, cache_hit)."""
        key = self._compute_key(messages, tools)
        cached = self.backend.get(key)
        if cached:
            return json.loads(cached), True  # cache hit
        response = llm.invoke(messages, tools=tools)
        self.backend.setex(
            key,
            3600,  # 1 hour TTL
            json.dumps(response.dict())
        )
        return response.dict(), False  # cache miss
Note: LLM response caching has significant caveats — it is only safe when the same inputs should always produce the same outputs. Avoid it for creative tasks or tasks where the model's randomness is intentional.
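One mechanical way to respect that caveat, sketched below under the assumption that the caller knows the call's sampling temperature: consult the cache only for deterministic calls and fall through to a live invocation otherwise. (temperature=0 makes outputs far more repeatable but is not a strict determinism guarantee with every provider.)

def invoke_with_optional_cache(cache: "LLMResponseCache", messages: list,
                               tools: list, llm, temperature: float = 0.0):
    # Hypothetical wrapper: only deterministic calls go through the cache.
    # Sampled outputs (temperature > 0) are intentionally non-repeatable.
    if temperature == 0.0:
        return cache.get_or_invoke(messages, tools, llm)
    return llm.invoke(messages, tools=tools).dict(), False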
Level 2: Tool Result Caching
Cache the outputs of tool executions so that if the same tool is called with the same arguments, the cached result is returned instead of re-executing the tool. This is the higher-value and safer caching layer for most agentic systems.
import hashlib
import json
from typing import Callable

class ToolResultCache:
    def __init__(self):
        self.cache = {}  # In-memory for single-session; use Redis for cross-session

    def _key(self, tool_name: str, args: dict) -> str:
        normalized = json.dumps(args, sort_keys=True)
        return f"{tool_name}:{hashlib.md5(normalized.encode()).hexdigest()}"

    def get(self, tool_name: str, args: dict) -> tuple[str | None, bool]:
        key = self._key(tool_name, args)
        return self.cache.get(key), key in self.cache

    def set(self, tool_name: str, args: dict, result: str) -> None:
        key = self._key(tool_name, args)
        self.cache[key] = result

    def execute_with_cache(self, tool_name: str, args: dict,
                           executor: Callable[[str, dict], str]) -> tuple[str, bool]:
        cached_result, hit = self.get(tool_name, args)
        if hit:
            return cached_result, True  # cache hit
        result = executor(tool_name, args)
        self.set(tool_name, args, result)
        return result, False  # cache miss
Tip: Always cache at the tool result level before considering LLM response caching. Tool result caching is safe whenever the underlying operation is deterministic (reading an unchanged file yields the same output for the same input), while LLM response caching requires deciding whether the use case tolerates repeated, deterministic outputs. Tool result caching alone typically eliminates 20–40% of total tool calls in iterative agentic loops.
Identifying Cacheable vs. Non-Cacheable Operations
Not all tool operations should be cached. Here is a decision framework:
Is the operation deterministic?
→ Does the same input always produce the same output?
→ Will the underlying resource change during the agent session?
CACHEABLE (safe to cache):
✓ File reads (if files don't change during session)
✓ Code parsing / AST analysis
✓ Static documentation lookup
✓ Schema introspection (database schema, API schema)
✓ Grep/search operations on static codebases
✓ Token counting and embedding computation
NOT CACHEABLE (do not cache):
✗ Web searches (results change over time)
✗ Database queries on mutable data
✗ API calls to live services
✗ Tool calls that produce side effects (write file, send email)
✗ Time-sensitive operations (get current date, check service health)
CONDITIONALLY CACHEABLE:
~ Git operations: cacheable per commit hash, not per path
~ Test execution results: cacheable per exact code version
~ AI model calls to sub-agents: cacheable if temperature=0
Implement this as a cache policy configuration:
CACHE_POLICY = {
"read_file": {"cacheable": True, "ttl": 3600},
"search_code": {"cacheable": True, "ttl": 3600},
"get_schema": {"cacheable": True, "ttl": 7200},
"run_tests": {"cacheable": True, "ttl": 300}, # Short TTL — code may change
"web_search": {"cacheable": False},
"write_file": {"cacheable": False},
"run_command": {"cacheable": False},
"get_current_time": {"cacheable": False}
}
def execute_tool_with_policy(tool_name: str, args: dict) -> str:
    # Unknown tools default to non-cacheable
    policy = CACHE_POLICY.get(tool_name, {"cacheable": False})
    if policy["cacheable"]:
        # tool_cache is a ToolResultCache; raw_tool_executor dispatches the
        # real tool; log_cache_event feeds the metrics discussed later
        result, hit = tool_cache.execute_with_cache(tool_name, args, raw_tool_executor)
        log_cache_event(tool_name, hit)
        return result
    else:
        return raw_tool_executor(tool_name, args)
Tip: For software engineers: annotate your tool functions with cache policy metadata directly in the tool definition. This keeps the caching intent co-located with the tool implementation and makes it visible to code reviewers. A simple @cacheable(ttl=3600) decorator on tool functions is both self-documenting and directly usable to configure your cache layer.
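A minimal sketch of such a decorator (the cacheable name and the _cache_policy attribute are illustrative, not a standard library API):

def cacheable(ttl: int = 3600):
    """Attach cache policy metadata to a tool function."""
    def decorator(fn):
        fn._cache_policy = {"cacheable": True, "ttl": ttl}
        return fn
    return decorator

@cacheable(ttl=3600)
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

# The cache layer can then build its policy table by inspecting the tools:
# CACHE_POLICY = {fn.__name__: getattr(fn, "_cache_policy", {"cacheable": False})
#                 for fn in registered_tools}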
Deduplication: Detecting and Suppressing Redundant Operations
Caching handles the case where the exact same tool call is made twice. Deduplication handles a subtler case: semantically equivalent but syntactically different requests.
Examples of semantic duplicates that caching alone misses:
- read_file("src/auth.py") vs. read_file("./src/auth.py") (path normalization)
- search_code("authenticate user") vs. search_code("user authentication") (semantic equivalence)
- Reading lines 1-100 of a file, then reading lines 50-150 (overlapping range; see the range-coverage sketch after the normalization code below)
Path and Argument Normalization
import os
import re

def normalize_file_path(path: str) -> str:
    """Normalize file paths to canonical form for cache key computation."""
    return os.path.realpath(path)  # realpath already returns an absolute path

def normalize_search_query(query: str) -> str:
    """Normalize search queries: lowercase, sort terms, strip punctuation."""
    terms = re.findall(r'\w+', query.lower())
    return ' '.join(sorted(terms))

def normalize_args(tool_name: str, args: dict) -> dict:
    normalizers = {
        "read_file": lambda a: {**a, "path": normalize_file_path(a["path"])},
        "search_code": lambda a: {**a, "query": normalize_search_query(a["query"])},
        "read_file_range": lambda a: normalize_file_range(a)  # sketched below
    }
    normalizer = normalizers.get(tool_name)
    return normalizer(args) if normalizer else args
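The normalize_file_range helper referenced above, and the overlapping-range case from the examples list, need more than exact-key matching: lines 50-150 are not a byte-for-byte repeat of lines 1-100, so no cache key will ever collide. A minimal sketch, assuming a read_file_range tool that takes path, start, and end arguments, is to canonicalize the range and track which line ranges have already been read per file, trimming new requests down to the uncovered portion:

def normalize_file_range(args: dict) -> dict:
    """Canonicalize a ranged-read request: normalize the path, order the bounds."""
    start, end = sorted((args["start"], args["end"]))
    return {**args, "path": normalize_file_path(args["path"]),
            "start": start, "end": end}

class RangeCoverageTracker:
    """Track which line ranges of each file have already been read."""

    def __init__(self):
        self.covered: dict[str, list[tuple[int, int]]] = {}  # path -> [(start, end)]

    def uncovered_portion(self, path: str, start: int, end: int) -> tuple[int, int] | None:
        """Return the sub-range still needing a read, or None if fully covered."""
        for s, e in self.covered.get(path, []):
            if s <= start and end <= e:
                return None           # fully covered by a prior read
            if s <= start <= e < end:
                start = e + 1         # front overlap: trim the start
            elif start < s <= end <= e:
                end = s - 1           # back overlap: trim the end
        return (start, end)

    def record(self, path: str, start: int, end: int) -> None:
        self.covered.setdefault(path, []).append((start, end))

A fuller implementation would merge adjacent intervals and stitch the cached content back together; the sketch only decides how much of a new read is actually necessary.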
Semantic Deduplication with Embeddings
For search-type tools, use embedding similarity to detect semantically redundant queries:
from sentence_transformers import SentenceTransformer
import numpy as np
class SemanticDeduplicator:
def __init__(self, similarity_threshold: float = 0.92):
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.executed_queries = [] # (query, embedding, result)
self.threshold = similarity_threshold
def find_duplicate(self, new_query: str) -> str | None:
if not self.executed_queries:
return None
new_embedding = self.model.encode(new_query)
for prior_query, prior_embedding, prior_result in self.executed_queries:
similarity = np.dot(new_embedding, prior_embedding) / (
np.linalg.norm(new_embedding) * np.linalg.norm(prior_embedding)
)
if similarity >= self.threshold:
print(f"Dedup: '{new_query}' ≈ '{prior_query}' (sim={similarity:.2f})")
return prior_result
return None
def record(self, query: str, result: str) -> None:
embedding = self.model.encode(query)
self.executed_queries.append((query, embedding, result))
deduplicator = SemanticDeduplicator(similarity_threshold=0.92)
def deduplicated_search(query: str) -> str:
    cached_result = deduplicator.find_duplicate(query)
    if cached_result is not None:  # an empty result string is still a valid hit
        return f"[CACHED] {cached_result}"
    result = execute_search(query)  # execute_search: your underlying search tool
    deduplicator.record(query, result)
    return result
Tip: Tune the similarity threshold for semantic deduplication carefully. Too high (0.99) and you miss obvious duplicates. Too low (0.80) and you return cached results for genuinely different queries. A threshold of 0.90–0.93 works well for most technical search queries. Instrument your deduplication layer to log similarity scores during testing so you can calibrate for your specific domain.
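One way to do that instrumentation, as a minimal sketch: record every pairwise similarity the deduplicator computes, then inspect the distribution and the near-misses offline to pick a threshold for your domain.

import statistics

similarity_log: list[tuple[str, str, float]] = []  # (new_query, prior_query, score)

def log_similarity(new_query: str, prior_query: str, score: float) -> None:
    # Call this from SemanticDeduplicator.find_duplicate for every comparison made
    similarity_log.append((new_query, prior_query, score))

def calibration_report(threshold: float) -> None:
    """Print the score distribution and the pairs just below the threshold."""
    scores = sorted(s for _, _, s in similarity_log)
    if not scores:
        return
    print(f"comparisons={len(scores)}  median={statistics.median(scores):.2f}  "
          f"max={scores[-1]:.2f}")
    near_misses = [t for t in similarity_log if threshold - 0.05 <= t[2] < threshold]
    for a, b, s in near_misses[:10]:  # manually review pairs just below threshold
        print(f"  near miss ({s:.2f}): '{a}' vs '{b}'")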
Cross-Session Caching: Persisting Results Between Agent Runs
Within a single agent session, in-memory caching is sufficient. But many agentic workflows involve recurring tasks — the same codebase is analyzed daily, the same documentation is queried repeatedly, the same test suite is run across multiple iterations of the same feature.
Cross-session caching persists tool results in durable storage and reuses them across separate agent invocations.
import redis
import hashlib
import json
from datetime import datetime, timezone

class PersistentToolCache:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)

    def _key(self, tool_name: str, args: dict, version_tag: str = "") -> str:
        payload = json.dumps(
            {"tool": tool_name, "args": args, "version": version_tag},
            sort_keys=True
        )
        return f"tool_cache:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"

    def get(self, tool_name: str, args: dict, version_tag: str = "") -> dict | None:
        key = self._key(tool_name, args, version_tag)
        raw = self.client.get(key)
        return json.loads(raw) if raw else None

    def set(self, tool_name: str, args: dict, result: str,
            ttl: int = 3600, version_tag: str = "") -> None:
        key = self._key(tool_name, args, version_tag)
        entry = {
            "result": result,
            "cached_at": datetime.now(timezone.utc).isoformat(),
            "tool": tool_name
        }
        self.client.setex(key, ttl, json.dumps(entry))
For codebase analysis agents, the version tag pattern is especially useful: use the git commit hash as the version tag. This ensures cached results are valid only for the specific commit being analyzed.
import subprocess

def get_git_commit_hash(repo_path: str) -> str:
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
commit_hash = get_git_commit_hash("/path/to/repo")
cache.set(
tool_name="analyze_file",
args={"path": "src/auth.py"},
result=analysis_result,
ttl=86400, # 24 hours
version_tag=commit_hash # Invalidate when code changes
)
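The read side of the same pattern, as a minimal sketch (assuming cache is the PersistentToolCache above and analyze_file is your analysis function): a lookup keyed on the current commit falls through to fresh analysis only when the code has actually changed.

def analyze_with_version_cache(cache: PersistentToolCache,
                               repo_path: str, file_path: str) -> str:
    commit_hash = get_git_commit_hash(repo_path)
    entry = cache.get("analyze_file", {"path": file_path}, version_tag=commit_hash)
    if entry is not None:
        return entry["result"]  # still valid for this exact commit
    result = analyze_file(file_path)  # analyze_file: your analysis tool (assumed)
    cache.set("analyze_file", {"path": file_path}, result,
              ttl=86400, version_tag=commit_hash)
    return result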
Tip: For product managers: cross-session tool result caching is the technical foundation of "incremental analysis" — the ability to re-run an agentic analysis workflow and only process what has changed since the last run. When proposing agentic features, frame this as a "smart re-run" capability: the first run analyzes everything; subsequent runs only process what's new or changed. This dramatically reduces both cost and runtime for recurring workflows like daily security scans, weekly documentation updates, or sprint-by-sprint test coverage analysis.
The Tool Result Summary Pattern
Even when caching works perfectly, re-injecting a cached file read result (2,000 tokens) into every subsequent iteration is still expensive. The Tool Result Summary pattern compresses cached results before injecting them:
import hashlib
from langchain_core.messages import HumanMessage

RESULT_SUMMARY_CACHE = {}  # Stores compressed versions of tool results

def get_compressed_result(tool_name: str, args: dict,
                          full_result: str, context_budget: int) -> str:
    """Return a compressed version of a tool result if the full version is too large."""
    # count_tokens and llm are assumed to come from your stack
    full_token_count = count_tokens(full_result)
    if full_token_count <= context_budget:
        return full_result  # Full result fits — no compression needed
    # Check if we already have a compressed version; use a stable content hash
    # (the built-in hash() is randomized per process)
    content_hash = hashlib.sha256(full_result.encode()).hexdigest()[:16]
    cache_key = f"compressed:{tool_name}:{content_hash}"
    if cache_key in RESULT_SUMMARY_CACHE:
        return RESULT_SUMMARY_CACHE[cache_key]
    # Compress the result
    compression_prompt = f"""
This is the output of a {tool_name} operation.
Compress it to under {context_budget} tokens while preserving all
technically significant information. Remove boilerplate, comments
from standard library code, and repeated patterns.

Output:
{full_result[:8000]}
"""
    compressed = llm.invoke([HumanMessage(content=compression_prompt)]).content
    RESULT_SUMMARY_CACHE[cache_key] = compressed
    return compressed
The first time a tool result is used, the full result is available. For all subsequent iterations, the compressed version (produced once at a small LLM cost) is used instead of the full result.
Tip: The threshold for triggering compression should be relative to your context budget, not a fixed number. A result that is "too large" for an agent with a 2,000-token context budget may be perfectly fine for an agent with a 10,000-token budget. Configure your compression threshold as a percentage of available context (e.g., "compress any result that would consume more than 25% of the remaining context budget").
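A minimal sketch of that policy (the 25% figure and the remaining_tokens value are illustrative; how you track remaining context is up to your loop):

COMPRESSION_FRACTION = 0.25  # Compress results that would take >25% of remaining context

def effective_budget(remaining_tokens: int) -> int:
    """Derive the compression trigger from the live context budget."""
    return int(remaining_tokens * COMPRESSION_FRACTION)

# Usage: the same tool result may or may not be compressed depending on headroom
result = get_compressed_result("read_file", {"path": "src/auth.py"},
                               full_result, effective_budget(remaining_tokens))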
Measuring Cache Effectiveness
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    total_tool_calls: int = 0
    cache_hits: int = 0
    tokens_saved: int = 0
    dedup_hits: int = 0

    @property
    def hit_rate(self) -> float:
        if self.total_tool_calls == 0:
            return 0.0
        return (self.cache_hits + self.dedup_hits) / self.total_tool_calls

    def report(self):
        print(f"Tool calls: {self.total_tool_calls}")
        print(f"Cache hits: {self.cache_hits} "
              f"({self.cache_hits/max(1, self.total_tool_calls)*100:.1f}%)")
        print(f"Dedup hits: {self.dedup_hits}")
        print(f"Total tokens saved: {self.tokens_saved:,}")
        print(f"Overall avoidance rate: {self.hit_rate*100:.1f}%")
Target benchmarks based on task type:
- Codebase analysis tasks: expect 40–60% cache hit rate (many files read multiple times)
- Documentation generation: expect 20–35% hit rate (more unique operations)
- Test automation: expect 50–70% hit rate (same test infrastructure queried repeatedly)
- Data processing pipelines: expect 30–50% hit rate
Tip: Report cache hit rates alongside token consumption in your agentic system's observability dashboards. A drop in cache hit rate that is not explained by new task types is often a sign that prompts have been modified in ways that cause the agent to phrase tool calls differently (breaking cache key matching), or that a normalization function is missing. Treat cache hit rate as a first-class reliability metric.
Summary
Result caching and deduplication attack the problem of repeated work from two directions: caching stores and reuses exact-match results, while deduplication detects and prevents semantically equivalent repeat operations. Together they typically eliminate 20–50% of tool calls in iterative agentic workflows.
The cache policy framework (cacheable vs. non-cacheable vs. conditionally cacheable) prevents caching from introducing stale data bugs. Cross-session caching with version tags (git commit hashes for code analysis) extends these savings across multiple agent runs. The Tool Result Summary pattern addresses the companion problem of repeated re-injection of large cached results.
With caching and deduplication in place, your agentic system stops paying for the same information twice — one of the most reliable and highest-ROI optimizations available at the loop level.