
How MCP Extends AI Agents Beyond Context Window

Why the Context Window Is a Bottleneck for AI Agents

The context window is not a soft constraint — it is a hard architectural boundary. Every token that enters the model's context adds compute cost at inference time, and the total number of tokens the model can reason over simultaneously is capped. As of mid-2026, frontier models range from 128K to over 1M token context windows. This sounds large until you encounter real developer workflows.

A typical code investigation session might include: a system prompt with tool definitions (~3K tokens), the files under review (~1K tokens per file, 20 files = 20K), recent git log (~2K), a Sentry error trace (~1.5K), relevant test files (~5K), and a Confluence spec page (~4K). That is already roughly 35K tokens before any conversation history accumulates. With a 200K window, a complex debugging session with several dozen tool call round-trips fills the remainder within 6-8 hours; with a 128K model, you hit the limit mid-investigation.
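That arithmetic is worth making explicit; the figures below are the illustrative numbers from this example, not measurements:

# Illustrative pre-load budget from the example above (values in tokens).
budget = {
    "system_prompt_and_tools": 3_000,
    "files_under_review": 20 * 1_000,   # 20 files at ~1K tokens each
    "git_log": 2_000,
    "sentry_trace": 1_500,
    "test_files": 5_000,
    "confluence_spec": 4_000,
}

total = sum(budget.values())
print(f"Pre-loaded context: ~{total:,} tokens")            # ~35,500 tokens
print(f"Headroom in a 128K window: ~{128_000 - total:,}")  # before any history accumulates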

The naive solution — summarize and truncate — introduces a subtler problem: the model loses access to exact facts. A truncated stack trace becomes "an error occurred in the auth layer." A summarized database schema loses column-level detail that would have caught a type mismatch. Lossy compression of technical artifacts is a reliability hazard in AI agent workflows, not just a convenience issue.

The underlying architectural problem is that context-stuffing conflates retrieval with reasoning. Everything the model might need is loaded upfront, because once the inference call starts, the model cannot fetch new information. This is antithetical to how experienced developers actually work: they look up what they need, when they need it, rather than reading every relevant document before touching the keyboard.

MCP changes this by making external retrieval an action the model can take mid-inference, at the moment it determines information is needed. The context window is no longer a warehouse that must be pre-stocked — it becomes a working memory that is populated on demand. The model decides it needs to see the current database schema, calls the MCP tool to retrieve it, reads the relevant portion, and discards the rest. This is query-on-demand rather than pre-load-everything.

The implications compound: with MCP, the effective knowledge surface accessible to an agent in a single session is theoretically unbounded. The practical limit shifts from "what fits in the context window" to "how many tool calls are within latency and cost budget for the task."

Tips
- Benchmark context consumption per workflow type in your team's agent sessions. Many teams discover that 40-60% of their context is occupied by pre-loaded reference material that could be fetched on demand via MCP resources.
- Design your MCP resource endpoints to return focused, query-scoped responses rather than full documents. A get_schema_for_table tool that returns a single table's schema is more context-efficient than a get_full_database_schema resource that dumps everything (see the sketch after these tips).
- Monitor for "context bloat" patterns — when agents load the same large resource multiple times within a session because there is no caching layer. Implement server-side caching in your MCP servers for expensive reads.
- When evaluating models for agentic workflows, test with MCP-connected sessions specifically. A model with a 128K context window using on-demand MCP fetching often outperforms a 1M context window model using pre-loaded context for complex multi-system tasks.
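
As an illustration of the query-scoped tool tip above, here is a minimal sketch using the MCP Python SDK's FastMCP helper. The server name, the get_schema_for_table signature, and the fetch_table_ddl stub are assumptions for illustration, not a prescribed design.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("warehouse-schema")

def fetch_table_ddl(table_name: str) -> str:
    # Placeholder: a real server would query the warehouse's metadata API here.
    return f"CREATE TABLE {table_name} (...);"

@mcp.tool()
def get_schema_for_table(table_name: str) -> str:
    """Return the DDL for a single table, keeping the response query-scoped."""
    return fetch_table_ddl(table_name)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default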


How MCP Tools Let Agents Query External Systems on Demand

The mechanism by which MCP enables on-demand querying is worth examining precisely, because the details determine both performance characteristics and prompt engineering strategies.

When an agent encounters a point in its reasoning where external information is needed, the model emits a tool call in a structured format. The host intercepts this, routes it to the appropriate MCP client, which transmits the request to the MCP server over the negotiated transport. The server executes the query, constructs the response, and returns it. The response is injected into the conversation as a tool result message, and the model continues reasoning with the new information now in context.
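
Expressed as host-side pseudocode, that loop looks roughly like the sketch below. The call_model helper and the shape of its reply stand in for whatever LLM API the host uses; sessions maps server names to connected MCP ClientSession objects, and session.call_tool is the SDK call that actually reaches the server.

async def agent_turn(messages, sessions):
    # Loop until the model produces a final answer instead of a tool call.
    while True:
        reply = await call_model(messages)        # hypothetical LLM API call
        if not reply.tool_calls:                  # no tool call: the turn is done
            messages.append({"role": "assistant", "content": reply.text})
            return messages
        for call in reply.tool_calls:
            session = sessions[call["server"]]    # e.g. "sentry", "github", "jira"
            result = await session.call_tool(call["name"], call["arguments"])
            # Inject the result as a tool message so the model keeps reasoning with it in context.
            messages.append({"role": "tool", "name": call["name"], "content": result.content})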

Sequence diagram — on-demand resource fetch:

User: "Why is the checkout endpoint returning 500s?"

Agent reasoning: I need to see recent Sentry errors for the checkout endpoint.

[Tool call emitted]
  → MCP Client → Sentry MCP Server
  → sentry: list_issues({ project: "backend", query: "endpoint:/checkout status:unresolved" })
  ← [12 issues returned, most recent: NullPointerException in CartService.applyDiscount()]

Agent reasoning: I need the CartService source file to diagnose this.

[Tool call emitted]
  → MCP Client → GitHub MCP Server
  → github: get_file_contents({ owner: "acme", repo: "backend", path: "src/services/CartService.java" })
  ← [File contents returned]

Agent reasoning: I can see the null check is missing on line 84. Let me check when this was introduced.

[Tool call emitted]
  → github: list_commits({ path: "src/services/CartService.java", since: "2026-04-01" })
  ← [Commits returned: the null check was removed in PR #2341, 3 days ago]

Each tool call adds a round-trip, but each also adds only the specific tokens needed for that step. Contrast this with the context-stuffing approach: to do the same investigation, you would pre-load all recent Sentry issues (large), all potentially relevant source files (very large), and recent commit history (large) — most of which would be irrelevant noise that degrades model reasoning quality.

The tool selection problem is worth addressing directly: how does the model know which tool to call? Tool selection quality is a function of (1) how the tool is described in its JSON Schema description field, (2) the quality of the model's instruction-following on tool use, and (3) the agent's system prompt framing. For complex workflows, providing explicit prompt guidance about which tools to use for which scenarios significantly improves reliability.


Example system prompt guidance:

You have access to the following MCP servers:
- sentry: Use for error investigation, stack traces, and recent issue data
- github: Use for source code lookup, commit history, PR status, and file diffs
- jira: Use for ticket details, sprint context, and acceptance criteria

Investigation workflow:
1. Always start with sentry to retrieve the error before looking at source code
2. Use github file lookup only after identifying a specific file from the stack trace
3. Cross-reference jira ticket details when the error relates to a recent feature change

Do not load entire directories. Fetch individual files only when you have a specific reason.

Tips
- Structure your MCP servers' tool descriptions to answer the implicit question: "When should I call this vs the alternatives?" Ambiguous descriptions cause the model to call the wrong tool or skip calling any tool.
- For agents that use many MCP servers, implement a "tool routing" prompt pattern: describe each server's domain and when to prefer it over alternatives. This reduces tool selection errors by 60-80% in complex multi-server sessions.
- Paginate MCP tool responses for large datasets. Return the first page and include a nextCursor field — let the agent request subsequent pages only if needed. This is far more context-efficient than returning 500 results unconditionally (a sketch follows these tips).
- Cache expensive external API calls at the MCP server layer with short TTLs (60-300 seconds for most dev tool APIs). The agent will often call the same resource multiple times within a session; server-side caching prevents redundant API quota consumption.
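
A minimal sketch of the pagination tip, again using FastMCP; the in-memory issue list and page size are illustrative stand-ins for a real backend query.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")

ISSUES = [f"ISSUE-{i}" for i in range(500)]  # stand-in for a real issue store
PAGE_SIZE = 25

@mcp.tool()
def list_issues(cursor: int = 0) -> dict:
    """Return one page of issues plus a nextCursor the agent can pass back if it needs more."""
    page = ISSUES[cursor : cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE if cursor + PAGE_SIZE < len(ISSUES) else None
    return {"issues": page, "nextCursor": next_cursor}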


Real-World Scenarios: What Becomes Possible with MCP

The gap between "AI coding assistant" and "AI development agent" is largely the gap between answering questions about code and taking informed action on live systems. MCP is the technical enabler that closes this gap. The following scenarios are not speculative — they represent workflows that engineering teams are running in 2026.

Cross-system root cause analysis: A production alert fires. An agent with Sentry MCP, GitHub MCP, and a database metrics MCP can autonomously retrieve the error event, identify the affected code path from the stack trace, look up the git blame for the exact lines, find the PR that introduced the change, read the PR description for intent, query database query duration metrics for the same time window, and compose a root cause analysis with remediation options — in one session, without the developer switching between five browser tabs.

Prompt example:
"Production is showing elevated 500s on /api/orders/create since 14:30 UTC.
Investigate the root cause using Sentry and GitHub. Pull the relevant stack traces,
identify the code change that introduced the regression, and propose a fix.
Also check if there are related open Jira tickets."

Spec-to-code with living documentation: A product spec lives in Confluence. Instead of copy-pasting sections into chat, an agent reads the Confluence page via MCP, identifies the acceptance criteria, generates a feature scaffold, writes tests based on the criteria, and creates a GitHub PR with the Confluence page linked in the description.

Prompt example:
"Read the Confluence spec at page ID 884921 (Authentication Revamp v2).
Generate the TypeScript interfaces for the new session token format described
in section 'Token Structure', write Zod validation schemas for each, and
create unit tests covering the edge cases listed in 'Validation Rules'.
Open a GitHub PR against the feature/auth-revamp branch."

Autonomous sprint hygiene: An agent with Jira MCP reviews all tickets in the current sprint, flags tickets that have been In Progress for more than 3 days without a linked PR, identifies blocked tickets and summarizes their blockers, and posts a sprint health report as a Jira comment on the sprint epic.

Prompt example:
"Review the current active sprint in project PLAT. For each ticket:
- Check if it has a linked GitHub PR. If In Progress > 3 days without a PR, flag it.
- Check if it is blocked and summarize the blocker.
Post a sprint health summary as a comment on the sprint epic PLAT-890.
Format it as a markdown table: Ticket | Status | Age | PR Status | Blocker."

Multi-environment deployment validation: After a deploy, an agent with access to deployment API, metrics, and Sentry MCPs can automatically query error rates pre/post deploy, check latency P95 deltas, verify service health endpoints, and either approve the release or trigger a rollback — posting a status update to the relevant Slack channel via MCP.
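
Prompt example (service, version, and channel names are illustrative):
"We deployed orders-service v2.14.0 to production at 16:05 UTC.
Compare error rates and P95 latency for the 30 minutes before and after the deploy,
and verify the service health endpoints in each region. Summarize whether the
release looks safe. If error rates are materially worse, recommend a rollback
and post the summary to the #deploys Slack channel."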

These scenarios share a structural pattern: the agent needs to read from multiple live systems, reason over the combined data, and optionally write back to one or more systems. None of this is achievable without MCP (or its functional equivalent). Pre-loading the data makes the context unmanageable. Direct API calls require bespoke integration code per scenario. MCP provides the composable, standardized layer that makes these workflows reproducible and maintainable.

Tips
- When planning an MCP-powered workflow, map the read/write graph first: which systems does the agent need to read from, which does it need to write to, and what is the dependency order? This exposes which MCP servers you need and whether any require human approval gates before writes.
- For write-path MCP tools (anything that creates, updates, or deletes), implement explicit confirmation prompts in your host configuration. Do not automate writes to production systems without a human-in-the-loop policy for the first 90 days of any new workflow.
- Log every MCP tool call with input arguments and an output summary in your observability pipeline. Cross-system workflows that fail mid-way need traces to diagnose which step failed, what data it was working with, and what the server returned. A minimal logging wrapper is sketched after these tips.
- Start with read-only MCP scenarios and validate agent behavior thoroughly before enabling write tools. This lets you tune the prompts and trust the agent's tool selection before it has the ability to modify live data.
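
One lightweight way to implement the logging tip for servers you write yourself is a decorator around each tool handler. The logger name and the 200-character output summary below are arbitrary choices, not part of any MCP API.

import json
import logging
import time

logger = logging.getLogger("mcp.tool_calls")

def logged(tool_fn):
    """Record arguments, outcome, duration, and a short output summary for every call."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        record = {"tool": tool_fn.__name__, "arguments": kwargs}
        try:
            result = tool_fn(*args, **kwargs)
            record["outcome"] = "ok"
            record["output_summary"] = str(result)[:200]
            return result
        except Exception as exc:
            record["outcome"] = f"error: {exc!r}"
            raise
        finally:
            record["duration_ms"] = round((time.monotonic() - start) * 1000)
            logger.info(json.dumps(record, default=str))
    return wrapper

# Usage: apply @logged to a tool handler, e.g. directly above get_schema_for_table.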


Performance Considerations: Latency and Token Trade-offs

MCP introduces a new performance dimension to AI agent workflows that does not exist in single-call inference: round-trip latency per tool call. In a session with 10 MCP tool calls, the end-to-end response time is the total inference time plus ten tool call round-trips. For workflows that feel instantaneous today, adding poorly implemented MCP tool calls can make them feel sluggish.

Latency budget per tool call: For local stdio-transport MCP servers (process running on the same machine), tool call overhead is typically 5-50ms. For remote MCP servers over HTTP, you are looking at 50-500ms per call depending on network proximity and server-side query execution time. For MCP servers that wrap third-party APIs (GitHub, Jira, Sentry), the API call itself is the dominant cost: 200-800ms per call is typical for GitHub API, 100-400ms for Sentry, 150-600ms for Jira.

A 15-step autonomous investigation workflow with GitHub + Sentry + Jira MCP calls can easily take 60-90 seconds of wall-clock time. This is not a bug — it reflects the same 60-90 seconds a developer would spend manually navigating those same systems. But it means agents should be designed for asynchronous UX patterns, not synchronous request/response mental models.

Token efficiency vs. call count trade-off: There is a real tension between minimizing context token consumption (many small, focused tool calls) and minimizing latency (fewer calls that return more data). The optimal strategy depends on the model being used. For models with fast inference, making more, smaller calls is often better. For models where inference itself is the latency bottleneck, batching data into fewer calls reduces total session time even if it uses more tokens.




Parallelism: The MCP specification does not prohibit parallel tool calls, and some hosts implement parallel tool execution when the model requests multiple independent tool calls in a single turn. Claude 3.5+ and GPT-4o both support emitting multiple tool calls in one inference response. If your host implements parallel execution, a 10-step workflow where steps 1-3 are independent can run in the latency of max(step1, step2, step3) rather than step1 + step2 + step3. This requires designing your agent prompts to encourage the model to batch independent lookups. A prompt like the following encourages that batching:


"Gather the following in parallel:
1. The Sentry error details for issue SENTRY-4421
2. The source file src/services/PaymentService.ts from the main branch
3. The Jira ticket PLAT-1203

Once you have all three, cross-reference them and identify the root cause."
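
On the host side, that pattern amounts to awaiting the independent call_tool requests together instead of one after another. A sketch, assuming one connected ClientSession per server; the session variables and tool names follow the examples above and are illustrative.

import asyncio

async def gather_context(sentry_session, github_session, jira_session):
    # Three independent lookups run concurrently, so wall-clock time is roughly
    # the slowest of the three rather than their sum.
    error, source, ticket = await asyncio.gather(
        sentry_session.call_tool("get_issue", {"issue_id": "SENTRY-4421"}),
        github_session.call_tool("get_file_contents", {
            "owner": "acme", "repo": "backend",
            "path": "src/services/PaymentService.ts", "ref": "main",
        }),
        jira_session.call_tool("get_issue", {"issue_key": "PLAT-1203"}),
    )
    return error, source, ticket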

Server-side caching: The most reliable MCP latency optimization is implementing caching at the server layer. For reference data (database schemas, Confluence pages, static config files), a 5-minute TTL cache at the MCP server eliminates redundant external API calls across multiple agent sessions. Implement this as a simple in-memory TTL cache in the MCP server process (adding LRU eviction if memory growth becomes a concern) — it does not require infrastructure changes and can reduce API call volume by 70-90% for repeated queries.

import time

# Minimal in-process TTL cache for expensive external reads inside an MCP server.
_cache: dict = {}
CACHE_TTL_SECONDS = 300  # 5 minutes is a reasonable default for reference data

def cached_fetch(key: str, fetch_fn):
    """Return a fresh cached value if one exists, otherwise call fetch_fn and store the result."""
    entry = _cache.get(key)
    if entry is not None:
        value, fetched_at = entry
        if time.monotonic() - fetched_at < CACHE_TTL_SECONDS:
            return value
    value = fetch_fn()
    _cache[key] = (value, time.monotonic())
    return value
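
A tool handler then wraps its expensive read in cached_fetch; query_warehouse below is a placeholder for the real external call.

def get_schema_for_table(table_name: str) -> str:
    # Repeated requests for the same table within the TTL hit the cache instead of the warehouse.
    return cached_fetch(f"schema:{table_name}", lambda: query_warehouse(table_name))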

Tips
- Instrument every MCP server you build with per-tool call duration metrics from day one. Latency regressions in MCP servers silently degrade agent workflow quality — developers perceive it as "the AI getting slower" without identifying the root cause.
- For latency-critical workflows, profile whether the bottleneck is inference time or tool call time before optimizing. In many cases, inference is 70% of total session time and MCP optimization yields diminishing returns.
- Use parallel tool call patterns deliberately in your prompts for workflows with independent data requirements. The reduction in wall-clock time is proportional to the number of parallelizable calls.
- Size your MCP server's returned payloads based on what the model actually uses, not what is convenient to return. Run a few sessions with debug logging and check what fraction of returned content appears in the model's subsequent reasoning — trim aggressively based on this evidence.