
The Delegation Decision Problem

Every orchestration layer in a multi-agent system faces the same recurring question: should this work be handled here, in the current context, or delegated to a sub-agent?

The naive answer is "delegate everything — specialization is good." The naive result is a system with 20 agents, 20 separate context windows, 20 system prompts, 20 tool schema payloads, and a massive orchestration overhead that costs more in tokens and latency than the specialization benefit provides.

The considered answer requires understanding the token economics of delegation. Spawning a sub-agent has a fixed overhead cost: the sub-agent's system prompt, its tool schemas, the context payload it needs (including the handoff from the parent), and the parsing/routing logic. Delegation makes economic sense only when the work to be delegated is complex enough and bounded enough that the sub-agent's total cost is lower than handling it in the main context.

This topic gives you the frameworks and rules of thumb to make that delegation decision correctly — for software engineers designing agent architectures, QA engineers structuring test automation hierarchies, and product managers specifying multi-agent feature workflows.


The Token Economics of Delegation

Before building intuition, establish the math. Delegating to a sub-agent incurs these token costs that you would not pay if you handled the work in the main context:

Delegation overhead:
  Sub-agent system prompt:      300–800 tokens (paid on every sub-agent call)
  Sub-agent tool schemas:       500–3,000 tokens (paid on every sub-agent call)
  Context payload (handoff):    varies — the information the sub-agent needs
  Result parsing + routing:     100–300 tokens (orchestrator processes the result)
  ─────────────────────────────────────────────────────────
  Minimum delegation cost:      ~900–4,100 tokens (plus the handoff payload)

Compare: handling the same work in the main context adds approximately:

In-context handling:
  Tool call JSON:               50–200 tokens (the call itself)
  Tool result injection:        varies — same as the handoff payload
  ─────────────────────────────────────────────────────────
  Minimum in-context cost:      ~50–200 tokens per tool call

The crossover point depends on how long the main context needs to carry the tool result:

If the main context runs for N more iterations after the tool result is needed:
  In-context cost = result_tokens × N (the result stays in the main context for all N iterations)
  Delegation cost = delegation_overhead + result_tokens × 1 (the sub-agent processes the result once in its own context, which then terminates; the result never enters the main context)

Delegation wins when:
  delegation_overhead < result_tokens × (N - 1)

For a tool result of 2,000 tokens and delegation overhead of 2,000 tokens, delegation breaks even at N=2 (the result stays in context for 2 or more future iterations). For N=5 or more, delegation saves 6,000+ tokens.

This is the core insight: delegation is worth it when the work would otherwise pollute the main context for many future iterations.

Tip: Before designing any delegation in your agentic system, estimate N — how many iterations the main agent would continue to run after completing the delegated work. If N < 3, in-context handling is usually cheaper. If N > 5, delegation almost always wins. For N between 3 and 5, factor in the quality benefits of specialization to make the call.
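The break-even arithmetic above is simple enough to wrap in a helper. A minimal sketch — the function names are illustrative, and the inputs are the estimates you would supply at design time:

```python
def delegation_savings(result_tokens: int, delegation_overhead: int,
                       n_iterations: int) -> int:
    """Net tokens saved by delegating: in-context handling pays the result
    N times, delegation pays the overhead plus one pass over the result
    in the sub-agent's own (terminated) context."""
    return result_tokens * (n_iterations - 1) - delegation_overhead

def should_delegate(result_tokens: int, delegation_overhead: int,
                    n_iterations: int) -> bool:
    """Delegate only when the projected savings are strictly positive."""
    return delegation_savings(result_tokens, delegation_overhead, n_iterations) > 0

# The worked example from the text: 2,000-token result, 2,000-token overhead.
print(delegation_savings(2000, 2000, 5))   # saves 6,000 tokens at N=5
print(should_delegate(2000, 2000, 2))      # N=2 is exactly break-even
```

Running the estimate once per proposed sub-agent during design review is usually enough; the point is the comparison, not token-level precision.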


The Four Delegation Patterns

Pattern 1: Isolated Analysis (Delegate and Summarize)

Best for: Tasks that require deep analysis of specific content that does not need to stay in the main context.

Main Agent                          Analysis Sub-Agent
    │                                       │
    │── "Analyze file X for issues" ───────>│
    │   [full file content as payload]      │
    │                                       │── read + analyze
    │                                       │── produce structured findings
    │<── findings_summary.json ────────────│
    │   [compact: 200 tokens]              │   [context terminated]
    │                                       
    │  (file content NOT in main context)
    │  (main agent only has the 200-token summary)

Implementation in LangGraph:

def delegate_to_analysis_agent(state: OrchestratorState) -> OrchestratorState:
    """Delegate analysis work that would bloat the main context.

    Assumes `analysis_subgraph` is a compiled LangGraph subgraph and
    `OrchestratorState` is the orchestrator's state TypedDict.
    """
    for file_item in state["files_to_analyze"]:
        # Sub-agent runs in isolation — its context does not contaminate ours
        findings = analysis_subgraph.invoke({
            "file_content": file_item["content"],  # Full content goes to sub-agent
            "analysis_criteria": state["criteria"]
        })

        # Only the structured summary comes back to main context
        state["analysis_results"].append({
            "file": file_item["path"],
            "findings": findings["structured_findings"],  # ~200 tokens
            # NOT: findings["full_analysis_text"]  # Would be ~2000 tokens
        })

    # Remove files from state — we no longer need them in main context
    state["files_to_analyze"] = []
    return state

Pattern 2: Specialized Execution (Delegate and Forget)

Best for: Tasks that require specialized expertise and produce a simple pass/fail or artifact result.

Main Agent                          Specialist Sub-Agent
    │                                       │
    │── "Write unit tests for module X" ───>│
    │   [module source, test conventions]   │
    │                                       │── generate tests
    │                                       │── run tests
    │                                       │── iterate until passing
    │<── {"status": "done",                │
    │     "test_file": "tests/test_x.py",  │   [context terminated]
    │     "coverage": 0.87}                │
    │   [~50 tokens]                        
    │
    │  (all test generation iterations NOT in main context)

The key characteristic of this pattern: the main agent receives only the final artifact location and status, not the content of the artifact.
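That contract can be made explicit in code. A minimal sketch of the compaction step, assuming hypothetical field names (`artifact_path`, `full_history`) on the specialist's raw output:

```python
from typing import TypedDict

class SpecialistResult(TypedDict):
    """The full output contract: status and an artifact reference, ~50 tokens."""
    status: str       # "done" or "failed"
    test_file: str    # path to the artifact, never its content
    coverage: float

def compact_result(raw: dict) -> SpecialistResult:
    """Reduce a specialist's raw output to the contract the main agent
    receives; everything else is dropped with the sub-agent's context."""
    return SpecialistResult(
        status=raw["status"],
        test_file=raw["artifact_path"],
        coverage=round(raw["coverage"], 2),
    )

# Example: a raw result that also carried thousands of tokens of iteration history.
raw = {"status": "done", "artifact_path": "tests/test_x.py",
       "coverage": 0.8712, "full_history": "...thousands of tokens..."}
print(compact_result(raw))  # only the three contract fields survive
```

Writing this compaction as an explicit function, rather than trusting the sub-agent's prompt to keep its answer short, guarantees the contract holds even when the model is verbose.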

Pattern 3: Parallel Specialization (Fan-Out / Fan-In)

Best for: Independent subtasks that can run simultaneously.

                    Orchestrator
                         │
          ┌──────────────┼──────────────┐
          │              │              │
     Sub-Agent A    Sub-Agent B    Sub-Agent C
     (backend)       (frontend)    (database)
          │              │              │
          └──────────────┼──────────────┘
                         │
                   Orchestrator
                   receives 3 compact summaries
                   aggregates into final output

Implementation:

import asyncio

async def parallel_specialist_execution(
    orchestrator_state: dict,
    specialist_configs: list[dict]
) -> list[dict]:
    """Fan out to multiple specialists, collect compact summaries.

    Assumes `specialist_subgraph` is a compiled LangGraph subgraph.
    """

    async def run_specialist(config: dict) -> dict:
        result = await specialist_subgraph.ainvoke({
            "task": config["task"],
            "scoped_context": config["context"]  # Only what this specialist needs
        })
        return {
            "specialist": config["role"],
            "summary": result["summary"],      # Compact
            "artifact_path": result["artifact_path"]  # Reference, not content
        }

    # Run all specialists in parallel
    summaries = await asyncio.gather(*[
        run_specialist(config) for config in specialist_configs
    ])
    return list(summaries)

Pattern 4: Iterative Delegation (Recursive Decomposition)

Best for: Tasks that, when executed, reveal further subtasks that also benefit from delegation.

Orchestrator: "Refactor the authentication system"
  │
  ├── Sub-Agent: "Analyze auth system"
  │     → Returns: list of components to refactor
  │
  ├── Sub-Agent: "Refactor JWTHandler"
  │     ├── Sub-Sub-Agent: "Generate unit tests for JWTHandler"
  │     └── Returns: {"status": "done", "changes": [...]}
  │
  ├── Sub-Agent: "Refactor PasswordHasher"
  │     ├── Sub-Sub-Agent: "Generate unit tests for PasswordHasher"  
  │     └── Returns: {"status": "done", "changes": [...]}
  │
  └── Orchestrator: "Compile summary of all changes"
        Receives: 3 compact summaries (not the full code changes)

Tip: In recursive delegation (Pattern 4), set a maximum delegation depth and enforce it. Unbounded recursion in agentic systems can produce very deep call trees where each level's overhead compounds. A depth limit of 3–4 levels is appropriate for most real-world tasks. Pass the current depth as part of the context payload and have each sub-agent respect it: "You are at delegation depth 2. Solve this problem in-context rather than delegating further."
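A sketch of that depth guard, assuming the handoff is a plain dict (the field names here are illustrative, not a framework API):

```python
MAX_DELEGATION_DEPTH = 3  # the 3-4 level rule of thumb; tune per system

def build_handoff(task: str, payload: dict, current_depth: int) -> dict:
    """Attach the delegation depth to every handoff so each sub-agent
    knows whether it may delegate further."""
    if current_depth >= MAX_DELEGATION_DEPTH:
        raise RuntimeError(
            f"Delegation depth {current_depth} is at limit "
            f"{MAX_DELEGATION_DEPTH}; solve in-context instead."
        )
    return {
        "task": task,
        "payload": payload,
        "depth": current_depth + 1,
        # Leaf agents see may_delegate=False and must work in-context.
        "may_delegate": current_depth + 1 < MAX_DELEGATION_DEPTH,
    }
```

Enforcing the limit in the orchestration code, not only in the prompt, matters: a prompt instruction can be ignored under pressure, while a raised error cannot.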


Signals That Delegation Is the Right Choice

Use these signals to decide when to delegate:

Delegate when:

  1. The work requires reading large content that should not persist in the main context. If an agent needs to read a 500-line file, analyze it, and return a 10-line summary — delegate. The main context never needs to see the 500 lines.

  2. The work is a well-defined, bounded subtask with a clear output contract. If you can write a precise output schema before writing the sub-agent's system prompt, the task is well-enough bounded for delegation.

  3. The work would occupy multiple iterations of the main loop. If handling the work in-context would require 5+ planning-execution cycles, a sub-agent with a specialized prompt will be more efficient (fewer tokens per iteration due to a more focused context).

  4. The work is parallelizable and independent. Fan-out to parallel sub-agents captures this opportunity; the main agent cannot parallelize within its own sequential loop.

  5. The work requires different tool access than the main agent has or needs. Rather than giving the main agent tools it uses rarely (adding to its tool schema tax), delegate tool-specific work to sub-agents with exactly those tools.

Handle in-context when:

  1. The result must be reasoned about immediately in the same context. If the orchestrator needs to tightly interleave analysis and decision-making on the same data, sub-agent latency and handoff overhead break the reasoning flow.

  2. The task is trivially short (< 2–3 tool calls). The delegation overhead exceeds the context savings.

  3. The task requires the full history of the main session. If the sub-agent would need to receive the entire main context to function, delegation provides no context benefit.

  4. The system is already at maximum delegation depth.

Tip: Create a delegation decision checklist and apply it to every proposed sub-agent during design reviews. Teams that evaluate delegation against concrete criteria produce leaner, more efficient agent architectures than teams that default to "more specialization is better." A useful one-line test: "If I removed this sub-agent and put its work back in the main loop, would the main context grow by more than 2,000 tokens that would stay there for more than 3 iterations?" If yes, delegate. If no, handle in-context.
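The one-line test reduces to two thresholds, which can be encoded directly. A sketch — the 2,000-token and 3-iteration cutoffs are this article's rules of thumb, not universal constants:

```python
def removal_test(context_growth_tokens: int, iterations_retained: int) -> bool:
    """The checklist's one-line test: if removing the proposed sub-agent
    would grow the main context by more than 2,000 tokens that stay there
    for more than 3 iterations, delegate; otherwise handle in-context."""
    return context_growth_tokens > 2000 and iterations_retained > 3

print(removal_test(5000, 6))   # True: delegate
print(removal_test(300, 10))   # False: cheap result, handle in-context
```

Checking both conditions matters: a huge result used once, or a tiny result carried forever, both fail the test and stay in the main loop.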


Orchestrator Design: The Thin Orchestrator Principle

An orchestrator that accumulates context from all its sub-agents becomes the bottleneck — its context grows as large as the union of all its sub-agents' work. The Thin Orchestrator Principle counteracts this:

The orchestrator's job is routing, not accumulation. It should know where work happened, not what was in the work.

class ThickOrchestrator:
    def run(self, goal: str):
        state = {"goal": goal, "all_results": []}

        for subtask in create_plan(goal):
            result = run_subtask(subtask)
            state["all_results"].append(result.full_content)  # PROBLEM: accumulates everything

        # By end, state["all_results"] has every line of every subtask output
        return synthesize(state["all_results"])  # Massive context

class ThinOrchestrator:
    def run(self, goal: str):
        plan = create_plan(goal)
        artifact_registry = {}  # Maps task IDs to artifact LOCATIONS, not content

        for subtask in plan:
            result = run_subtask(subtask)
            # Store only location + compact metadata
            artifact_registry[subtask.id] = {
                "artifact_path": result.artifact_path,
                "summary": result.summary,    # 1-2 sentences
                "status": result.status
            }

        # Final synthesis receives only the compact registry
        return synthesize_from_registry(goal, artifact_registry)

The thin orchestrator's context contains only:
- The original goal
- A compact plan (task IDs and one-line descriptions)
- A registry of completed tasks with locations and summaries
- No raw content from sub-agents

Tip: Implement artifact storage as a first-class component of your orchestration architecture. When a sub-agent produces output (a file, a report, a test suite), it writes to a named artifact store. The orchestrator receives a reference (path or ID), not the content. The synthesis step retrieves content from the artifact store only for the specific artifacts it needs. This "pointer-not-copy" pattern keeps orchestrator context O(number of tasks) rather than O(total content produced).
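A minimal file-backed sketch of such a store, assuming content fits in plain text files (a real system might use a database or object storage):

```python
import uuid
from pathlib import Path

class ArtifactStore:
    """Pointer-not-copy store: sub-agents write content once, the
    orchestrator carries only the returned ID."""

    def __init__(self, root: str = "artifacts"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def put(self, content: str) -> str:
        """Persist content; return an opaque ID for the orchestrator's registry."""
        artifact_id = uuid.uuid4().hex
        (self.root / artifact_id).write_text(content)
        return artifact_id

    def get(self, artifact_id: str) -> str:
        """Synthesis retrieves content on demand, only for artifacts it needs."""
        return (self.root / artifact_id).read_text()
```

The orchestrator's registry entry then holds the ID plus a one-line summary; its context stays O(number of tasks) no matter how much content the sub-agents produce.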


CrewAI: Hierarchical Process for Delegation Control

CrewAI's hierarchical process is built for controlled delegation:

from crewai import Agent, Task, Crew, Process

manager = Agent(
    role="Project Manager",
    goal="Coordinate specialists and compile final deliverable",
    backstory="Delegates to specialists and compiles their structured outputs.",
    allow_delegation=True,
    verbose=False,
    max_iter=10  # Manager stays lean
)

security_specialist = Agent(
    role="Security Analyst",
    goal="Identify and describe security vulnerabilities in code",
    backstory="Security expert. Returns structured JSON findings.",
    allow_delegation=False,  # Specialists don't further delegate
    max_iter=8,
    verbose=False
)

performance_specialist = Agent(
    role="Performance Analyst",
    goal="Identify performance bottlenecks and optimization opportunities",
    backstory="Performance expert. Returns structured JSON findings.",
    allow_delegation=False,
    max_iter=8,
    verbose=False
)

management_task = Task(
    description="""
    Coordinate a code review of the provided codebase.
    Delegate security analysis to the Security Analyst.
    Delegate performance analysis to the Performance Analyst.
    Compile their structured JSON outputs into a final report.
    Do NOT rewrite or paraphrase their outputs — compile them directly.
    """,
    agent=manager,
    expected_output="Combined JSON report with security and performance sections"
)

crew = Crew(
    agents=[security_specialist, performance_specialist],
    tasks=[management_task],
    process=Process.hierarchical,
    manager_agent=manager,  # the manager is passed here, not in the agents list
    verbose=False
)

Tip: In CrewAI hierarchical processes, the allow_delegation=False setting on specialist agents is critical for token efficiency. Without it, specialists can delegate back to other specialists, creating chains of delegation that multiply overhead. Set allow_delegation=False on all leaf agents in your hierarchy — only the manager(s) should be able to delegate.


Monitoring Orchestration Efficiency

Track these metrics to know if your orchestration design is working:

from dataclasses import dataclass, field

@dataclass
class OrchestrationMetrics:
    main_agent_tokens: int = 0
    sub_agent_tokens: dict = field(default_factory=dict)
    delegation_overhead_tokens: int = 0
    handoff_compression_ratio: float = 0.0

    @property
    def delegation_efficiency(self) -> float:
        """Ratio of useful tokens (sub-agent work) to overhead tokens."""
        total_sub = sum(self.sub_agent_tokens.values())
        total_overhead = self.delegation_overhead_tokens
        if total_overhead == 0:
            return 1.0
        return total_sub / (total_sub + total_overhead)

    def report(self):
        print(f"Main agent tokens: {self.main_agent_tokens:,}")
        for agent, tokens in self.sub_agent_tokens.items():
            print(f"  {agent}: {tokens:,} tokens")
        print(f"Delegation overhead: {self.delegation_overhead_tokens:,}")
        print(f"Handoff compression ratio: {self.handoff_compression_ratio:.2f}x")
        print(f"Delegation efficiency: {self.delegation_efficiency:.1%}")

A healthy delegation efficiency is above 0.75 (overhead is less than 25% of total delegation cost). If efficiency drops below 0.5, you have too much delegation overhead relative to the work being delegated — consolidate some sub-agents.

Tip: The handoff compression ratio metric (size of raw sub-agent output divided by size of the summary returned to the orchestrator) is the best single indicator of whether your thin orchestrator principle is being upheld. A ratio of 5x or higher means sub-agents are returning well-compressed summaries. A ratio of 1x means the orchestrator is receiving full outputs and your context savings from delegation are minimal.
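As a sketch, the ratio is a one-line computation over token counts you already track per handoff:

```python
def handoff_compression_ratio(raw_output_tokens: int, summary_tokens: int) -> float:
    """Raw sub-agent output size divided by the summary size returned to
    the orchestrator. 5x+ is healthy; ~1x means the orchestrator is
    receiving full outputs and the thin orchestrator principle is broken."""
    if summary_tokens <= 0:
        raise ValueError("summary_tokens must be positive")
    return raw_output_tokens / summary_tokens

print(handoff_compression_ratio(2000, 200))  # 10.0: well-compressed handoff
```

Averaging this ratio across all handoffs in a run, and alerting when it drops toward 1x, catches regressions where a sub-agent's prompt change quietly starts returning full content.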


Summary

The delegation decision is a token economics question: does the fixed overhead of creating a sub-agent (system prompt, tool schemas, handoff payload) produce enough context savings in the main agent to justify the cost? The crossover point is approximately 3–5 iterations of the main agent after the delegated work would have completed.

The four delegation patterns (Isolated Analysis, Specialized Execution, Parallel Specialization, Recursive Decomposition) cover the full range of multi-agent orchestration scenarios. The Thin Orchestrator Principle — the orchestrator knows where work happened, not what was in it — is the key design principle that keeps orchestration overhead bounded as system complexity grows.

Framework-specific guidance for LangGraph (subgraphs), CrewAI (hierarchical process with allow_delegation=False on specialists), and AutoGen (targeted two-agent conversations rather than long group chats) gives you concrete implementation paths for each approach.