
Every round-trip between your agent and the model API has a cost: latency, tokens for the accumulated conversation history, and the overhead of sending the system prompt and tool schemas again. When an agent makes many sequential tool calls — each waiting for the previous result before deciding what to call next — these costs multiply rapidly.

Batching and chaining are two complementary techniques that compress multi-step tool-use workflows. Batching allows a single model response to trigger multiple parallel tool executions. Chaining allows you to pre-sequence tool calls without requiring a new model round-trip for each step. Used together, they can reduce a 10-round-trip workflow to 2–3 round-trips, cutting latency by 60–80% and token costs proportionally.

This topic covers both techniques in depth, with implementation patterns for software engineers, strategies for QA automation workflows, and guidance for product managers overseeing agent performance.


Understanding Round-Trip Cost

To make the case for optimization concrete, consider the token cost of a naive sequential agent that investigates a failing test:

Turn 1: User asks: "Why is the test test_checkout_flow failing?"
- Input tokens: system prompt (500) + user message (15) + tools (600) = 1,115 tokens

Turn 2: Model calls run_tests(pattern="test_checkout_flow")
- Input tokens: system prompt (500) + conversation history (200) + tool result (400) + tools (600) = 1,700 tokens

Turn 3: Model calls read_file("src/checkout.py")
- Input tokens: system prompt (500) + history (600) + new tool result (800) + tools (600) = 2,500 tokens

Turn 4: Model calls search_codebase("CheckoutService")
- Input tokens: system prompt (500) + history (1,400) + new tool result (600) + tools (600) = 3,100 tokens

Turn 5: Model calls read_file("src/payment.py")
- Input tokens: 500 + 2,000 + 500 + 600 = 3,600 tokens

Total input tokens across 5 turns: 12,015

The same investigation, done efficiently with batching, completes in 2–3 turns for roughly 4,000–5,000 total input tokens. The 60% reduction comes from two sources: fewer re-transmissions of the growing conversation history, and the ability to run multiple tools in parallel rather than sequentially.

Tip: Instrument your agent to log per-turn token counts. Most teams are surprised to discover that token costs in turn 5 or 6 are 3–4x higher than turn 1, simply due to the growing conversation history carrying all previous tool results. This compounding effect is the core motivation for batching — fewer turns means a flatter cost curve.
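
A minimal sketch of that instrumentation, assuming the Anthropic SDK (whose responses expose usage.input_tokens and usage.output_tokens); the wrapper name create_with_logging is illustrative:

import anthropic

client = anthropic.AsyncAnthropic()

async def create_with_logging(turn_number: int, **kwargs):
    """Make one model call and log its per-turn token usage (sketch; Anthropic SDK assumed)."""
    response = await client.messages.create(**kwargs)
    print(
        f"turn={turn_number} "
        f"input_tokens={response.usage.input_tokens} "
        f"output_tokens={response.usage.output_tokens}"
    )
    return response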


Parallel Tool Calls — The Foundation of Batching

Modern AI APIs natively support returning multiple tool calls in a single model response. The model can say "I need to call tool A and tool B at the same time because they are independent." Your agent executes them concurrently and returns both results in a single turn.

OpenAI parallel function calling:

import openai
import asyncio
import json

client = openai.AsyncOpenAI()

async def execute_tool(tool_name: str, tool_args: dict) -> str:
    """Execute a single tool and return the result."""
    # Your tool execution logic here
    if tool_name == "read_file":
        return read_file(tool_args["path"])
    elif tool_name == "run_tests":
        return run_tests(tool_args.get("pattern"))
    elif tool_name == "git_status":
        return get_git_status()
    return f"Unknown tool: {tool_name}"

async def run_agent_turn(messages: list, tools: list) -> tuple[str | None, list]:
    """Run one agent turn, executing all tool calls in parallel."""

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )

    choice = response.choices[0]

    if choice.finish_reason != "tool_calls":
        # No tool calls — agent is done
        return choice.message.content, messages

    # Extract all tool calls from this response
    tool_calls = choice.message.tool_calls
    print(f"Model requested {len(tool_calls)} tool call(s) in parallel")

    # Execute all tool calls concurrently
    tasks = [
        execute_tool(tc.function.name, json.loads(tc.function.arguments))
        for tc in tool_calls
    ]
    results = await asyncio.gather(*tasks)

    # Add the assistant message with tool calls
    new_messages = messages + [choice.message]

    # Add all tool results in a single batch
    for tc, result in zip(tool_calls, results):
        new_messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": result
        })

    return None, new_messages  # Continue the agent loop


async def run_agent(user_message: str, tools: list) -> str:
    """Full agent loop with parallel tool execution."""
    messages = [{"role": "user", "content": user_message}]

    turn_count = 0
    while turn_count < 10:  # Safety limit
        result, messages = await run_agent_turn(messages, tools)
        turn_count += 1

        if result is not None:
            return result

    return "Max turns reached"

Anthropic parallel tool use:

import anthropic
import asyncio

client = anthropic.AsyncAnthropic()

async def run_claude_agent_turn(messages: list, tools: list) -> tuple:
    response = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )

    if response.stop_reason != "tool_use":
        # Find the text response
        text = next(
            (block.text for block in response.content if block.type == "text"),
            ""
        )
        return text, messages

    # Extract all tool use blocks
    tool_uses = [block for block in response.content if block.type == "tool_use"]
    print(f"Executing {len(tool_uses)} tools in parallel")

    # Execute concurrently
    tasks = [execute_tool(tu.name, tu.input) for tu in tool_uses]
    results = await asyncio.gather(*tasks)

    # Build the next message with all tool results
    new_messages = messages + [
        {"role": "assistant", "content": response.content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": result
                }
                for tu, result in zip(tool_uses, results)
            ]
        }
    ]

    return None, new_messages

Tip: Guide the model toward parallel tool use with a system prompt instruction. Adding "When multiple pieces of information are needed that are independent of each other, request all of them in a single response rather than making sequential calls" significantly increases the rate at which the model batches tool calls. Without this instruction, many models default to sequential tool calls even when parallelism is safe and beneficial.


Enabling Batching Through Prompt Engineering

The model's willingness to batch tool calls depends heavily on how you frame the task. There are specific prompting patterns that encourage batching:

Pattern 1: Explicit permission for parallel execution

System: You are a code analysis agent. When analyzing code, gather all required information 
in parallel. You may call multiple tools simultaneously when the results are independent. 
Do not wait for one tool's result before calling another if you can identify the calls 
you will need upfront.

Pattern 2: Task decomposition first

System: Before making any tool calls, reason about which tools you will need and whether 
any can be called in parallel. State your plan, then execute all independent calls together.

Pattern 3: Explicit batching request in the user message

user_message = """
Analyze the failing test `test_checkout_flow`. 
In your first response, gather all the information you need simultaneously:
- The test file itself
- The source files it imports  
- Recent git changes to those files
- The current test failure output

Then analyze everything together.
"""

Pattern 4: Structured output to pre-identify tool needs

Ask the model to output a "tool call plan" as structured JSON before executing:

PLANNING_PROMPT = """Before making tool calls, output a JSON plan:
{
  "parallel_groups": [
    ["tool1(args)", "tool2(args)"],  // These can run in parallel
    ["tool3(args)"]                   // This must wait for group 1
  ]
}
Then execute the plan."""
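
A minimal sketch of executing such a plan could look like the following. It assumes the plan has already been parsed so that each entry is a (tool_name, args) pair rather than the "tool1(args)" strings shown in the prompt, and it reuses the execute_tool helper from earlier: groups run in order, and the calls inside each group run concurrently.

import asyncio

async def execute_plan(parallel_groups: list[list[tuple[str, dict]]]) -> list:
    """Run a parsed tool-call plan: groups in order, calls within a group concurrently.

    Assumes the model's plan has been parsed into (tool_name, args) tuples first.
    """
    all_results = []
    for group in parallel_groups:
        # Calls inside one group are independent of each other, so gather them
        results = await asyncio.gather(*[
            execute_tool(name, args) for name, args in group
        ])
        all_results.extend(results)
    return all_results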

Tip: Test your system prompt's parallel tool calling rate empirically. Run 20–30 representative tasks and count how many tool calls happen in parallel vs. sequentially. A well-tuned system prompt should achieve 60–80% parallel execution for tasks that involve reading multiple independent resources. If you are below 40%, revise the system prompt language — the model is capable of much more parallelism than it shows by default.
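
One way to run that measurement is a small harness that sends each representative task once and counts how many tool calls the first response batches. The sketch below looks only at the first turn and assumes the Anthropic client and tool definitions used earlier; measure_first_turn_batching and task_prompts are illustrative names.

async def measure_first_turn_batching(task_prompts: list[str], system_prompt: str, tools: list) -> float:
    """Rough measure of how often tool calls are batched on the first turn (illustrative)."""
    parallel_calls = 0
    total_calls = 0
    for task in task_prompts:
        response = await client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=[{"role": "user", "content": task}]
        )
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        total_calls += len(tool_uses)
        if len(tool_uses) > 1:
            parallel_calls += len(tool_uses)
    return parallel_calls / total_calls if total_calls else 0.0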


Tool Call Chaining — Pre-Sequencing Workflows

Chaining is different from batching: rather than letting the model choose what to call next, you pre-define a sequence of tool calls in your agent code and execute them without a model round-trip between steps.

This is appropriate for well-understood, deterministic workflows where the sequence of tool calls is known in advance. Think of it as the difference between a fully autonomous agent (batching) and a semi-automated workflow (chaining).

Example: PR preparation workflow

from dataclasses import dataclass
from typing import Callable, Any

@dataclass
class ChainStep:
    tool_name: str
    get_input: Callable[[dict], dict]  # Function that builds input from accumulated context
    result_key: str  # Key to store result in context

class ToolChain:
    def __init__(self, steps: list[ChainStep], tools: dict):
        self.steps = steps
        self.tools = tools

    async def execute(self, initial_context: dict) -> dict:
        """Execute all steps in sequence, passing results between steps."""
        context = initial_context.copy()

        for step in self.steps:
            tool_input = step.get_input(context)
            result = await execute_tool(step.tool_name, tool_input)
            context[step.result_key] = result
            print(f"Completed: {step.tool_name} -> {step.result_key}")

        return context

PR_PREP_CHAIN = ToolChain(
    steps=[
        # Step 1: Get changed files
        ChainStep(
            tool_name="git_diff",
            get_input=lambda ctx: {"staged": True},
            result_key="diff"
        ),
        # Step 2: Get git log for context
        ChainStep(
            tool_name="git_log",
            get_input=lambda ctx: {"n": 5},
            result_key="recent_commits"
        ),
        # Step 3: Run tests to verify nothing is broken
        ChainStep(
            tool_name="run_tests",
            get_input=lambda ctx: {},
            result_key="test_results"
        ),
        # Step 4: Check code style
        ChainStep(
            tool_name="lint_check",
            get_input=lambda ctx: {"files": extract_changed_files(ctx["diff"])},
            result_key="lint_results"
        ),
    ],
    tools={}
)

async def prepare_pr(branch_name: str) -> str:
    """Chain tool calls to gather all PR-relevant data, then make ONE model call."""

    # Execute the chain without model involvement
    context = await PR_PREP_CHAIN.execute({"branch": branch_name})

    # Make a SINGLE model call with all gathered data
    prompt = f"""Based on the following gathered information, write a PR description:

Git Diff:
{context['diff']}

Recent Commits:
{context['recent_commits']}

Test Results:
{context['test_results']}

Lint Results:
{context['lint_results']}

Write a clear, professional PR title and description."""

    response = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

This approach executes 4 tools with 0 model round-trips during data gathering, then makes 1 final model call with all the data assembled. Compare to a naive agent that would make 4+ round-trips (1 per tool call) plus the final synthesis.

Tip: Use tool chains for any workflow where you can predict the tool sequence from the initial request. PR creation, test generation, deployment checks, and documentation updates are all excellent candidates. The key indicator that a workflow is chain-suitable is that each step's input depends only on the original request context or on clearly predictable transformations of previous steps' outputs — not on model reasoning about what the previous result means.


Hybrid Approach: Batch-Then-Chain

The most sophisticated pattern combines batching and chaining: use the model to batch-gather initial information, then process the results through deterministic chains without additional model calls.

async def hybrid_bug_investigation(bug_report: str) -> str:
    """
    Phase 1: Model identifies what to gather (1 model call, parallel tools)
    Phase 2: Deterministic processing chain (0 model calls)
    Phase 3: Model synthesizes findings (1 model call)
    """

    # Phase 1: Model decides what to gather, executes in parallel
    _, gather_messages = await run_claude_agent_turn(
        messages=[{
            "role": "user",
            "content": f"Bug report: {bug_report}\n\nGather all information needed to diagnose this bug. Request everything in parallel."
        }],
        tools=DIAGNOSTIC_TOOLS
    )

    # run_claude_agent_turn returns (text, messages); the tool results live in the updated messages
    gathered_data = extract_tool_results(gather_messages)

    # Phase 2: Deterministic processing — no model calls needed
    # Parse stack traces, extract file references, check git blame, etc.
    affected_files = extract_file_references(gathered_data)
    processed_data = {
        "affected_files": affected_files,
        "recent_changes": await chain_git_blame(affected_files),
        "related_tests": find_related_tests(affected_files),
        "error_patterns": parse_error_patterns(gathered_data)
    }

    # Phase 3: Single model call for synthesis
    synthesis_prompt = format_synthesis_prompt(bug_report, gathered_data, processed_data)

    final_response = await client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )

    return final_response.content[0].text

Tip: When designing hybrid workflows, identify the "reasoning bottleneck" — the step that genuinely requires model intelligence — versus the "data gathering steps" that are mechanical. Use the model only where reasoning is required; use deterministic code for everything else. In most investigative workflows (bug analysis, PR review, test generation), data gathering is mechanical and synthesis is the reasoning step. Confining model involvement to synthesis dramatically reduces round-trips and cost.


Implementing Round-Trip Budgets

A practical tool for enforcing batching discipline is a round-trip budget: a hard limit on the number of model API calls allowed per user request. This forces your agent design to be efficient from the start.

class BudgetedAgent:
    def __init__(self, max_turns: int = 5, max_tool_calls_per_turn: int = 5):
        self.max_turns = max_turns
        self.max_tool_calls = max_tool_calls_per_turn
        self.turn_count = 0
        self.tool_call_count = 0

    def can_continue(self) -> bool:
        return self.turn_count < self.max_turns

    async def run(self, user_message: str, tools: list) -> str:
        messages = [{"role": "user", "content": user_message}]

        while self.can_continue():
            self.turn_count += 1

            response = await client.messages.create(
                model="claude-opus-4-5",
                max_tokens=4096,
                tools=tools,
                messages=messages
            )

            if response.stop_reason != "tool_use":
                return response.content[0].text

            tool_uses = [b for b in response.content if b.type == "tool_use"]
            self.tool_call_count += len(tool_uses)

            # Log batching efficiency and flag turns that exceed the per-turn cap
            if len(tool_uses) > self.max_tool_calls:
                print(f"Turn {self.turn_count}: {len(tool_uses)} tools exceeds cap of {self.max_tool_calls}")
            elif len(tool_uses) > 1:
                print(f"Turn {self.turn_count}: {len(tool_uses)} tools in parallel (good)")
            else:
                print(f"Turn {self.turn_count}: 1 tool call (consider batching)")

            # Execute all tools
            results = await asyncio.gather(*[
                execute_tool(tu.name, tu.input) for tu in tool_uses
            ])

            # Update messages
            messages = messages + [
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [
                    {"type": "tool_result", "tool_use_id": tu.id, "content": r}
                    for tu, r in zip(tool_uses, results)
                ]}
            ]

        return f"Budget exceeded after {self.turn_count} turns"

Tip: Set your round-trip budget based on empirical data from your agent's most common tasks, not on theoretical maximums. If 90% of successful task completions happen in 3 turns or fewer, set your budget at 5 turns. Tasks requiring more than 5 turns are usually candidates for workflow restructuring — either through better batching or through pre-built tool chains — rather than simply raising the budget.
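
If you log turn counts for completed sessions, one way to derive that budget is to take a high percentile of the historical counts and add a small buffer. A sketch, with suggest_turn_budget as a hypothetical helper:

import math

def suggest_turn_budget(turn_counts: list[int], percentile: float = 0.9, buffer: int = 2) -> int:
    """Suggest a round-trip budget from logged turn counts of successful sessions (illustrative)."""
    if not turn_counts:
        return buffer
    ordered = sorted(turn_counts)
    index = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    # e.g. if 90% of sessions finish within 3 turns, this suggests a budget of 5
    return ordered[index] + buffer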


Measuring Batching Efficiency

Track these metrics to understand and improve your agent's batching behavior:

@dataclass
class AgentSessionMetrics:
    total_turns: int
    total_tool_calls: int
    parallel_tool_calls: int  # Tool calls made in groups > 1
    sequential_tool_calls: int  # Tool calls made one at a time
    avg_tools_per_turn: float
    total_input_tokens: int
    total_output_tokens: int

    @property
    def batching_rate(self) -> float:
        """Percentage of tool calls that were batched in parallel groups."""
        if self.total_tool_calls == 0:
            return 0.0
        return self.parallel_tool_calls / self.total_tool_calls

    @property
    def efficiency_score(self) -> float:
        """Higher is better: tools per turn * batching rate."""
        return self.avg_tools_per_turn * self.batching_rate

def create_session_report(session: AgentSessionMetrics) -> str:
    return f"""
Agent Session Metrics:
- Total API calls: {session.total_turns}
- Total tool calls: {session.total_tool_calls}
- Batching rate: {session.batching_rate:.1%}
- Avg tools/turn: {session.avg_tools_per_turn:.1f}
- Efficiency score: {session.efficiency_score:.2f}
- Total tokens: {session.total_input_tokens + session.total_output_tokens:,}
"""

Tip: Batching rate below 30% is a signal that your system prompt is not effectively encouraging parallel execution. Batching rate above 70% with more than 3 tools per turn is excellent. Average tools per turn is a leading indicator for session cost: agents averaging 3+ tools per turn complete tasks in 2–3 turns; agents averaging 1 tool per turn take 6–9 turns for the same tasks, at roughly 3x the token cost.


Summary

Batching and chaining are architectural-level optimizations for tool-use efficiency. They attack the fundamental inefficiency of sequential round-trips: every extra round-trip re-transmits the growing conversation history, re-sends tool schemas, and adds API latency. Batching — using the model's native parallel tool call capability — reduces turns by 50–70% for tasks involving multiple information-gathering steps. Chaining — pre-sequencing deterministic tool calls in your agent code — eliminates model involvement entirely for predictable workflow steps. The hybrid approach combines both, achieving the best results: the model handles reasoning and parallelism, while deterministic chains handle mechanical data transformation between reasoning steps.