
Single-pass summarization — compressing the conversation once at the end — works for short-to-medium sessions. For long agentic sessions lasting hours, spanning dozens of milestones, and accumulating hundreds of turns, you need a different approach: incremental summarization, where the context is continuously distilled into a rolling summary that evolves as the session progresses. The rolling summary is always current, always compact, and always ready to be the foundation for the next turn.

This topic covers the architecture, patterns, and prompting strategies for building a robust incremental summarization system that works across engineering, QA, and product workflows.


The Core Concept: The Living Summary Document

Incremental summarization treats your context not as a linear chat log but as a living document that gets updated rather than replaced. Instead of "compress all history into a summary," the model does "update the existing summary with what just happened."

This distinction is fundamental. A one-shot compression operates on the full history and produces a static artifact. Incremental summarization operates on a delta — the chunk of new turns since the last compression — and merges the delta into the existing summary. The result is a summary that is:

  • Always current: It reflects the most recent state of the session, not a snapshot from turn 40.
  • Always compact: Each compression cycle keeps the summary within a target token budget.
  • Cumulative without growing: Old information is either retained (if still relevant) or overwritten (if superseded), so the summary size is bounded.

The living summary becomes the primary context artifact. Raw conversation history is secondary and is periodically pruned.

Tip: Store your rolling summary as a separate artifact outside the conversation history — a file, a database record, or a structured object in your agent's state. Never let it become just another message in the history array. When it lives in a dedicated field, it is easy to inspect, version, and hand off. When it lives in the history array, it gets buried.


The Architecture of an Incremental Summarization Loop

The incremental loop has four components:

1. The chunk window: New conversation turns accumulate in a "recent turns" buffer. This buffer is separate from the rolling summary. A typical chunk window is 10-20 turns or 15K-25K tokens — large enough that each compression covers meaningful progress, small enough that each compression is fast and cheap.

2. The merge prompt: When the chunk window reaches its threshold, a merge prompt asks the model to integrate the recent turns into the existing summary. The output is a new version of the rolling summary.

3. The pruning step: After merging, the recent turns buffer is cleared (or partially cleared — you may keep the last 2-3 turns for conversational continuity). The rolling summary replaces the history as the primary context.

4. The continuity injection: Every new turn begins with the rolling summary injected into the system prompt or as a prefixed user message. The agent always sees current state, regardless of how long ago earlier decisions were made.

Here is the architecture in code:

from dataclasses import dataclass, field
from anthropic import Anthropic

client = Anthropic()

@dataclass
class IncrementalSessionState:
    rolling_summary: str = ""
    recent_turns: list[dict] = field(default_factory=list)
    turn_count: int = 0
    total_input_tokens: int = 0
    chunk_window_size: int = 15  # compress every 15 turns
    model: str = "claude-sonnet-4-5"
    model_context_limit: int = 200_000

MERGE_PROMPT_TEMPLATE = """You are maintaining a rolling session summary. 

EXISTING SUMMARY:
{existing_summary}

NEW TURNS SINCE LAST UPDATE:
{new_turns}

Update the summary by integrating the new information. Rules:
- Preserve all decisions, constraints, and current work state from the existing summary
- Update sections that have progressed (e.g., if a task is now complete, update "Work Completed")
- Add new decisions and constraints from the new turns
- Update "Do Not Revisit" if any approaches were tried and rejected
- Overwrite superseded information (if a decision was reversed, show the final decision with a note it was revised)
- Keep the summary under {target_tokens} tokens
- Use the same structured format as the existing summary

Output only the updated summary, no preamble."""

def build_context_for_agent(state: IncrementalSessionState) -> list[dict]:
    """Build the messages array the agent sees each turn."""
    context = []

    if state.rolling_summary:
        context.append({
            "role": "user",
            "content": f"[ROLLING SESSION SUMMARY — Current State]\n{state.rolling_summary}\n\n[END SUMMARY]"
        })
        context.append({
            "role": "assistant", 
            "content": "Understood. I have the session context. Ready to continue."
        })

    # Add recent turns (after last compression)
    context.extend(state.recent_turns)
    return context

def compress_incremental(state: IncrementalSessionState) -> None:
    """Merge recent turns into the rolling summary."""

    # Render the buffered turns as text, truncating each to 500 chars
    # to keep the merge prompt itself from growing unbounded
    turns_text = "\n".join([
        f"{t['role'].upper()}: {t['content'][:500]}{'...' if len(t['content']) > 500 else ''}"
        for t in state.recent_turns
    ])

    merge_prompt = MERGE_PROMPT_TEMPLATE.format(
        existing_summary=state.rolling_summary or "(No existing summary — this is the first compression)",
        new_turns=turns_text,
        target_tokens=600  # target summary token size
    )

    response = client.messages.create(
        model=state.model,
        max_tokens=1024,
        messages=[{"role": "user", "content": merge_prompt}]
    )

    state.rolling_summary = response.content[0].text
    # Keep only the last 2 turns for conversational continuity
    state.recent_turns = state.recent_turns[-2:]

def agent_turn(state: IncrementalSessionState, user_message: str) -> str:
    state.recent_turns.append({"role": "user", "content": user_message})
    state.turn_count += 1

    messages = build_context_for_agent(state)

    response = client.messages.create(
        model=state.model,
        max_tokens=4096,
        system="You are an expert software engineering assistant...",
        messages=messages
    )

    agent_reply = response.content[0].text
    state.recent_turns.append({"role": "assistant", "content": agent_reply})
    state.total_input_tokens += response.usage.input_tokens

    # Trigger incremental compression
    if state.turn_count % state.chunk_window_size == 0:
        compress_incremental(state)
        print(f"[Incremental compression complete at turn {state.turn_count}]")

    return agent_reply

Tip: Log the rolling summary to a file after each compression cycle with a timestamp and turn count in the filename (e.g., summary_turn_015.md, summary_turn_030.md). This creates a version history of the session's evolution — invaluable for debugging agent behavior and for post-session review.
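One way to implement this versioned logging, as a minimal sketch (the `summaries/` directory name and the HTML-comment metadata header are illustrative choices, not part of the pattern itself):

```python
from datetime import datetime, timezone
from pathlib import Path

def log_summary_version(summary: str, turn_count: int, log_dir: str = "summaries") -> Path:
    """Persist the rolling summary with the turn count in the filename."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"summary_turn_{turn_count:03d}.md"
    # Stamp each version so post-session review can reconstruct the timeline
    timestamp = datetime.now(timezone.utc).isoformat()
    path.write_text(f"<!-- turn {turn_count} | {timestamp} -->\n{summary}\n")
    return path
```

Calling this at the end of compress_incremental gives you the summary_turn_015.md, summary_turn_030.md series described above.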


The Merge Prompt: Getting the Integration Right

The merge prompt is the most sensitive part of the incremental system. Poor merge prompts produce summaries that silently drop decisions, duplicate information, or grow unbounded. Here are the critical design principles:

Principle 1: Explicit section-by-section instructions. Don't ask the model to "update the summary." Tell it specifically what to do with each section type: preserve decisions, update progress, overwrite superseded information, accumulate discovered issues.

Principle 2: Provide a target token budget. Without a budget, the summary will grow. Specify "keep the summary under X tokens" and the model will make necessary tradeoff decisions.

Principle 3: Handle conflicts explicitly. The prompt must address what to do when a new turn contradicts an existing summary item. The rule: show the current state, note that it was revised, and move on.

Principle 4: Use a consistent structure across cycles. Each compression cycle should produce the same structured format. If the format drifts, summary quality degrades and it becomes harder to programmatically parse sections.
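The parseability promised by Principle 4 can be checked mechanically. A minimal parser sketch, assuming section headers of the form `**Section Name**:` (the format used in the merge prompts in this topic):

```python
import re

def parse_summary_sections(summary: str) -> dict[str, str]:
    """Split a structured summary into {section name: body text},
    assuming headers of the form **Section Name**: on their own line."""
    sections: dict[str, str] = {}
    current = None
    for line in summary.splitlines():
        match = re.match(r"\*\*(.+?)\*\*:\s*(.*)", line)
        if match:
            # New section header: start accumulating its body
            current = match.group(1)
            sections[current] = match.group(2)
        elif current is not None:
            # Continuation line belongs to the current section
            sections[current] += ("\n" if sections[current] else "") + line
    return {k: v.strip() for k, v in sections.items()}
```

Running this after each compression cycle and asserting that the expected section names are present is a cheap guard against format drift.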

A refined merge prompt for an engineering session:

ROLLING SUMMARY UPDATE TASK

You are updating a running technical session log. Treat the existing summary as ground truth for everything it covers. New turns may extend, complete, or revise items in the summary.

EXISTING SUMMARY:
---
{existing_summary}
---

NEW TURNS (turns {start_turn} to {end_turn}):
---
{turns_text}
---

INSTRUCTIONS BY SECTION:

**Objective**: Keep as-is unless explicitly changed.
**Decisions Made**: Append new decisions. If a decision was reversed, update it with "(revised: [new decision])" and note the reason.
**Constraints**: Append new constraints. Remove constraints explicitly lifted.
**Work Completed**: Add newly completed items. Do not remove items already listed.
**Current State**: Replace entirely with the most current state described in the new turns.
**Open Questions**: Add new open questions. Remove questions that were answered (note the answer briefly).
**Do Not Revisit**: Append any newly rejected approaches.

TARGET LENGTH: Under 550 tokens. Prioritize precision over completeness for verbose sections.

Output the updated summary in the same structured format. No commentary.

Tip: Test your merge prompt by creating synthetic sessions with deliberate contradictions, reversals, and completions, then verify the output handles each case correctly. A 30-minute prompt test session will save hours of debugging in production agentic workflows.
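A simple harness for that test session might look like this sketch: hand-written cases with must-contain and must-not-contain checks on the merged output (the case data and helper are illustrative; in practice you would feed each case through your merge prompt and check the model's response):

```python
# Hypothetical synthetic cases exercising a reversal and a completion
SYNTHETIC_CASES = [
    {
        "name": "reversal",
        "existing": "**Decisions Made**: Use REST for the public API",
        "new_turns": "USER: Let's switch the public API to gRPC.\nASSISTANT: Agreed, switching to gRPC.",
        "must_contain": ["gRPC", "revised"],
        "must_not_contain": [],
    },
    {
        "name": "completion",
        "existing": "**Current State**: Migration script half written",
        "new_turns": "ASSISTANT: Migration script finished and tested.",
        "must_contain": ["finished"],
        "must_not_contain": ["half written"],
    },
]

def check_merge_output(updated_summary: str, case: dict) -> list[str]:
    """Return a list of failure descriptions (empty list = pass)."""
    failures = []
    lowered = updated_summary.lower()
    for needle in case["must_contain"]:
        if needle.lower() not in lowered:
            failures.append(f"{case['name']}: missing '{needle}'")
    for needle in case["must_not_contain"]:
        if needle.lower() in lowered:
            failures.append(f"{case['name']}: stale '{needle}' survived")
    return failures
```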


Delta-Based vs. Full-Rewrite Compression Strategies

There are two schools of thought on how the merge step should work:

Delta-based: The model only modifies sections that are affected by the new turns. Unchanged sections are passed through verbatim. This is more efficient (fewer tokens generated) but requires the model to make precise judgments about which sections changed.

Full-rewrite: The model rewrites the entire summary from scratch, using both the existing summary and the new turns as source material. This is more expensive but produces cleaner, more coherent summaries and avoids the accumulation of minor formatting drift over many cycles.

Recommendation by session length:
- Sessions under 4 hours / 60 turns: Full-rewrite compression. Cost is low enough and quality benefits outweigh the extra tokens.
- Sessions over 4 hours / 60 turns: Delta-based for intermediate cycles, full-rewrite every 5th cycle as a "deep clean." This balances cost and quality.
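The scheduling rule above reduces to a few lines. A sketch, with the thresholds (60 turns, every 5th cycle) exposed as parameters since the right values vary by workload:

```python
def choose_merge_strategy(cycle_index: int, total_turns: int,
                          long_session_turns: int = 60,
                          deep_clean_every: int = 5) -> str:
    """Pick a compression strategy per the recommendation above.

    cycle_index is 1-based (1 = first compression of the session).
    Short sessions always get a full rewrite; long sessions get delta
    merges with a periodic full-rewrite "deep clean".
    """
    if total_turns <= long_session_turns:
        return "full_rewrite"
    if cycle_index % deep_clean_every == 0:
        return "full_rewrite"
    return "delta"
```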

Tip: Implement both strategies and run them in parallel for a few sessions. Compare the summaries manually. You will quickly develop an intuition for which produces better output for your specific session types — this varies more by domain (engineering vs. QA vs. product) than by session length.


Handling Session Interruptions and Resumptions

Long sessions get interrupted. Engineers step away, meetings happen, context switches occur. Incremental summarization makes resumption reliable:

On pause: Trigger a compression cycle immediately, regardless of where the chunk window is. Persist the rolling summary to durable storage (file system, database, Redis). Record the exact turn count, token count, and timestamp.

On resume: Inject the rolling summary as the first context item. Follow it with a brief "resumption prompt" that orients the agent:

[SESSION RESUMED]
The session was paused at turn {turn_count} on {timestamp}.

Rolling summary as of pause:
{rolling_summary}

Please acknowledge the current session state and confirm your understanding of the current task before continuing.

Asking the agent to "acknowledge and confirm" serves two purposes: it validates that the summary was successfully parsed (the agent's confirmation gives you a sanity check), and it surfaces any ambiguities in the summary before they cause mistakes later in the session.
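A possible persistence layer for this pause/resume flow, using a local JSON checkpoint (the filename and field names are illustrative; a database or Redis record works the same way):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_checkpoint(summary: str, turn_count: int, total_input_tokens: int,
                    path: str = "session_checkpoint.json") -> None:
    """Persist the rolling summary plus bookkeeping to durable storage."""
    Path(path).write_text(json.dumps({
        "rolling_summary": summary,
        "turn_count": turn_count,
        "total_input_tokens": total_input_tokens,
        "paused_at": datetime.now(timezone.utc).isoformat(),
    }))

def build_resumption_prompt(path: str = "session_checkpoint.json") -> str:
    """Reload the checkpoint and format the resumption message above."""
    cp = json.loads(Path(path).read_text())
    return (
        "[SESSION RESUMED]\n"
        f"The session was paused at turn {cp['turn_count']} on {cp['paused_at']}.\n\n"
        f"Rolling summary as of pause:\n{cp['rolling_summary']}\n\n"
        "Please acknowledge the current session state and confirm your "
        "understanding of the current task before continuing."
    )
```

On pause, call save_checkpoint immediately after a final compression cycle; on resume, inject the string from build_resumption_prompt as the first user message.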

Tip: For sessions that resume across calendar days, add a "what changed since pause" section to your resumption prompt. If the codebase changed, a PR was merged, or a stakeholder decision was made while the session was paused, inject that delta explicitly. The rolling summary captures session state; external changes need to be added manually.


Metrics for Evaluating Incremental Summarization Quality

How do you know your incremental summarization system is working? Measure these four indicators:

1. Summary growth rate. Plot summary token count after each compression cycle. A healthy system shows a bounded, slowly growing summary (reflecting genuine session complexity). An unhealthy system shows a linearly growing summary — the compression is not actually discarding noise.

2. Decision recall rate. Periodically extract all [DECISION]-tagged items from the raw session log and check whether each appears in the current rolling summary. A loss rate above 5% indicates your merge prompt is under-preserving decisions.

3. Contradiction rate. Count how often the agent's output contradicts a decision in the rolling summary. This should be near zero after the first few compression cycles of a session. Elevated contradiction rates suggest the rolling summary structure is not being parsed reliably.

4. Resumption quality. After a pause-and-resume, ask the agent to summarize the session state in its own words. Compare its summary to the rolling summary. Significant divergence indicates the injection pattern is not working.

Tip: Instrument these metrics automatically. After each compression, run a short automated check: extract decision-like sentences from the pre-compression history and verify they appear (paraphrased is fine) in the post-compression summary using an embedding similarity check. This turns summary quality from a subjective judgment into an objective, monitorable metric.
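A lightweight version of the decision recall check can run without an embedding model at all. This sketch uses stdlib string similarity as a stand-in (for production, swap the SequenceMatcher ratio for embedding cosine similarity, and tune the threshold to your data):

```python
import re
from difflib import SequenceMatcher

def extract_decisions(raw_log: str) -> list[str]:
    """Pull [DECISION]-tagged lines out of the raw session log."""
    return [m.strip() for m in re.findall(r"\[DECISION\]\s*(.+)", raw_log)]

def decision_recall(raw_log: str, summary: str, threshold: float = 0.6) -> float:
    """Fraction of tagged decisions with a close-enough match to some
    line of the rolling summary (paraphrase-tolerant, crudely)."""
    decisions = extract_decisions(raw_log)
    if not decisions:
        return 1.0
    lines = [ln for ln in summary.splitlines() if ln.strip()]
    hits = 0
    for d in decisions:
        best = max(
            (SequenceMatcher(None, d.lower(), ln.lower()).ratio() for ln in lines),
            default=0.0,
        )
        if best >= threshold:
            hits += 1
    return hits / len(decisions)
```

Alert when recall drops below 0.95 and you have turned the "loss rate above 5%" rule from metric 2 into a monitorable signal.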


Summary

Incremental summarization transforms a long agentic session from a fragile, ever-growing history into a resilient, self-maintaining knowledge artifact. The rolling summary is always current, always compact, and always ready to anchor the next turn. The architecture — chunk window, merge prompt, pruning, continuity injection — is straightforward to implement and dramatically extends the effective length and reliability of agentic sessions. In the next topic, we apply similar principles specifically to code and diffs, which have their own compression patterns distinct from conversational context.